Genomics Facility Projects LIMS User Guide

Data Analysis

Data analysis services may be added to a project by choosing the deliverable “Custom Bioinformatics” under “Project Input/Output Details”. Alternatively, to start a project that consists only of analysis of pre-existing data, choose a project of type “Data_Analysis”. If applicable, please provide a link to the appropriate reference genome in the “Project Species Details” window.

Whole Genome Assembly

If your study species does not have a suitable reference, we can perform a rudimentary whole genome assembly based on paired-end sequencing data from ~400-500 bp libraries. The resulting rough assembly will have tens of thousands of contigs, but will be adequate for calling SNPs from WGS skim sequencing or pre-existing raw GBS sequence data. Our assembly pipeline uses FLASH to estimate fragment (insert) length, k-mer analysis via Jellyfish to estimate genome size, Trimmomatic to trim adapter and low-quality sequences, SOAPdenovo2 to perform the assembly, and QUAST to compute assembly metrics. The deliverables include a fasta file containing the rudimentary assembly and a report that provides the exact commands used and summarizes the results from each step.
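The k-mer step above can be illustrated with a minimal Python sketch. It assumes a depth histogram in the two-column "depth count" text layout that `jellyfish histo` writes, and it uses the common rule of thumb that genome size is roughly the total number of k-mers divided by the peak k-mer depth. The function names, the `min_depth` error cutoff, and the toy numbers are hypothetical and chosen for illustration; this is not necessarily how the facility pipeline computes its estimate.

```python
def parse_histogram(lines):
    """Parse 'depth count' lines into a dict {depth: number of distinct k-mers}."""
    histogram = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        histogram[int(fields[0])] = int(fields[1])
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Return (estimated genome size, peak depth).

    min_depth is an assumed cutoff: k-mers seen fewer times than this are
    treated as sequencing errors and ignored.
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # The most frequent depth among solid k-mers approximates k-mer coverage.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences divided by coverage approximates genome size.
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram with made-up numbers, for illustration only.
toy_lines = [
    "1 900000", "2 50000", "18 20000", "19 45000",
    "20 60000", "21 44000", "22 19000", "40 3000",
]
toy_histogram = parse_histogram(toy_lines)
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth {peak}, estimated genome size {size:.0f} bp")
```

The estimate is sensitive to the error cutoff and to heterozygosity or repeat content, so it is best read as an order-of-magnitude check rather than a precise figure.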

SNP Calling from WGS or Skim WGS Data

For calling SNPs from whole genome sequencing data, we use a GATK-based pipeline: Picard marks Illumina adapters and duplicates, BWA-MEM aligns reads to the reference genome, and GATK realigns indels, identifies local haplotypes and calls SNPs. The resulting genotypes are then filtered with VCFTools. The degree of missing data and heterozygote under-calling will depend on the genome size, sequencing coverage and inbreeding level. Imputation with Beagle 4.1 (available on request) is often successful for outbreeding populations. The optimal scenario for genotyping from low-coverage (<1x) skim sequencing is to have a panel of representative haplotypes available from higher-depth sequencing, especially if these are the parents or founders of your study population.
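To give a feel for how coverage drives missing data and heterozygote under-calling, here is a back-of-the-envelope sketch in Python. It assumes reads land uniformly at random (so per-site depth is Poisson distributed) and that both alleles of a heterozygote are sampled with equal probability and without error. Real data deviate from both assumptions, and the function names and example numbers are hypothetical.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Average sequencing depth per site."""
    return n_reads * read_length / genome_size


def expected_missing_fraction(coverage):
    """Fraction of sites with zero reads under a Poisson depth model."""
    return math.exp(-coverage)


def het_undercall_probability(depth):
    """Chance that all `depth` reads at a heterozygous site show one allele.

    Each read samples either allele with probability 0.5, so the site
    looks homozygous with probability 2 * 0.5**depth.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return 2 * 0.5 ** depth


# Hypothetical sample: 2 million 150 bp reads on a 600 Mb genome.
cov = mean_coverage(2_000_000, 150, 600_000_000)
print(f"mean coverage: {cov:.2f}x")
print(f"expected fraction of sites with no reads: {expected_missing_fraction(cov):.2f}")
for depth in (1, 2, 4, 8):
    print(f"depth {depth}: P(heterozygote looks homozygous) = "
          f"{het_undercall_probability(depth):.3f}")
```

Under these assumptions a single read can never reveal a heterozygote, which is why reference haplotype panels and imputation matter so much at skim depths.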

The deliverables include the unfiltered and filtered genotypes, and a report containing the exact commands used for each step.

Quantitative Expression Analysis of 3’RNA-seq Data

Our quantitative expression analysis pipeline for 3’RNA-seq data trims adapter and low-quality sequences with Trimmomatic, aligns reads to the reference genome (fasta) with the STAR aligner, and uses count.py from HTSeq to obtain raw expression counts for each gene in a GTF file. The raw counts are then transformed into normalized counts using DESeq2. Deliverables include the raw and normalized counts and the exact commands used. Additional custom differential expression analyses and data visualizations via DESeq2 are available upon request.
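The normalization step can be sketched in plain Python. DESeq2 is an R package, and its default size factors are based on a "median of ratios" approach; the code below is a simplified re-implementation of that idea for illustration, not the DESeq2 code itself and not the facility's exact command. The function names and the toy counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts, one per gene,
    with genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Only genes with a nonzero count in every sample have a usable
    # geometric mean across samples.
    usable = [g for g in range(n_genes)
              if all(counts[s][g] > 0 for s in samples)]
    if not usable:
        raise ValueError("no gene has nonzero counts in every sample")
    log_geo_mean = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
        for g in usable
    }
    # Size factor = median ratio of a sample's counts to the per-gene
    # geometric mean, computed on the log scale.
    return {
        s: math.exp(median(math.log(counts[s][g]) - log_geo_mean[g]
                           for g in usable))
        for s in samples
    }


def normalize(counts):
    """Divide each sample's raw counts by its size factor."""
    factors = size_factors(counts)
    return {s: [c / factors[s] for c in counts[s]] for s in counts}


# Toy data: sample B was sequenced about twice as deeply as sample A.
raw = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]}
factors = size_factors(raw)
normalized = normalize(raw)
print(factors)
print(normalized)
```

After normalization, genes with the same underlying expression in both toy samples end up with matching values, which is the property that makes counts comparable across libraries of different depth.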

SNP Calling from 3’RNA-seq Data

This analysis uses the same pipeline as SNP calling from WGS data, except that the “intron-aware” STAR aligner is used in place of BWA-MEM. The resulting SNPs are filtered for depth and for presence across most of the samples, so that they derive from near-universally expressed genes. The deliverables include the unfiltered and filtered genotypes, and a report containing the exact commands used for each step.

SNP Calling from GBS Data

Data analysis services may be requested for GBS sequence data that you already have on hand by starting a new project of type “Data_Analysis”. Please enter a brief description of your project (e.g., “GBS SNP calling in an association mapping panel”) in the “Project Description” box, and a link to the appropriate reference genome in the “Project Species Details” window. If you enter a reference sequence from a related species, that species should be no more than ~5% diverged from the species we will be analyzing. If a suitable reference genome is not available, enter “none” in the “Link to Reference Genome” field. Then enter the number of (96-well) plates' worth of samples and choose “Custom Bioinformatics” under “Project Input/Output Details”.

Two different public pipelines are used to call GBS SNPs:

(1) TASSEL3-GBS (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090346), which uses a reference sequence to identify single-copy loci for SNP calling. SNPs are then called among members of the population submitted for genotyping. Because SNPs are not called against the reference sequence, the major alleles present in your population may or may not match the corresponding reference sequence alleles.

(2) UNEAK (http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003215), a network-based SNP calling pipeline that does not require a reference genome sequence. It is conservative in that it looks for pairs of 64-base tag sequences that differ by a single base (the SNP). If your DNA samples are significantly contaminated with non-target sequences (e.g., from microbes, nematodes, pathogens, endophytes, etc.), the UNEAK pipeline will call SNPs from those species as well, without discriminating. Hence, we recommend using the reference-based GBS pipeline wherever possible.
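The core idea of pairing tags that differ by a single base can be sketched in a few lines of Python. This is only an illustration of the pairing step: the actual UNEAK pipeline also builds tag networks and applies further filtering, none of which is shown here. The function name is hypothetical, and the toy tags are 8 bases long rather than 64 to keep the example readable.

```python
from collections import defaultdict


def single_mismatch_pairs(tags):
    """Find pairs of equal-length tags that differ at exactly one position.

    Returns a sorted list of (tag_a, tag_b, position) tuples.
    """
    buckets = defaultdict(list)
    for tag in set(tags):  # duplicates would otherwise pair with themselves
        for i in range(len(tag)):
            # Mask one position; distinct tags sharing a masked key
            # can only differ at that position.
            masked = tag[:i] + "*" + tag[i + 1:]
            buckets[(i, masked)].append(tag)

    pairs = set()
    for (position, _), group in buckets.items():
        for a in range(len(group)):
            for b in range(a + 1, len(group)):
                first, second = sorted((group[a], group[b]))
                pairs.add((first, second, position))
    return sorted(pairs)


# Toy tags (8 bases for readability; GBS tags in the text are 64 bases).
toy_tags = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "TTTTTTTT"]
candidate_snps = single_mismatch_pairs(toy_tags)
for first, second, position in candidate_snps:
    print(first, second, "differ at index", position)
```

Masking each position in turn avoids comparing every tag against every other tag, which matters when a run produces millions of distinct tags. Note that nothing in this step can tell target from contaminant DNA, which is consistent with the caution above about contaminated samples.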

GBS Genotype Deliverables

The deliverables consist of an analysis report (pdf); a keyfile that relates barcodes to sample IDs; SNP calls per DNA sample (unfiltered and filtered SNPs are provided in vcf format); a SAM file (a list of sequence tags and their alignment coordinates) if a reference sequence was used, or a TagsOnPhysicalMap (TOPM) file (containing the tag pairs and their fake genome coordinates) if the non-reference UNEAK pipeline was used; and text files with heterozygosity estimates and sample-level DNA QC data (number of sequences obtained per DNA sample). The analysis report contains detailed explanations of the files included and the analysis procedure, along with sequence alignment (if applicable), coverage and missingness statistics, and an MDS plot of genotypes.

GBS Genotype Data Filtering

We provide both unfiltered and filtered SNPs (using thresholds of minor allele frequency (MAF) > 1% and site missingness < 20%). This basic filtering eliminates many sequencing errors and low-coverage loci. However, these filtering parameters are not appropriate for all populations, and they do not eliminate all error-prone SNPs or SNPs called from alignment of paralogs (i.e., loci with very high heterozygosities). Therefore, you will need to re-filter SNPs using TASSEL v5.0 (http://www.maizegenetics.net/tassel), VCFTools (https://vcftools.github.io/index.html) or some other means, using criteria that are most biologically relevant for the population genotyped (i.e., use a MAF appropriate for your particular mapping population and eliminate overly heterozygous loci in diploid species).
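For readers who want to see what these thresholds mean per site, here is a minimal Python sketch. It assumes diploid, biallelic genotypes encoded as the number of alternate alleles (0, 1 or 2), with `None` for a missing call. The function names and the optional `max_het` cutoff are hypothetical; in practice the same filtering would normally be done with TASSEL or VCFTools on the vcf files.

```python
def site_stats(genotypes):
    """Return (missingness, minor allele frequency, heterozygosity) for one site.

    genotypes: list with one entry per sample: 0, 1 or 2 alternate
    alleles, or None when the sample has no call.
    """
    if not genotypes:
        raise ValueError("site has no samples")
    called = [g for g in genotypes if g is not None]
    missingness = 1 - len(called) / len(genotypes)
    if not called:
        return missingness, 0.0, 0.0
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    heterozygosity = sum(1 for g in called if g == 1) / len(called)
    return missingness, maf, heterozygosity


def passes_filter(genotypes, min_maf=0.01, max_missing=0.20, max_het=None):
    """Keep a site if MAF > min_maf and missingness < max_missing.

    max_het is an optional, population-specific cutoff for flagging
    likely paralogous loci; it is off by default.
    """
    missingness, maf, heterozygosity = site_stats(genotypes)
    if maf <= min_maf or missingness >= max_missing:
        return False
    if max_het is not None and heterozygosity > max_het:
        return False
    return True


# Toy sites, ten samples each (made-up genotypes).
good_site = [0, 0, 1, 2, None, 0, 0, 0, 0, 0]
sparse_site = [0, None, None, None, 1, 0, 0, 0, 0, 0]
all_het_site = [1] * 10

print(passes_filter(good_site))
print(passes_filter(sparse_site))
print(passes_filter(all_het_site))
print(passes_filter(all_het_site, max_het=0.5))
```

The all-heterozygous toy site passes the default MAF and missingness thresholds and is only rejected once a heterozygosity cutoff is supplied, which illustrates why the default filter does not remove paralog-derived SNPs.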

Version 3 Updated 2017-05-03