README file for the Annotation WG File Final Build 37 Release

I. Description of the dataset
--------------------------------
This reference dataset includes annotations for the WES Release 3 (Atlas Only) Dataset and the WGS V1 Dataset. The dataset can be used to annotate single variant association results, or to generate variant groupings by functional annotation for use in SeqMeta and similar tools.

The dataset includes the following information:

* The Ensembl VEP80 predicted consequence and associated information for all relevant Ensembl transcript models.
* A variant's 'most damaging' VEP80 predicted consequence for each gene it falls within, computed across all transcripts, across all non-NMD transcripts, and across all protein coding transcripts.
* LOF annotations from SnpEff (4.2) provided by Xueqiu Jian.
* CADDv1.2 raw and phred-normalized scores, and allele frequencies from the ExACr0.3 and Kaviar-150810-Public resources.
* CATO scores of predicted transcription factor occupancy: http://www.mauranolab.org/CATO/
* Allele presence/absence in the Wellderly Cohort: https://www.stsiweb.org/translational-research/genomic-medicine/wellderly/
* FANTOM5 Expressed Enhancers mapped to local genes via GTEx eQTL associations.

II. Dataset version number
--------------------------------
3.0 2017_1231

III. Files in the dataset
--------------------------------
WES_release3AtlasOnly_vep80_most_severe_consequence_per_gene.txt.gz - SNV VEP80 output collapsed to provide the 'most damaging' consequence for each variant-gene mapping

WES_release3AtlasOnly_rolling_flat_annotation.txt.gz - Additional annotations that can be assigned per variant (irrespective of the variant-transcript or variant-gene mapping)

WGS_v1_vep80_most_severe_consequence_per_gene.txt.gz - SNV VEP80 output collapsed to provide the 'most damaging' consequence for each variant-gene mapping

WGS_v1_rolling_flat_annotation.txt.gz - Additional annotations that can be assigned per variant (irrespective of the variant-transcript or variant-gene mapping)

IV. Contributor
--------------------------------
William S. Bush

V. Workflow description
--------------------------------
1. WGS and WES files provided by the ADSP QC Working Group were processed using VEP80 (with the --everything flag).
2. Variants affecting multiple transcripts of the same gene were further processed to generate a 'most damaging' consequence for each affected gene. This process uses the ranking table specified in the file 'ranking_table.txt' to identify the 'most damaging' consequence and to assign an impact score, which down-weights consequences for nonsense-mediated decay (NMD) transcripts and non-coding transcripts.
3. Variants are matched by chromosome, position, reference allele, and alternate allele to other external resources.
4. SeqMeta identifiers are created for each variant according to the rules outlined by the ADSP QC group (chr:pos:ref:alt).

VI. Input files
--------------------------------
Resources accessed to create these annotation files include:

1. Ensembl VEP 80 and the Ensembl Core Database (version 80)
2. CADD version 1.2 (http://cadd.gs.washington.edu/download)
3. ExAC release 0.3 (http://exac.broadinstitute.org/downloads)
4. Kaviar version 150810-Public (http://db.systemsbiology.net/kaviar)
5. CATO version 1.1 (http://www.uwencode.org/proj/Maurano_et_al_func_var), in conjunction with Matt Maurano
6. SWGR v1.0 (http://www.stsiweb.org/wellderly/)
7. FANTOM5 (http://fantom.gsc.riken.jp/5/), further mapped to genes using GTEx Associations V6 (https://www.gtexportal.org/home/datasets)

VII. File Contents
--------------------------------
1. WES_release3AtlasOnly_vep80_most_severe_consequence_per_gene.txt.gz

COLUMN  DESCRIPTION
1       Chromosome
2       BP-position (1-based) relative to the chromosome
3       Alternate allele reported in the VCF file
4       A variant identifier compatible with SeqMeta (CHR:POS:REF:ALT)
5       A variant identifier compatible with EPACTS (CHR:POS_REF/ALT)
6       The most relevant gene symbol (based on Ensembl annotation)
7       Ensembl gene identifier
8       The most severe consequence for this gene according to the Ensembl ranking
9       The Ensembl Impact assessment for this variant consequence
10      The most severe consequence for any non-NMD transcript for this gene
11      The Ensembl Impact assessment (excluding NMD transcripts)
12      The most severe consequence relative to protein coding transcripts for this gene
13      The Ensembl Impact assessment (for protein coding transcripts only)
14      Occurrence in the SnpEff Loss of Function Annotation
15      The SnpEff predicted consequence for Loss of Function Annotations
16      The number of transcripts affected by the SnpEff LOF consequence (relative to the Ensembl transcript set)
17      The percent of transcripts for this gene affected by the SnpEff LOF consequence (relative to the Ensembl transcript set)

2. WES_release3AtlasOnly_rolling_flat_annotation.txt.gz

COLUMN  DESCRIPTION
1       Chromosome
2       BP-position (1-based) relative to the chromosome
3       Alternate allele reported in the VCF file
4       A variant identifier compatible with SeqMeta (CHR:POS:REF:ALT)
5       A variant identifier compatible with EPACTS (CHR:POS_REF/ALT)
6       CADD raw score
7       CADD phred score
8       ExAC Allele Count
9       ExAC AFR Allele Count
10      ExAC AMR Allele Count
11      ExAC Adjusted Allele Count
12      ExAC EAS Allele Count
13      ExAC FIN Allele Count
14      ExAC Adjusted Heterozygote Count
15      ExAC Adjusted Homozygote Count
16      ExAC Non-Finnish European (NFE) Allele Count
17      ExAC Other (OTH) Allele Count
18      ExAC South Asian (SAS) Allele Count
19      ExAC Allele Frequency for each ALT allele
20      ExAC Total alleles in called genotypes
21      ExAC AFR chromosome count
22      ExAC AMR chromosome count
23      ExAC Adjusted chromosome count
24      ExAC EAS chromosome count
25      ExAC FIN chromosome count
26      ExAC Non-Finnish European chromosome count
27      ExAC Other chromosome count
28      ExAC South Asian chromosome count
29      ExAC BaseQRankSum - Z-score from Wilcoxon rank sum test of Alt vs. Ref base qualities
30      ExAC ClippingRankSum - Z-score from Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases
31      ExAC Approximate Read Depth
32      ExAC Phred-scaled p-value using Fisher's exact test to detect strand bias
33      ExAC GQ MEAN
34      ExAC GQ STDDEV
35      ExAC Heterozygous AFR
36      ExAC Heterozygous AMR
37      ExAC Heterozygous EAS
38      ExAC Heterozygous FIN
39      ExAC Heterozygous NFE
40      ExAC Heterozygous OTH
41      ExAC Heterozygous SAS
42      ExAC Homozygous AFR
43      ExAC Homozygous AMR
44      ExAC Homozygous EAS
45      ExAC Homozygous FIN
46      ExAC Homozygous NFE
47      ExAC Homozygous OTH
48      ExAC Homozygous SAS
49      ExAC Inbreeding Coefficient
50      ExAC MQ
51      ExAC MQ0
52      ExAC MQRankSum
53      ExAC NCC
54      ExAC NEGATIVE_TRAIN_SITE
55      ExAC QD
56      ExAC ReadPosRankSum
57      ExAC VQSLOD
58      ExAC culprit
59      ExAC DP_HIST
60      ExAC GQ_HIST
61      Kaviar Allele Frequency
62      Kaviar Allele Count
63      Kaviar Allele Total over All Data Sources
64      Kaviar Data Sources Containing Allele
65      CATO score percentile: a computational prediction of transcription factor occupancy (ranges from 0 to 1 with 0 being no activity)
66      CATO motifs affected: a list of transcription factor motifs potentially affected by a variant
67      CATO cell types affected: a list of cell types with DNase hypersensitivity regions potentially affected by a variant
68      Allele Occurrence in the Scripps Wellderly Genome Resource (SWGR)
69      Allele Presence in FANTOM5 Enhancer and the gene it putatively regulates (enhancerstart-enhancerend:EnsemblGeneID)

3. WGS_v1_vep80_most_severe_consequence_per_gene.txt.gz

COLUMN  DESCRIPTION
1       Chromosome
2       BP-position (1-based) relative to the chromosome
3       Alternate allele reported in the VCF file
4       A variant identifier compatible with SeqMeta (CHR:POS:REF:ALT)
5       A variant identifier compatible with EPACTS (CHR:POS_REF/ALT)
6       The most relevant gene symbol (based on Ensembl annotation)
7       Ensembl gene identifier
8       The most severe consequence for this gene according to the Ensembl ranking
9       The Ensembl Impact assessment for this variant consequence

4. WGS_v1_rolling_flat_annotation.txt.gz

COLUMN  DESCRIPTION
1       Chromosome
2       BP-position (1-based) relative to the chromosome
3       Alternate allele reported in the VCF file
4       A variant identifier compatible with SeqMeta (CHR:POS:REF:ALT)
5       A variant identifier compatible with EPACTS (CHR:POS_REF/ALT)
6       CADD raw score
7       CADD phred score
8       ExAC Allele Count
9       ExAC AFR Allele Count
10      ExAC AMR Allele Count
11      ExAC Adjusted Allele Count
12      ExAC EAS Allele Count
13      ExAC FIN Allele Count
14      ExAC Adjusted Heterozygote Count
15      ExAC Adjusted Homozygote Count
16      ExAC Non-Finnish European (NFE) Allele Count
17      ExAC Other (OTH) Allele Count
18      ExAC South Asian (SAS) Allele Count
19      ExAC Allele Frequency for each ALT allele
20      ExAC Total alleles in called genotypes
21      ExAC AFR chromosome count
22      ExAC AMR chromosome count
23      ExAC Adjusted chromosome count
24      ExAC EAS chromosome count
25      ExAC FIN chromosome count
26      ExAC Non-Finnish European chromosome count
27      ExAC Other chromosome count
28      ExAC South Asian chromosome count
29      ExAC BaseQRankSum - Z-score from Wilcoxon rank sum test of Alt vs. Ref base qualities
30      ExAC ClippingRankSum - Z-score from Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases
31      ExAC Approximate Read Depth
32      ExAC Phred-scaled p-value using Fisher's exact test to detect strand bias
33      ExAC GQ MEAN
34      ExAC GQ STDDEV
35      ExAC Heterozygous AFR
36      ExAC Heterozygous AMR
37      ExAC Heterozygous EAS
38      ExAC Heterozygous FIN
39      ExAC Heterozygous NFE
40      ExAC Heterozygous OTH
41      ExAC Heterozygous SAS
42      ExAC Homozygous AFR
43      ExAC Homozygous AMR
44      ExAC Homozygous EAS
45      ExAC Homozygous FIN
46      ExAC Homozygous NFE
47      ExAC Homozygous OTH
48      ExAC Homozygous SAS
49      ExAC Inbreeding Coefficient
50      ExAC MQ
51      ExAC MQ0
52      ExAC MQRankSum
53      ExAC NCC
54      ExAC NEGATIVE_TRAIN_SITE
55      ExAC QD
56      ExAC ReadPosRankSum
57      ExAC VQSLOD
58      ExAC culprit
59      ExAC DP_HIST
60      ExAC GQ_HIST
61      Kaviar Allele Frequency
62      Kaviar Allele Count
63      Kaviar Allele Total over All Data Sources
64      Kaviar Data Sources Containing Allele
65      CATO score percentile: a computational prediction of transcription factor occupancy (ranges from 0 to 1 with 0 being no activity)
66      CATO motifs affected: a list of transcription factor motifs potentially affected by a variant
67      CATO cell types affected: a list of cell types with DNase hypersensitivity regions potentially affected by a variant
68      Allele Occurrence in the Scripps Wellderly Genome Resource (SWGR)
69      Allele Presence in FANTOM5 Enhancer and the gene it putatively regulates (enhancerstart-enhancerend:EnsemblGeneID)

5. ranking_table.txt

COLUMN  DESCRIPTION
1       VEP consequence prediction label (can be a composite of multiple consequences)
2       Original VEP ranking position
3       Modified ranking position
4       Original VEP consequence impact (HIGH, MODERATE, LOW, MODIFIER)
5       Modified consequence impact (HIGH, MODERATE, LOW, MODIFIER)
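
Illustrative example (not part of the released pipeline)
--------------------------------
The 'most damaging' collapse in workflow step 2 and the two identifier formats (CHR:POS:REF:ALT and CHR:POS_REF/ALT) can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the code used to build the dataset. It assumes that ranking_table.txt is tab-delimited with no header row, and that a lower modified ranking position means a more damaging consequence. The function names, the toy ranking values, and the gene identifiers ENSG_TOY_1 and ENSG_TOY_2 are hypothetical.

```python
import csv
import io


def seqmeta_id(chrom, pos, ref, alt):
    """Build a SeqMeta-style identifier: CHR:POS:REF:ALT."""
    return f"{chrom}:{pos}:{ref}:{alt}"


def epacts_id(chrom, pos, ref, alt):
    """Build an EPACTS-style identifier: CHR:POS_REF/ALT."""
    return f"{chrom}:{pos}_{ref}/{alt}"


def load_ranking(handle):
    """Map each consequence label to (modified rank, modified impact).

    Assumption: tab-delimited rows, no header, in the column order
    documented for ranking_table.txt.
    """
    ranking = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        label, _orig_rank, mod_rank, _orig_impact, mod_impact = row
        ranking[label] = (int(mod_rank), mod_impact)
    return ranking


def most_damaging_per_gene(rows, ranking):
    """Collapse per-transcript rows to one consequence per variant-gene pair.

    rows: iterable of (variant_id, gene_id, consequence_label).
    Assumption: the lowest modified rank is the most damaging.
    """
    best = {}
    for variant_id, gene_id, label in rows:
        rank, impact = ranking[label]
        key = (variant_id, gene_id)
        if key not in best or rank < best[key][0]:
            best[key] = (rank, label, impact)
    return {key: (label, impact) for key, (_rank, label, impact) in best.items()}


# Toy ranking table; the rank values here are made up for illustration.
toy_table = (
    "stop_gained\t4\t4\tHIGH\tHIGH\n"
    "missense_variant\t11\t11\tMODERATE\tMODERATE\n"
    "intron_variant\t20\t20\tMODIFIER\tMODIFIER\n"
)
ranking = load_ranking(io.StringIO(toy_table))

vid = seqmeta_id("1", 12345, "A", "G")
rows = [
    (vid, "ENSG_TOY_1", "intron_variant"),
    (vid, "ENSG_TOY_1", "missense_variant"),
    (vid, "ENSG_TOY_2", "intron_variant"),
]
result = most_damaging_per_gene(rows, ranking)
for (variant, gene), (label, impact) in sorted(result.items()):
    print(variant, gene, label, impact)
```

To restrict the collapse to non-NMD or protein coding transcripts, as in columns 10 to 13 of the WES per-gene file, one option would be to filter the input rows by transcript biotype before calling most_damaging_per_gene; the biotype field is not shown in this sketch.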