Review and Proposed Actions for False-Positive Association Results in ADSP Case-Control Data


November 4, 2016

The problem

Some SNVs in the publicly released whole exome sequence (WES) QCed “consensus-called” data (which systematically integrated genotype calls from two pipelines: Atlas at Baylor College of Medicine and GATK at the Broad Institute) may have biased genotype calls resulting from sequence data generated/processed at the Broad Institute. This issue was identified by follow-up on likely “false-positive” genetic associations with genome-wide statistical significance in case-control analysis. It is not yet clear if this issue also affects WGS data.

Some of the affected variants are in regions with known sequence homology issues, and an effect on genotype calling for variants in such regions is not unexpected. However, differential effects between cases and controls leading to strong but spurious associations were not anticipated, given that cases and controls were sequenced in roughly equal numbers at each sequencing center.

These spurious associations were identified through two channels:

  1. Sanger sequencing was performed to validate heterozygous calls among positively associated variants. It showed that heterozygous genotype calls from exome sequencing could not be reliably confirmed by Sanger sequencing when “ABHet”, the average proportion of reference allele reads out of all reads, was >0.7 (i.e., more than 70% of reads carry the reference allele).
  2. Association analysis of case-control genotypes with Alzheimer’s disease, with covariate adjustment for the time period of sample sequence processing at the Broad Institute, appeared to eliminate most potentially spurious associations.

Of the limited number of variants followed up, most of those identified as likely spurious had high ABHet ratios and were called only by the GATK pipeline in the QCed dataset. This issue did not appear in Atlas calls in the QCed dataset because QC of the Atlas pipeline genotypes excluded variants/genotypes with ABHet ratios > 0.75 (GATK QC did not) and because the Atlas and GATK pipelines used different protocols for BAM file processing (more details provided below).
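As a rough illustration of the ABHet check described above, the following Python sketch computes the proportion of reference allele reads at heterozygous calls and flags a variant when that proportion exceeds a threshold. It is not the ADSP or GATK implementation. The `HetCall` record, the function names, and the choice to pool read counts across all heterozygous-called samples are assumptions made for this example; the default threshold of 0.7 comes from the Sanger validation result above, and 0.75 could be passed instead to mimic the Atlas QC cutoff.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HetCall:
    """Read counts for one sample called heterozygous at a variant (hypothetical record)."""
    sample_id: str
    ref_reads: int
    alt_reads: int


def abhet(calls: List[HetCall]) -> Optional[float]:
    """Proportion of reference-allele reads among all reads at heterozygous calls.

    Assumption: reads are pooled across heterozygous-called samples.
    Returns None when there are no reads to evaluate.
    """
    ref = sum(c.ref_reads for c in calls)
    total = sum(c.ref_reads + c.alt_reads for c in calls)
    if total == 0:
        return None
    return ref / total


def is_suspect(calls: List[HetCall], threshold: float = 0.7) -> bool:
    """Flag a variant whose heterozygous calls are dominated by reference reads."""
    value = abhet(calls)
    return value is not None and value > threshold


# Made-up read counts, for illustration only.
skewed = [HetCall("s1", 18, 2), HetCall("s2", 16, 4)]
balanced = [HetCall("s3", 10, 10), HetCall("s4", 11, 9)]
print(is_suspect(skewed), is_suspect(balanced))
```

A true heterozygote is expected to show roughly equal numbers of reference and alternate reads, so a variant flagged this way is a candidate for Sanger validation rather than a confirmed artifact.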

Possible causes of the problem

While the cause of the problem is not certain, several potential causes during the production of GATK calls at the Broad Institute have been identified:

  1. At the Broad Institute, BAMs for samples sequenced at the Broad were not recalibrated in the same manner as BAMs for Baylor- and WashU-sequenced samples prior to genotype calling with GATK.
  2. Multiple versions of GATK and PICARD were used in the pipeline processing for samples sequenced at different times at the Broad.
  3. Definitions of known insertion-deletion polymorphisms (“indels”) were changed over the course of sequence processing/genotype calling at the Broad Institute, which could have influenced SNV calling because SNVs called in the vicinity of indels are subject to additional filtering. The Baylor/WashU BAMs were processed using a single set of the then-most-current indel definitions.
  4. Different capture protocols were used at the different sequencing centers, and these rather than pipeline differences could result in biases. (Note: all variants included in the “consensus-called” dataset were in overlapping target regions of all capture protocols.)

None of these differences has been officially confirmed as a cause; however, all are still under investigation as potential sources of the observed bias.

It is important to note that the Atlas calls produced at Baylor were not affected by items 1-3 above because ADSP WES data were independently and uniformly reprocessed as part of the Atlas pipeline.

Proposed solutions: short term

Over the next month, ADSP will release the following data to help investigators address the potential bias:

  • An Atlas-only call set with project-level QC steps implemented up to the point at which Atlas and GATK calls were merged via “consensus calling”.
  • A dataset with indicators for versions of software and other protocol changes used in data processing.

While investigators should consider what is best for their needs, the ADSP plans to implement the following measures in its analyses:

  • Perform primary analysis with quality-controlled Atlas-only calls.
  • To avoid potential false negatives from use of only one calling pipeline, perform a sensitivity re-analysis of the current Atlas/GATK consensus callset, adjusting for sequencing center and for protocol changes over time.
  • Compare these analyses to the Atlas-only association results. This latter step is important because apparently validated genotypes from the consensus call set that were only called by GATK have surfaced as positive association results.
  • For all associated sites, check ABHet ratios for the variant and run BLAT searches using variant-flanking sequence as the query (to determine whether multiple regions of the genome share high homology), in order to highlight variants with potential false-positive associations.
  • Continue to validate selected associated variants/sites by Sanger sequencing in sequenced samples called as heterozygotes. This will be done for variants included in both single-variant and gene-based analysis.
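The effect of adjusting for processing period can be sketched with a small Python example. This is only an illustration of the idea, not the ADSP analysis: the notice describes covariate adjustment in an association model, whereas this sketch substitutes a Cochran-Mantel-Haenszel (CMH) stratified test (without continuity correction) as a simpler stand-in. The function names, the two processing batches, and all counts are invented for the example, and the case/control imbalance between batches is a deliberately exaggerated hypothetical.

```python
import math
from typing import List, Tuple

# One 2x2 table per stratum:
# (case carriers, case non-carriers, control carriers, control non-carriers)
Table = Tuple[int, int, int, int]


def cmh_test(strata: List[Table]) -> Tuple[float, float]:
    """CMH chi-square (1 df, no continuity correction) and its p-value."""
    diff_sum = 0.0
    var_sum = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # stratum too small to contribute
        expected = (a + b) * (a + c) / n
        variance = (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
        diff_sum += a - expected
        var_sum += variance
    if var_sum == 0:
        return 0.0, 1.0
    chi2 = diff_sum ** 2 / var_sum
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def pool(strata: List[Table]) -> List[Table]:
    """Collapse all strata into one table, i.e. ignore the batch variable."""
    return [tuple(sum(col) for col in zip(*strata))]


# Hypothetical counts: batch A over-calls carriers (20% in cases and controls)
# and contains mostly cases; batch B calls 2% carriers and contains mostly controls.
batches = [(60, 240, 20, 80), (2, 98, 6, 294)]

print("unadjusted:", cmh_test(pool(batches)))
print("adjusted for batch:", cmh_test(batches))
```

In this constructed example the carrier frequency is identical in cases and controls within each batch, so the stratified statistic is zero, while the pooled analysis yields a small p-value. An association that disappears under this kind of adjustment is a candidate for the ABHet, BLAT, and Sanger checks listed above.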

Proposed solution: long term

ADSP will remap all data to genome build hg38 and call variants by a consistent standardized protocol. The updated data will be released in 2017.
