Variant Calling at the GDC

The GDC currently applies multiple somatic variant calling algorithms (or “pipelines”) to the DNA-seq data it accepts. Rather than choosing a “best call” from among these, the GDC provides users with the mutations output by all pipelines. An explanation of the rationale for this decision, with some brief background on the state of the art in somatic mutation calling, will help users in their interpretation of GDC variant calls.

GDC Objectives

A primary goal of the GDC is to provide the scientific community with high quality, standardized and well-documented sequence data for every project and dataset that it hosts. The objectives supporting this goal can be summarized in the following list of desirable factors:

Quality - Results are internally consistent, replicate known results where relevant, and meet well-defined and well-accepted community quality standards where such standards are available.
Currency - Pipelines as implemented reflect the latest recommendations of the computational genomics community of experts.
Comparability - Pipelines can be executed over all submitted sequence data for a given data type, and yield data that can be directly compared across all underlying sources of data.
Transparency - Pipelines are documented in sufficient detail to afford scientific critique and reproducibility.
Computability - Pipelines can be executed without extensive human intervention, to enable high volume, automated throughput.
Interoperability - Pipeline results are in formats that are easily used as inputs to current genomics software tools.

There are two corollaries to this list. (1) The GDC strives to track current publicly available bioinformatics technologies to produce high-quality results, but not authoritative results. The GDC pipelines are standardized, but the GDC does not purport that its variant calls are “standard”. (2) Algorithms implemented in GDC are not generally novel, but are those that have found wide support within the genomics expert community, generally through repeated validation of many variant calls by independent, laboratory-based means in published studies.

One consequence of this philosophy is that if the community really is at equipoise regarding the accuracy and precision of two or more prevailing methods, the GDC can choose to run more than one of these methods (subject to resource constraints) and provide users with the output of each method as separate sets of results. In this way, scientific decisions are left in the hands of the user. This also provides the results of multiple pipelines run on the same primary data, which can be used to compare the methods in detail to assist in evaluating the differences among multiple widely used methods.

Overview of Somatic Variant Calling

Tumor genomic variant calling is currently a highly active area of research. Diverse computational and statistical approaches are being explored to increase the sensitivity and specificity of variant identification at all genomic scales, from single nucleotide variants to chromosome structural variants. A high-level overview of variant calling can give a sense of the complexity of the problem and the reasons why a single “best caller” might not yet exist.

Sources of Error in a Calling Pipeline

A typical mutation calling pipeline will employ software programs, modules or functions that perform the following broad steps [3]:

Raw read alignment to a genome reference,
Pre-calling alignment recalibration (“co-cleaning”),
Raw variant calling,
Post-call quality assignment, and
Post-call variant filtering.

Each step can introduce errors that combine to affect the ultimate likelihood that a given call reflects the presence of a somatic mutation in the sample, or that the absence of a call reflects the absence of any mutation.

The latest reference alignment and mutation calling algorithms are generally based on an underlying probabilistic error model incorporating some or all of these steps (e.g., [4]). Introduced errors can decrease both sensitivity (by increasing false negatives) and specificity (by increasing false positive calls). The algorithms use models in an attempt to measure and/or control these errors, and on the basis of these measurements assign quality scores to each variant call. Variant quality scores are used in post-processing to decide whether (or at what level of confidence) a variant is reported as final data.

Limits on the power to detect the presence of a given variant in a given sample and cohort put an effective cap on the sensitivity of any algorithm applied to a given dataset. Power to detect a given variant increases with sequencing depth (within samples) and with the true frequency of the variant in the cohort being sequenced (the “variant allelic fraction” among samples) [5]. These two variables account for a large portion of the time and cost of a study, and trade off against the desired focus of a study in its initial design. For example, identification of low frequency cancer driver mutations is the goal of NCI's Cancer Driver Discovery Program (CDDP), an example of an attempt to improve sensitivity at the level of study design.
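As a rough illustration of how depth and allelic fraction bound sensitivity, the sketch below computes the probability of observing at least a minimum number of variant-supporting reads at a site. It assumes a simple binomial model in which each read independently carries the variant with probability equal to the allelic fraction. The function name detection_power and the threshold of three supporting reads are hypothetical choices made for illustration, not parameters of any GDC pipeline, and real callers also model sequencing error and mapping quality.

```python
from math import comb


def detection_power(depth: int, allelic_fraction: float, min_alt_reads: int = 3) -> float:
    """Probability of seeing at least `min_alt_reads` variant reads at a site.

    Assumption: each of `depth` reads independently carries the variant with
    probability `allelic_fraction` (binomial model, sequencing error ignored).
    """
    # Probability of observing fewer than the required number of variant reads.
    p_fewer = sum(
        comb(depth, k) * allelic_fraction**k * (1 - allelic_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


# Power at a fixed 5% allelic fraction for increasing sequencing depth.
for depth in (30, 100, 300):
    print(depth, round(detection_power(depth, 0.05), 3))
```

Under this toy model, power rises with either depth or allelic fraction, which is why both act as design-level limits on what any caller can recover.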

Specificity can be strongly influenced by a number of factors besides depth and frequency. These include undesirable biochemical reactions during nucleic acid preparation (e.g., oxidation of guanine during shearing [2]), genomic context (e.g., insertions/deletions in low complexity contexts [6]), and the presence of multiple reference targets of high similarity [7]. These processes have variable influence on false positive rates, depending on the variant type (e.g., Single Nucleotide Variant [SNV], Insertion and Deletion [INDEL], chromosomal or transcriptional breakpoint variation). Taking into account sequencing depth, observed allelic frequency, and probabilistic models of these processes, various algorithms have been developed to place quantitative bounds on the likelihood of a given called variant being truly present in a sample.

In cancer genomics, the detection of somatic, tumor-specific mutations is also complicated by the fact that many collected samples are not entirely tumor, but contain some fraction of adjacent normal cells. Models of the effect of the presence of normal genome contamination in samples on variant calling have been incorporated into some tools [8], and methods of normal fraction estimation based on sequence data alone have also been devised [9].
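To make the effect of normal-cell admixture concrete, the sketch below computes the variant allele fraction one would expect to observe for a clonal somatic mutation at a given tumor purity. This is a simplified textbook-style model offered for illustration only; it is not the model used by the tools cited in [8] or [9], and the function name expected_vaf and its parameters are hypothetical. It assumes the mutation is present in every tumor cell and absent from normal cells.

```python
def expected_vaf(
    purity: float,
    mutant_copies: int = 1,
    tumor_copy_number: int = 2,
    normal_copy_number: int = 2,
) -> float:
    """Expected variant allele fraction for a clonal somatic mutation.

    Assumptions: `purity` is the fraction of tumor cells in the sample, the
    mutation sits on `mutant_copies` of the `tumor_copy_number` copies in
    every tumor cell, and normal cells carry no mutant copies.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    # Total copies of the locus contributed by tumor and normal cells.
    total_copies = purity * tumor_copy_number + (1.0 - purity) * normal_copy_number
    return purity * mutant_copies / total_copies


# A heterozygous mutation in a diploid region at decreasing purity.
for purity in (1.0, 0.6, 0.2):
    print(purity, expected_vaf(purity))
```

In this model a heterozygous mutation in a diploid region appears at half the purity, so a low-purity sample pushes true somatic variants toward the allelic fractions at which sequencing artifacts also occur.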

Post-call variant filtering is a key final step in current pipelines. Variant quality values are estimates of the likelihood that a called variant corresponds to a true variant given the model and its parameters. To the extent that biological, laboratory, and computational processes depart from model assumptions, algorithms can overestimate the quality of their variant calls [10]. If the discrepancy can be empirically measured (using validation datasets), quality cutoffs can be used as parameters in filtering to provide an estimate of false detection rate or other confidence measure.
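The relationship between a quality cutoff and an estimated false detection rate can be sketched as follows. The example assumes that quality values are Phred-scaled (as the QUAL field is in the VCF format) and, importantly, that they are well calibrated, which the paragraph above notes may not hold in practice. The call records, the "qual" key, and the function names are hypothetical and chosen only for illustration.

```python
def phred_to_error_prob(quality: float) -> float:
    """Convert a Phred-scaled quality Q to an error probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def filter_calls(calls: list[dict], min_quality: float) -> list[dict]:
    """Keep only calls whose quality meets the cutoff."""
    return [call for call in calls if call["qual"] >= min_quality]


def estimated_false_detection_rate(calls: list[dict]) -> float:
    """Mean error probability of the calls, assuming calibrated qualities."""
    if not calls:
        return 0.0
    return sum(phred_to_error_prob(call["qual"]) for call in calls) / len(calls)


# Hypothetical calls; positions and qualities are made up for illustration.
calls = [
    {"pos": 101, "qual": 10.0},
    {"pos": 202, "qual": 20.0},
    {"pos": 303, "qual": 40.0},
]
passing = filter_calls(calls, min_quality=20.0)
print(len(passing), estimated_false_detection_rate(passing))
```

If validation data show that the observed error rate among passing calls exceeds this estimate, the model is overestimating quality, and the cutoff can be raised until the empirical rate is acceptable.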

Beyond exclusion of variants that do not meet a given quality, current pipelines incorporate empirical lists, compiled by manual curation, of frequently observed artifactual variants. Sometimes called “blacklists” [1], these sets of artifacts are often developed by sequencing and analysis teams at the centers themselves, and reflect idiosyncrasies of the local wet- and dry-lab standard operating procedures. Samples from particular tumor types have been observed to produce recurring artifacts [5], so that filters can even be disease- or project-specific to some extent. The caveat with respect to such filters is that they reflect only current understanding and algorithmic sophistication; there is always a non-zero probability that an excluded called variant does in fact reflect a true mutation.

Validation of Variant Calling Pipelines

The field is just beginning to develop deep and numerous enough sequence datasets to be able to rigorously validate pipelines and estimate their true sensitivity and specificity. Comparisons [1] and “bake-offs” [11] have been and are being performed to identify the best algorithms and stimulate improvements. The results of these challenges suggest that different algorithms are better at calling different types of variants, and that a single best pipeline is not yet available. Even the methods of validating variant calls suffer from biases (e.g., most callers perform better on simulated sequence data than on real genomic validation data); this is itself an area of ongoing work.

To “bootstrap” caller improvement in the absence of extensive gold standard data, methods of combining the outputs of different pipelines into a set of “consensus calls” have been developed and used. These can range from a simple majority rule among three or more algorithms, to more complex probabilistic calling protocols that take into account the performance of callers on different types of variants and data [12]. These methods, as usual, are susceptible to departures from assumptions.
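The simplest of these approaches, majority rule, can be sketched in a few lines. The example below is a minimal illustration and not a GDC method (the GDC publishes each pipeline's output separately). The caller names, the variant tuples, and the function name majority_consensus are hypothetical; variants are assumed to be comparable as (chromosome, position, reference allele, alternate allele) tuples.

```python
from collections import Counter

# A variant is identified by (chromosome, position, ref allele, alt allele).
Variant = tuple[str, int, str, str]


def majority_consensus(call_sets: dict[str, set[Variant]]) -> set[Variant]:
    """Return variants reported by a strict majority of callers."""
    n_callers = len(call_sets)
    if n_callers < 3:
        raise ValueError("majority rule needs at least three callers")
    # Count how many callers reported each variant.
    votes = Counter(variant for calls in call_sets.values() for variant in calls)
    return {variant for variant, count in votes.items() if count > n_callers / 2}


# Hypothetical outputs from three callers on the same tumor/normal pair.
call_sets = {
    "caller_a": {("chr1", 100, "A", "T"), ("chr2", 5, "G", "C")},
    "caller_b": {("chr1", 100, "A", "T")},
    "caller_c": {("chr1", 100, "A", "T"), ("chr2", 5, "G", "C"), ("chr3", 9, "C", "G")},
}
print(sorted(majority_consensus(call_sets)))
```

A rule like this treats every caller as equally reliable for every variant type, which is exactly the kind of assumption the more complex probabilistic protocols try to relax.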

Selecting GDC Pipelines

Somatic genomic variant detection is still evolving, and promises to do so for some time to come. The adoption of variant artifact lists by many teams indicates that, even with many sophisticated, model-based tools, empiricism and manual inspection of automated calls remain part of rigorous variant calling, although we anticipate algorithm refinements that will be able to detect such artifacts automatically.

Pragmatic tradeoffs between quality and throughput must be made if the GDC is to apply standardized analysis to all datasets submitted. Since quality is one of several factors to optimize within the GDC objective framework, the GDC attempts to strike a balance. For instance, sequence depth and frequency place upper bounds on quality, so that small potential improvements from more time-consuming algorithms may not yield any additional strongly supported variants in a given dataset. GDC personnel alone can currently perform little manual curation (although crowd-sourced curation is a possibility for the future). Therefore, GDC pipelines must be essentially entirely automated, relatively fast, and have filters that are somewhat coarsely tuned to the GDC computational process. However, since both currency and transparency are also key factors, the GDC currently accepts the uncertainty inherent in the current state of calling and validation technologies, and publishes the outputs of multiple pipelines rather than attempting to create a consensus.

The GDC implements the latest versions of the pipelines developed by key Genome Sequencing Centers (Washington University, Baylor College of Medicine, MD Anderson Cancer Center, the Broad Institute, and the Wellcome Sanger Institute). Each of these variant callers has been employed in ever-improving versions in the sequence analysis of all TCGA tumor projects, and their results have been subject to the intense study and scrutiny of the cancer biology community. In collaboration with the GDC Bioinformatics Advisory Group and Steering Committee, the GDC has chosen for now both to implement these calling pipelines and to provide the result sets separately to users. During implementation, GDC bioinformaticians and engineers have taken advantage of generous assistance from the authors of each caller. Specific details of the GDC implementation of each pipeline may be found at this link.

The GDC believes that these pipelines are representative of the most reliable callers currently available. In keeping with its philosophy, however, the GDC will maintain up-to-date knowledge of the state of the art, and will regularly confer with experts and listen to comments and critiques from the user community. This built-in process will lead to change and improvement of variant calling and other software pipelines.

References

1. Alioto, T. et al., A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nat Commun. 6, 10001 (2015).

2. Costello, M. et al., Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res. 41, e67 (2013).

3. Nielsen, R., Paul, J., Albrechtsen, A. & Song, Y., Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet 12, 443-451 (2011).

4. Li, H., A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987-93 (2011).

5. Lawrence, M. et al., Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214-8 (2013).

6. Ye, K. et al., Systematic discovery of complex insertions and deletions in human cancers. Nat Med 22, 97-104 (2016).

7. Li, H. & Durbin, R., Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-60 (2009).

8. Cibulskis, K., Lawrence, M., Carter, S. et al., Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 31, 213-219 (2013).

9. Cibulskis, K. et al., ContEst: estimating cross-contamination of human samples in next-generation sequencing data. Bioinformatics 27, 2601-2 (2011).

10. Li, H., Ruan, J. & Durbin, R., Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18, 1851-1858 (2008).

11. Ewing, A. et al., Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat Methods 12, 623-30 (2015).

12. Kim, S. & Speed, T., Comparing somatic mutation-callers: beyond Venn diagrams. BMC Bioinformatics 14, 189 (2013).

(Created on: February 26, 2016 • Last updated on: June 12, 2017)