The following are some helpful resources for general information about cancer:
Helpful Cancer Genomics resource:
The mutation-based visualization features are derived from open-access MAF files produced by GDC variant-calling pipelines.
Data analysis and visualization features are only available for projects that maintain open-access MAF files. Programs such as TARGET maintain only controlled-access MAF files, so data analysis and visualization cannot be applied to TARGET projects.
IMPACT is categorized by the Sequence Ontology (SO) type of the variant. The VEP IMPACT rating is a separate rating given for compatibility with other variant annotation tools (e.g. snpEff). Each IMPACT category is associated with a set of SO terms:
Details about predicted data in variations are available at Ensembl.
Survival analysis in the GDC uses a Kaplan‑Meier estimator:
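As an illustration of the product-limit calculation behind a Kaplan-Meier curve, where the survival estimate at each event time is the previous estimate multiplied by (1 - deaths / subjects at risk), here is a minimal Python sketch. The function name and input layout are assumptions made for this example; it is not the GDC's implementation.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct death time.

    durations: follow-up time for each subject
    events:    True if the subject died at that time, False if censored
    """
    deaths = Counter(t for t, died in zip(durations, events) if died)
    exits = Counter(durations)   # every subject leaves the risk set at their own time
    at_risk = len(durations)     # n_i: subjects still at risk
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)     # d_i: deaths at time t
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]      # remove deaths and censored subjects
    return curve

# Hypothetical cohort of five subjects (times in arbitrary units).
curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
# Survival steps down to roughly 0.8, 0.6 and 0.3 at times 1, 2 and 3.
```

Censored subjects reduce the risk set without producing a step in the curve, which is why the censored subject at time 4 adds no point.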
Please refer to the GDC Data Portal User's Guide Projects section for additional information.
Yes. The GDC provides additional analysis endpoints to retrieve data sets associated with visualizations. Analysis endpoints include: survival, top_cases_counts_by_genes, top_mutated_genes_by_project, top_mutated_cases_by_gene, top_mutated_cases_by_ssm, and mutated_cases_count_by_project.
Please refer to the GDC API User's Guide Analysis Section for additional information.
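A request to one of these endpoints might be built as in the sketch below. It only constructs the URL and makes no network call. The base URL, the `/analysis/survival` path, and the layout of the `filters` parameter are assumptions to verify against the GDC API User's Guide, and `survival_url` is a hypothetical helper name.

```python
import json
from urllib.parse import urlencode

# Assumed public base URL of the GDC API.
GDC_API = "https://api.gdc.cancer.gov"

def survival_url(project_id):
    """Build a URL for the survival analysis endpoint, restricted to one project."""
    # Assumed filter layout: a list holding one equality filter on the project ID.
    filters = [{
        "op": "=",
        "content": {"field": "cases.project.project_id", "value": project_id},
    }]
    return f"{GDC_API}/analysis/survival?" + urlencode({"filters": json.dumps(filters)})

url = survival_url("TCGA-GBM")
```

The resulting URL can then be fetched with any HTTP client.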
Please refer to the GDC MAF File Specification to obtain detailed information on the format of GDC MAF files.
The “# Mutations” column in the Project or Exploration/Gene tab displays the number of distinct (unique) mutations within the affected cases and not necessarily the total number of all mutations within the project or query filter.
Within the GDC data analysis workflow, both public (somatic) MAFs and protected MAFs are generated by the same pipeline and link back to the same cases. For example, for the TCGA-GBM project, the somatic MAF has the following header:
# in TCGA.GBM.muse.7e85de23-3855-4279-a3ac-a81827e4ccb6.DR6.0.somatic.maf.gz
#version gdc-1.0.0
#filedate 20170307
#n.analyzed.samples 393
In general, n.analyzed.samples is used as the denominator when calculating mutation frequencies. If no variants for a case passed the GDC filters, the case is still counted; however, if the case was determined to be of poor quality (for example, high contamination or duplicates), it is not counted in the public MAF. In this particular project (TCGA-GBM), there were 396 cases with SNV data. The analysis pipeline found that 5 GBM tumor aliquots had high contamination. Of these 5 cases, 2 had another good tumor aliquot, but 3 had only the one contaminated aliquot. As a result, those 3 cases were removed from the public MAF, leaving the 393 analyzed samples reported in the header.
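The arithmetic above can be sketched in Python as follows. The header lines are the ones shown for TCGA-GBM; the helper names and the count of 100 mutated cases are hypothetical, chosen only to illustrate the calculation.

```python
def analyzed_samples(header_lines):
    """Return the n.analyzed.samples value from MAF header comment lines."""
    for line in header_lines:
        if line.startswith("#n.analyzed.samples"):
            return int(line.split()[1])
    raise ValueError("n.analyzed.samples not found in header")

def mutation_frequency(mutated_cases, n_analyzed):
    """Fraction of analyzed samples that carry a mutation."""
    return mutated_cases / n_analyzed

header = [
    "#version gdc-1.0.0",
    "#filedate 20170307",
    "#n.analyzed.samples 393",
]

n = analyzed_samples(header)
cases_with_snv_data = 396
cases_removed_for_contamination = 3
# The header value matches the cases that survived quality filtering.
consistent = (cases_with_snv_data - cases_removed_for_contamination == n)

# Hypothetical gene mutated in 100 cases of this project.
freq = mutation_frequency(100, n)
```

Cases with no passing variants stay in the denominator, so the frequency is not inflated by dropping unmutated cases.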
The GDC does not currently normalize mutation frequency by gene length; this is under discussion. As such, these genes appear in the mutated genes table. Users can filter by the COSMIC Cancer Gene Census to display only genes for which mutations have been causally implicated in cancer.
The cases in the OncoGrid are filtered by consequence type. Only cases with mutations of the following consequence types are displayed in the OncoGrid: {missense_variant, frameshift_variant, start_lost, stop_lost, initiator_codon_variant, stop_gained}.
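A consequence-type filter of this kind might look like the sketch below. The set of consequence types comes from the text; the record layout (pairs of case ID and consequence type) and the sample records are assumptions for illustration.

```python
# Consequence types that qualify a case for display.
DISPLAYED_CONSEQUENCES = {
    "missense_variant",
    "frameshift_variant",
    "start_lost",
    "stop_lost",
    "initiator_codon_variant",
    "stop_gained",
}

def cases_shown(mutations):
    """Return the IDs of cases with at least one mutation of a displayed type.

    mutations: iterable of (case_id, consequence_type) pairs
    """
    return {case for case, consequence in mutations
            if consequence in DISPLAYED_CONSEQUENCES}

# Hypothetical records: case-2 only has a synonymous variant, so it is dropped.
records = [
    ("case-1", "missense_variant"),
    ("case-1", "intron_variant"),
    ("case-2", "synonymous_variant"),
    ("case-3", "stop_gained"),
]
shown = cases_shown(records)
```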
To search for a mutation, use the Quick Search bar at the top right of the GDC Portal and enter either a dbSNP reference cluster ID (rs#) or the coordinates of the chromosomal change. For example, entering 'rs121912651' or 'chr17:g.7674221G>A' brings the user to the mutation entity page for that mutation.
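When preparing such search terms in a script, the two input forms can be told apart with a pair of regular expressions. This is a hypothetical sketch, not how the portal parses queries, and the second pattern covers only single-base substitutions written like the example above.

```python
import re

# dbSNP reference cluster ID, e.g. rs121912651
RS_ID = re.compile(r"^rs\d+$")
# Single-base genomic change, e.g. chr17:g.7674221G>A
GENOMIC_CHANGE = re.compile(r"^(chr(?:\d{1,2}|X|Y|M)):g\.(\d+)([ACGT])>([ACGT])$")

def classify_query(text):
    """Classify a search term as an rs ID, a genomic change, or unknown."""
    text = text.strip()
    if RS_ID.match(text):
        return ("dbsnp_rs", text)
    match = GENOMIC_CHANGE.match(text)
    if match:
        chrom, pos, ref, alt = match.groups()
        return ("genomic_change", chrom, int(pos), ref, alt)
    return ("unknown", text)

by_rs = classify_query("rs121912651")
by_coords = classify_query("chr17:g.7674221G>A")
```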
Fewer cases with mutations are displayed in 'Top Mutated Cancer Genes in Selected Projects' on the Project List Page because cases are filtered to those with mutations 1) in genes in the Cancer Gene Census and 2) with consequence types of {missense_variant, frameshift_variant, start_lost, stop_lost, initiator_codon_variant, stop_gained}.
Mutation frequency in the context of the OncoGrid represents total mutation occurrences in the gene (total count), while the # of Mutations listed on other portions of the GDC Portal represents the number of unique mutations on a gene or within a particular cohort.
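The difference between the two counts can be shown with a small Python sketch. The case and mutation identifiers are hypothetical placeholders.

```python
def mutation_counts(observations):
    """Return (total_occurrences, unique_mutations) for one gene.

    observations: list of (case_id, mutation_id) pairs, one per occurrence
    """
    total = len(observations)
    unique = len({mutation for _case, mutation in observations})
    return total, unique

# Hypothetical gene: the same mutation is seen in two different cases.
observations = [
    ("case-1", "mutation-A"),
    ("case-2", "mutation-A"),
    ("case-3", "mutation-B"),
]
total, unique = mutation_counts(observations)
# total counts every occurrence (3); unique counts distinct mutations (2).
```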
Target capture kits are used to "target" specific regions of a given genome for the Whole Exome Sequencing (WXS) and Targeted Sequencing experimental strategies. Users should therefore take care when comparing data from different target capture kits for the WXS and Targeted Sequencing experimental strategies because of potential differences in genomic regions targeted, variant filtering, and subsequent variants recovered. Additionally, users should also consider that Whole Genome Sequencing (WGS) does not use target capture kits and may thus recover variants that are excluded in the WXS and Targeted Sequencing experimental strategies.
From the GDC Data Portal "Repository" page, the names of the target capture kits can be accessed by clicking the "Adding a File Filter" link and choosing "analysis.metadata.read_groups.target_capture_kit" from the menu. The GDC does not distribute target or bait files because they are intellectual property. Users should contact each individual project directly if detailed target capture region information is needed.
The GDC does not perform batch effect corrections across samples for the following reasons:
As such, the GDC prefers that users perform their own batch effect removal.
A variety of target capture kits are used by different sequencing centers. Most whole exome capture kits share many common genomic regions, especially for cancer-related genes; however, which exons are included depends entirely on the vendor's library preparation kit. Capture regions often differ more among Targeted-Sequencing/Panel data.
The name of the capture kit used is available from the GDC read group properties. However, the GDC does not distribute the BED files for the read groups associated with these capture kits because some of them are proprietary.
For the reference genome, the GDC has been using an augmented version of GRCh38.p2 (with additional decoy sequences and virus sequences) since inception. The GDC does not use alternative contigs and only derives high-level data from the major chromosomes, so the same reference genome is used with both gene models: GENCODE v22 (Data Releases 1 to 31) and GENCODE v36 (from Data Release 32 onward). As future versions of the reference genome are released, e.g., GRCh39, the GDC will evaluate the benefits of updating data to use the new version. Updating the reference genome would be expected to require re-processing all data sets. For information on the reference genome used by the GDC, please refer to the GDC Reference Files.
The GDC prefers to keep workflows stable and will not update them unless necessary, for example for updates to the reference genome or gene model, or for major algorithm updates in the tools that could result in significant changes in the generated data. When workflow updates are needed, the GDC categorizes them as either major or minor, depending on whether the update significantly affects the output data. The GDC re-processes all existing data sets for major workflow updates. Examples include transitioning the RNA-Seq genomic BAM alignment workflow to a new version that generates three BAMs and STAR counts, and updating the MAF workflow to add functions to the MAF files. Minor updates mostly resolve bugs, security issues, and/or compatibility issues. For example, the GDC DNA-Seq alignment workflow has been updated several times to address quality issues in various submitted data; however, because the main alignment algorithm remains almost the same, the GDC does not need to re-process all data sets for these minor updates.
HTSeq had been the default RNA-Seq expression quantification tool since the first GDC data release. The GDC later updated the RNA-Seq alignment and quantification workflow to include STAR Count, which generates stranded counts by default in addition to the existing unstranded counts. In Data Release 32, as part of the gene model update, the GDC 1) augmented the existing STAR Count output to include FPKM and FPKM-UQ normalizations and 2) reprocessed all TCGA data using the latest RNA-Seq workflow with STAR Count. Because both tools use very similar counting strategies, and STAR Count has advantages in both running time and the additional stranded counts, the GDC removed the HTSeq workflow in Data Release 32.
Germline SNP calls are not available for exploration in the GDC Data Portal. Instead, alignments for germline data are available under controlled access. Users with appropriate access may use the alignments to generate germline variants.
Some somatic variant callers, such as MuTect2, also output somatic calls that carry some possibility of being germline, such as those labelled "germline_risk". Please note that these calls are by no means germline variants. They are somatic calls with a borderline probability of germline risk.
The SomaticSniper whole exome variant caller was one of the first-generation somatic mutation callers developed by the scientific community. It works best with blood cancers that have a high level of tumor-in-normal contamination, but is often overly permissive for solid tumors. Since its first data release in 2016, the GDC has gradually adopted newer tools or new tool versions, and has shifted the focus of somatic variant calling from any single caller to a multi-caller ensemble.
After comparing ensemble calls with and without SomaticSniper, and after receiving feedback from the authors of SomaticSniper, the GDC decided to remove this tool from production in Data Release 35. The GDC still maintains four other whole exome variant callers: MuSE, MuTect2, Pindel, and VarScan2.
Generally, any WGS data should have associated structural variant files (BEDPE), except in cases where either there is no tumor/normal match or variant calling has not yet been implemented.
To view the most frequently mutated genes for a cohort, first build a cohort using the Cohort Builder and then select the Mutation Frequency tool in the Analysis Center.