The following are some helpful resources for general information about cancer:
Helpful Cancer Genomics resource:
The NCI Genomic Data Commons (GDC) is the next generation repository and cancer knowledge base supporting the import and standardization of genomic and clinical data from cancer research programs (e.g. TCGA, TARGET, CGCI), the harmonization of sequence data to the genome / transcriptome, and the application of state-of-the-art methods for derived data (e.g. mutation calls, structural variants, etc.).
The NCI, part of the National Institutes of Health (NIH) and the U.S. Department of Health and Human Services (DHHS), established the GDC to provide the cancer research community with a data service supporting the receipt, quality control, integration, storage, and redistribution of standardized cancer genomic data sets derived from cancer studies. Please visit About the GDC for additional information.
The primary goal of the GDC is to provide the cancer research community with a unified repository and cancer knowledge base supporting cancer genomic studies. The cancer knowledge base enables the identification of low-frequency cancer drivers, assists in defining genomic determinants of response to therapy, and informs the composition of clinical trial cohorts sharing targeted genetic lesions. Working towards this goal, the GDC provides resources supporting the receipt, quality control, integration, storage, and redistribution of standardized cancer genomic data sets derived from cancer studies. Please visit the GDC Overview for additional information.
The GDC welcomes collaborations with organizations conducting research in or providing informatics supporting cancer genomics. Organizations interested in collaborating with the GDC should contact GDC Support.
No. All GDC data can be used in publications or presentations. For additional questions about the use of GDC data, or to explore opportunities for collaboration, please contact GDC Support.
Please credit the NCI Genomic Data Commons (GDC) in your manuscript by citing the following paper about the GDC:
Grossman, Robert L., Heath, Allison P., Ferretti, Vincent, Varmus, Harold E., Lowy, Douglas R., Kibbe, Warren A., Staudt, Louis M. (2016) Toward a Shared Vision for Cancer Genomic Data. New England Journal of Medicine 375:12, 1109-1112
When citing individual projects, please refer to the attribution policies of the project when available.
The GDC provides several resources for querying and downloading data from the GDC including the GDC Data Portal for querying and downloading GDC data files, the GDC Data Transfer Tool for downloading large volumes of files, and the GDC Application Programming Interface (API) for performing programmatic queries and downloads.
Information on the data submission processes and tools is available on the GDC Data Submission Processes and Tools page. Detailed instructions for submitting data into the GDC are provided in the GDC Data Submission Portal User's Guide. Per GDC Policy, organizations interested in submitting data into the GDC must first apply for data submitter access through the NIH database of Genotypes and Phenotypes (dbGaP).
Please visit the GDC Data Types and File Formats for a list of the standard data types supported by the GDC.
The GDC is harmonized against GRCh38. Please see GDC Data Harmonization for additional information on the GDC pipelines for re-aligning genomic data.
The GDC generates high level data for germline and somatic genotyping, RNA-Seq quantification and structural analysis, SNP Array Genotyping and CNV Calls, and variant annotations. Please visit GDC Data Harmonization for additional information on the GDC high level data generation pipelines.
GDC data is stored in a secure FISMA-compliant facility. Access to controlled data requires authorization via dbGaP. See Data Access Processes and Tools for more information.
Generally, browsing indexed GDC metadata (such as information about the cases and files contained in the GDC Data Portal) does not require a login.
eRA Commons authentication and dbGaP authorization are required before accessing controlled data, which generally includes individually identifiable information such as low level genomic sequencing data and germline variants.
Controlled-access data users log in to the GDC using their eRA Commons accounts. The GDC then verifies that the user has authorization in dbGaP to access specific controlled datasets.
See Obtaining Access to GDC Data and Resources for more information on data download, and Obtaining Access to Submit Data for information on data submission.
The GDC provides helpdesk support for data submission and other issues. For information on the GDC helpdesk, please visit GDC Support.
Users can perform advanced SQL-like queries using the GDC Data Portal Search interface. Instructions on using the GDC Data Portal Search interface are available in the GDC Data Portal User's Guide.
System requirements for using the GDC Data Transfer Tool are available on the GDC Data Transfer Tool page. Additional details are in the GDC Data Transfer Tool User's Guide.
Once the project has been registered through dbGaP please contact the GDC Helpdesk for assistance with setting up a new project.
The GDC provides code examples in the GDC Application Programming Interface (API) User's Guide
The GDC maintenance window occurs semi-monthly on Saturdays, from 8:00 am to 4:00 pm CST / 9:00 am to 5:00 pm EST.
The GDC Data Transfer Tool is recommended for transferring large datasets to or from GDC. For additional details, please visit the GDC Data Transfer Tool User’s Guide.
The GDC Data Transfer Tool does not offer a setting to limit the bandwidth it uses.
The GDC Data Transfer Tool uses sequential read/write for each file segment that is being transferred. By default, the tool executes multipart transfers, which results in multiple parallel, sequential read or write operations. To turn off multipart transfers, users can set the number of processes to 1.
GDC authentication tokens remain valid for 30 days.
The study and Subject IDs must be registered in dbGaP. For additional details, please visit: Obtaining Access to Submit Data.
The GDC validates genomic data (BAM files) using FASTQC and Picard. For additional details, please visit: GDC Data Harmonization.
Uploaded and validated data is put in a workspace until the user formally submits the data to the GDC. This allows users to interact with the data before submitting. Once the data is submitted, the GDC will process applicable datasets (e.g. harmonize molecular data and generate high level data). After processing has been completed, the data is made publicly available according to GDC Data Sharing Policies. The data becomes accessible through GDC tools (GDC Data Portal, GDC APIs) on open or controlled access basis according to the dbGaP authorization policies associated with the data set. For additional information, please visit: GDC Data Submission Processes and Tools.
Recurrent data releases allow the GDC to version the data and allow users to reference the GDC version number in publications. GDC currently generates releases as needed, with a release every 2-3 months as a goal.
The GDC employs a hierarchical data model which requires metadata and files to be attached only at particular nodes or points in the hierarchy. If you have questions, please review the GDC Data Model or contact GDC Support.
There has been long-standing debate about prefixes for multiples of bytes. We have chosen to use the standard supported by the International System of Units (SI), where 1 gigabyte (GB) = 10^9 bytes and 1 megabyte (MB) = 10^6 bytes. This convention is also supported by the IEEE, EU, NIST, and the International System of Quantities. Where appropriate, we use the IEEE 1541 recommendations for binary representation, where 1024^3 bytes = 1 gibibyte (GiB) and 1024^2 bytes = 1 mebibyte (MiB).
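The difference between the two conventions can be illustrated with a short conversion sketch. The function name `convert_bytes` and the example file size are hypothetical and chosen only for illustration; only the unit definitions come from the text above.

```python
# Unit sizes in bytes: SI (decimal) prefixes and IEEE 1541 (binary) prefixes.
SI_UNITS = {"MB": 10**6, "GB": 10**9}
BINARY_UNITS = {"MiB": 1024**2, "GiB": 1024**3}


def convert_bytes(n_bytes: int, unit: str) -> float:
    """Express a byte count in one of the supported SI or binary units."""
    factors = {**SI_UNITS, **BINARY_UNITS}
    if unit not in factors:
        raise ValueError(f"unsupported unit: {unit}")
    return n_bytes / factors[unit]


if __name__ == "__main__":
    size = 5 * 10**9  # hypothetical file of exactly 5 GB
    print(f"{convert_bytes(size, 'GB'):.2f} GB")
    print(f"{convert_bytes(size, 'GiB'):.2f} GiB")
```

A file reported as 5 GB under the SI convention is therefore slightly smaller than 5 GiB, which matters when comparing reported sizes against tools that display binary units.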
The GDC processes data through several harmonization pipelines. If the process of harmonization reveals issues in the underlying data or if an error occurred during harmonization, the harmonized data files (e.g. BAMs or VCFs) will not appear in GDC data access tools.
HIPAA guidelines require that patients with ages greater than 89 years be aggregated into a single age category. This is to limit the ability to positively identify these individuals. In practice this will impact the values reported in several fields. We have chosen to accurately display the age at diagnosis, but fields that give dates or time periods after this benchmark may be compressed. This may include such fields as "Days to last follow up", "Days to last known disease status", "Days to recurrence", "Days to death", and "Year of death". Other fields found in clinical supplement files may also be impacted. In the Data Portal you will only see ages of less than or equal to 90. Individuals over 89 will all appear as 90 years old.
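As a minimal sketch of the age aggregation described above, assuming ages are expressed in whole years (the function and constant names are hypothetical and this is not the GDC's own code):

```python
AGE_THRESHOLD = 89   # ages greater than this are aggregated
DISPLAYED_CAP = 90   # value shown for every aggregated individual


def displayed_age(age_years: int) -> int:
    """Return the age as it would appear after aggregation of ages over 89."""
    return DISPLAYED_CAP if age_years > AGE_THRESHOLD else age_years
```

Under this rule, individuals aged 90, 95, or 101 are indistinguishable in the displayed data, which is the intent of the aggregation.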
The following web browsers are supported for use with the GDC Data Portal, Submission Portal, Website, and Documentation site.
The GDC provides access to both open and controlled datasets. To access controlled datasets, users must obtain appropriate authorization through dbGaP. See Obtaining Access to Controlled Data for instructions on applying for access through dbGaP.
The GDC Data Portal is a web-based application that is limited by browser and network constraints. If a system timeout occurs when downloading files, please use the GDC Data Transfer Tool or contact the GDC Help Desk.
The GDC Data Submission Portal checks XML, JSON, and TSV metadata files for validity at the time they are submitted. If your files fail to validate, please check the error report and review the GDC Data Dictionary for troubleshooting these errors. Additional information on supported files and formats can be found on the GDC Data Model and File Formats pages, and in the GDC Data Submission Portal User's Guide.
TCGA RPPA data is available in the GDC Data Portal.
DNA methylation data collected by TCGA has been harmonized using the SeSAMe pipeline and is available in the GDC Data Portal.
Capture kit information is provided by the GDC API at the read group level, where available. In some cases, additional information may be available in SRA XML files.
The relevant read_group properties returned by the GDC API are:
target_capture_kit_name
target_capture_kit_catalog_number
target_capture_kit_vendor
target_capture_kit_target_region
The target_capture_kit_target_region field provides a URL for the capture kit target file, distributed by the kit manufacturer or by the research program. Bait/probe files can sometimes be found at the same URL, or a URL to the bait/probe file may be available in the SRA XML file.
Note: Some BAM files include information from multiple read groups, and sometimes read groups produced with different capture kits are included in the same BAM file. Tools are available for splitting BAM files into read groups, e.g. bamutil.
Note: Target and bait/probe files may use an older reference genome, so liftover may be required for certain applications.
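The read group properties listed above could be requested from the files endpoint with a URL along the following lines. This is a sketch under assumptions: the `fields` query parameter and the `analysis.metadata.read_groups` field prefix are assumed here rather than taken from this section, the helper name is hypothetical, and the UUID is only a placeholder reused from elsewhere in this FAQ. The code only builds the URL and performs no network request.

```python
from urllib.parse import urlencode

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

# Assumed path from a file record down to its read groups.
READ_GROUP_PATH = "analysis.metadata.read_groups"

CAPTURE_KIT_PROPERTIES = [
    "target_capture_kit_name",
    "target_capture_kit_catalog_number",
    "target_capture_kit_vendor",
    "target_capture_kit_target_region",
]


def capture_kit_query_url(file_uuid: str) -> str:
    """Build a files-endpoint URL requesting the capture kit properties."""
    fields = ",".join(f"{READ_GROUP_PATH}.{prop}" for prop in CAPTURE_KIT_PROPERTIES)
    query = urlencode({"fields": fields, "pretty": "true"})
    return f"{FILES_ENDPOINT}/{file_uuid}?{query}"


if __name__ == "__main__":
    print(capture_kit_query_url("53f4ad60-0777-409c-a34d-ca4442dc9c44"))
```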
Harmonized BAM files from RNA-seq and DNA-seq experiments will contain both mapped and unmapped reads, if available. Unmapped reads are not distributed separately.
Raw sequencing files submitted to the GDC are processed using GDC Genomic Data Alignment pipelines. The processed data are made available in the GDC Data Portal as BAM files containing aligned reads and unmapped reads (if available). No reads are hard-clipped, but reads that were flagged as "failed" during an Illumina sequencing run are discarded.
Third-party tools such as biobambam2 or Samtools fastq can convert these files to FASTQ sequencing data. Note that DNA-Seq quality scores are modified during the score recalibration co-cleaning step, so third-party tool parameters must be set to retrieve the original scores (biobambam2: tryoq=1; samtools fastq: -O). Because GDC harmonized BAM files may contain multiple read groups, the conversion parameter should be set to retain read group IDs in the generated FASTQ files (biobambam2: outputperreadgroup=1; samtools: samtools split).
Some of the reasons particular mutations may have been removed include updates to third party databases, more conservative germline-masking rules by the GDC, and different mutation calling pipelines and versions. Despite these differences, the GDC recaptures over 97% of TCGA-validated variants in the controlled-access MAF files. The GDC suggests using controlled-access MAF files if important variants cannot be found in somatic MAF files.
The file detail page and the metadata files accessible from that page (if available) can be used to determine the difference between files that share the same filename. For example, the files may be associated with different aliquots, or different patients.
BAI files are included with the download when using the GDC Data Transfer Tool to download BAM files.
When using the API to download BAM files, BAI files will only be included if the related_files=true parameter is specified together with the BAM UUID, for example:
https://api.gdc.cancer.gov/data/53f4ad60-0777-409c-a34d-ca4442dc9c44?related_files=true
Alternatively, users can determine the BAI file UUID from the API files endpoint by supplying the BAM UUID. The BAI file UUID can then be used to download the BAI file from the data endpoint.
https://api.gdc.cancer.gov/files/53f4ad60-0777-409c-a34d-ca4442dc9c44?pretty=true&expand=index_files
https://api.gdc.cancer.gov/data/60cefd89-b428-46b7-b5b0-3b6e2743ab20
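The three request URLs above can be assembled programmatically, as in the sketch below. The helper names are hypothetical, and the response layout read by `index_file_ids` (a `data` object holding an `index_files` list whose entries carry a `file_id`) is an assumption that should be checked against an actual API response. No network request is made.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.gdc.cancer.gov"


def bam_with_bai_url(bam_uuid: str) -> str:
    """Data endpoint URL that also returns related files such as the BAI."""
    return f"{API_ROOT}/data/{bam_uuid}?" + urlencode({"related_files": "true"})


def index_lookup_url(bam_uuid: str) -> str:
    """Files endpoint URL that expands the index files of a BAM."""
    query = urlencode({"pretty": "true", "expand": "index_files"})
    return f"{API_ROOT}/files/{bam_uuid}?{query}"


def data_url(file_uuid: str) -> str:
    """Data endpoint URL for a single file, e.g. a BAI file."""
    return f"{API_ROOT}/data/{file_uuid}"


def index_file_ids(response: dict) -> list:
    """Extract index file UUIDs from a parsed response (assumed layout)."""
    index_files = response.get("data", {}).get("index_files", [])
    return [entry["file_id"] for entry in index_files]
```

With the BAM UUID from the example above, `index_lookup_url` reproduces the second URL, and the UUID returned by `index_file_ids` can be passed to `data_url` to fetch the BAI file.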
Note: BAI files are not available for sliced BAM files.
The mutation-based visualization features are derived from open-access MAF files produced by GDC variant-calling pipelines.
Data analysis and visualization features are only available for projects which maintain open-access MAF files. Programs such as TARGET maintain controlled-access MAF files only. As such, data analysis and visualization cannot be applied to TARGET projects.
The IMPACT is categorized by the Sequence Ontology (SO) type of the variant, in a way that is also compatible with snpEff. The VEP IMPACT rating is a separate rating given for compatibility with other variant annotation tools (e.g. snpEff). Each category is associated with a set of SO terms:
Details about predicted data in variations are available at ENSEMBL
Survival analysis in the GDC uses a Kaplan‑Meier estimator:
Please refer to the GDC Data Portal User's Guide Projects for additional information.
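For readers unfamiliar with the Kaplan-Meier estimator mentioned above, the following is a minimal sketch of its standard product-limit form, in which the survival estimate is multiplied by (1 - d_i / n_i) at each event time t_i, where d_i is the number of deaths at t_i and n_i is the number of cases still at risk just before t_i. This is a textbook illustration with hypothetical names and inputs, not the GDC's implementation.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_estimate)] at each time with an observed death.

    durations: follow-up time for each case.
    events: True if death was observed at that time, False if censored.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")

    deaths = Counter(t for t, observed in zip(durations, events) if observed)
    leaving = Counter(durations)  # cases leaving the risk set at each time

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]  # deaths and censored cases both leave
    return curve
```

Censored cases reduce the number at risk without lowering the survival estimate, which is why the curve only steps down at times with an observed death.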
Yes. The GDC provides additional analysis endpoints to retrieve data sets associated with visualizations. Analysis endpoints include: survival, top_cases_counts_by_genes, top_mutated_genes_by_project, top_mutated_cases_by_gene, top_mutated_cases_by_ssm, and mutated_cases_count_by_project.
Please refer to the GDC API User's Guide Analysis Section for additional information.
Please refer to the GDC MAF File Specification to obtain detailed information on the format of GDC MAF files.
The “# Mutations” column in the Project or Exploration/Gene tab displays the number of distinct (unique) mutations within the affected cases and not necessarily the total number of all mutations within the project or query filter.
Within the GDC data analysis workflow, both public (somatic) MAFs and protected MAFs are generated from the same pipeline and link back to the same cases. For example, for the TCGA-GBM project, the somatic MAF has the following header:
# in TCGA.GBM.muse.7e85de23-3855-4279-a3ac-a81827e4ccb6.DR6.0.somatic.maf.gz
#version gdc-1.0.0
#filedate 20170307
#n.analyzed.samples 393
In general, n.analyzed.samples is used as a denominator to calculate mutation frequencies. If no variants for a case passed our filters, the case should still be counted; however, if the case was determined to have poor quality (such as high contamination, duplicates, etc.), it is not counted in the public MAF. In this particular project (TCGA-GBM), there were 396 cases with SNV data. Our analysis pipeline revealed that among them, a total of 5 GBM tumor aliquots had high contamination. Among these 5 patients, 2 had another good tumor aliquot, but 3 had only one aliquot. As a result, those 3 cases were removed from the public MAF, leaving 393 analyzed samples.
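The use of n.analyzed.samples as a denominator can be sketched as follows. The parser assumes header lines of the form `#key value`, as in the example header above; the function names and the count of mutated cases are hypothetical.

```python
def parse_maf_header(lines):
    """Collect leading '#key value' header lines of a MAF file into a dict."""
    header = {}
    for line in lines:
        if not line.startswith("#"):
            break  # header ends at the first non-comment line
        parts = line[1:].strip().split(None, 1)
        if len(parts) == 2:
            header[parts[0]] = parts[1]
    return header


def mutation_frequency(n_mutated_cases: int, header: dict) -> float:
    """Fraction of analyzed samples carrying a mutation of interest."""
    return n_mutated_cases / int(header["n.analyzed.samples"])


if __name__ == "__main__":
    example = ["#version gdc-1.0.0", "#filedate 20170307", "#n.analyzed.samples 393"]
    header = parse_maf_header(example)
    # 131 mutated cases is a made-up number used only for illustration.
    print(mutation_frequency(131, header))
```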
The GDC is not normalizing frequency by gene length. This is currently under discussion. As such, these genes are appearing in the mutated genes table. Users can filter by the COSMIC Cancer Gene Census to display only genes for which mutations have been causally implicated in cancer.
The cases in the OncoGrid are filtered by consequence type. Only cases that have mutations that have consequence types of: {missense_variant, frameshift_variant, start_lost, stop_lost, initiator_codon_variant, stop_gained} are displayed in the OncoGrid.
To search for a mutation, you can utilize the Quick Search bar at the top right portion of the GDC Portal by entering in either a dbSNP reference cluster ID (rs#) or the coordinates of the chromosomal change. For example entering in 'rs121912651' or 'chr17:g.7674221G>A' will bring the user to the mutation entity page for that mutation.
There are fewer cases displayed with mutations in the 'Top Mutated Cancer Genes in Selected Projects' on the Project List Page because there is a filter on cases that have mutations on 1) Genes in the Cancer Gene Census and 2) Mutations with consequence types of {missense_variant, frameshift_variant, start_lost, stop_lost, initiator_codon_variant, stop_gained}.
Mutation frequency in the context of the OncoGrid represents total mutation occurrences in the gene (total count), while the # of Mutations listed on other portions of the GDC Portal represents the number of unique mutations on a gene or within a particular cohort.
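The distinction between total mutation occurrences and unique mutations, and the consequence-type filter applied to displayed cases, can be sketched as below. The record layout (case ID, mutation ID, consequence type) and the function names are hypothetical; only the set of consequence types is taken from the text above.

```python
DISPLAYED_CONSEQUENCES = {
    "missense_variant",
    "frameshift_variant",
    "start_lost",
    "stop_lost",
    "initiator_codon_variant",
    "stop_gained",
}


def count_mutations(records):
    """Return (total occurrences, number of unique mutations).

    records: iterable of (case_id, mutation_id, consequence_type) tuples.
    """
    records = list(records)
    total = len(records)
    unique = len({mutation_id for _, mutation_id, _ in records})
    return total, unique


def cases_with_displayed_consequences(records):
    """Return the set of cases with at least one mutation of a displayed type."""
    return {case_id for case_id, _, consequence in records
            if consequence in DISPLAYED_CONSEQUENCES}
```

A mutation seen in two cases counts twice toward total occurrences but once toward unique mutations, which is why the two figures can differ for the same gene.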
Selecting the files of interest in the GDC Portal and adding them to the cart will give access to the "Sample Sheet" file. This file contains many fields like "File Name", "File ID" (UUID), and "Sample Type" for each file. The "Sample Type" field will denote whether the sample is from tumor or normal tissue and the other fields can be used to locate the appropriate files.
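A downloaded Sample Sheet can be grouped by sample type with a few lines of code. This sketch assumes the file is tab-delimited and uses the column names quoted above; the function name and the example rows are hypothetical.

```python
import csv
import io


def files_by_sample_type(sample_sheet_text: str) -> dict:
    """Group (File ID, File Name) pairs by the 'Sample Type' column."""
    reader = csv.DictReader(io.StringIO(sample_sheet_text), delimiter="\t")
    grouped = {}
    for row in reader:
        grouped.setdefault(row["Sample Type"], []).append(
            (row["File ID"], row["File Name"])
        )
    return grouped


if __name__ == "__main__":
    # Made-up rows for illustration only.
    sheet = (
        "File ID\tFile Name\tSample Type\n"
        "uuid-1\ttumor.bam\tPrimary Tumor\n"
        "uuid-2\tnormal.bam\tBlood Derived Normal\n"
    )
    for sample_type, files in files_by_sample_type(sheet).items():
        print(sample_type, files)
```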
The GDC uses eRA Commons login credentials to determine access to controlled data. The eRA Commons password expires periodically, which prevents users from logging in. In that case, the user will need to visit the eRA Commons login site. After logging in, the user will be prompted to create a new password.
The availability of follow-up data is specific to the project and associated study.
For the Multiple Myeloma Research Foundation (MMRF) Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile (CoMMpass) study, longitudinal information was generated to track patients over the course of their disease. This data is available in the GDC by viewing the clinical follow-up data available for download on each case page in the GDC Data Portal or by querying the GDC API.
For TCGA, follow-up data is available for specific TCGA studies and made available for download in associated clinical supplement files (i.e. clinical XML, biotabs). Follow-up data can be different for the different TCGA studies.
Submitting treatment data is optional as not all projects are associated with treatment studies. For TCGA projects, for example, not all projects and cases have treatment data. For TCGA projects with treatment data, information is available in applicable clinical supplement files (i.e. clinical XML, biotabs). For other projects associated with treatment studies in which the treatment data has been submitted to the GDC, treatment data is available for download in JSON and TSV format. These studies may also contain clinical supplement files.
The TCGA marker and other landmark cancer genomics papers, as well as associated supplemental files, are available on the GDC Publication Pages. The Publication Pages provide access to publication information and supplementary files.
The GDC harmonizes data across projects. This includes aligning the genomic data to a common reference genome (HG38) and generating higher level data using GDC bioinformatics pipelines. Other repositories may process the data differently.
For example, TCGA data in cBioPortal uses the original mutation data generated by the individual TCGA sequencing centers. The source of the data is the Broad Firehose (or the publication pages for data that matches a specific manuscript). These data are usually a combination of two mutation callers, but they differ by center (typically a variant caller like MuTect plus an indel caller), and sequencing centers have modified their mutation calling pipelines over time. TCGA data in the GDC is harmonized with the latest reference genome (GRCh38). Mutations are called using four variant callers: MuTect2, VarScan2, MuSE, and Pindel.
TCGA “collection” represents the collection of the sample for TCGA, whereas “procurement” represents the removal of tissue from the patient.
The GDC supports the submission of clinical and biospecimen supplements. Supplemental files can be downloaded from the GDC by searching for the Data Type "Clinical Supplement" or "Biospecimen Supplement" from the facet search in the GDC Data Portal Repository. For TCGA data, the supplement data is provided in XML documents and tab delimited files (biotabs). These files, in varying degrees, provide information on marker status (e.g. EBV status), treatment regimen, slide magnification, histology distinctions, and staging questions.
Target capture kits are used to "target" specific regions of a given genome for the Whole Exome Sequencing (WXS) and Targeted Sequencing experimental strategies. Users should therefore take care when comparing data from different target capture kits for the WXS and Targeted Sequencing experimental strategies because of potential differences in genomic regions targeted, variant filtering, and subsequent variants recovered. Additionally, users should also consider that Whole Genome Sequencing (WGS) does not use target capture kits and may thus recover variants that are excluded in the WXS and Targeted Sequencing experimental strategies.
From the GDC Data Portal "Repository" page, the names of the target capture kits can be accessed by clicking the "Adding a File Filter" link and choosing "analysis.metadata.read_groups.target_capture_kit" from the menu. The GDC does not distribute target or bait files because they are intellectual property. Users should contact each individual project directly if detailed target capture region information is needed.
GENCODE gene sets are continuously updated to improve the coverage and accuracy. GENCODE 36, which was released in October of 2020, includes many updates to definitions of genes, transcripts, long non-coding RNAs, and other types of annotations. The previous version used by the GDC (GENCODE 22) was released in March 2015. Both versions were built on Ensembl genome assembly GRCh38.
The fewer open-access mutations result primarily from two strategies that improve quality: 1) TCGA is now using a 2-caller ensemble, instead of a single caller; 2) Removal of variants outside of the target capture region, instead of a combined “target capture + GAF exonic region”. Additionally, TCGA was the original project in which GDC open-access variants were produced and used variant rescue steps that only applied to TCGA. To keep the TCGA variant-calling pipeline consistent across projects, GDC is no longer rescuing MC3 and TCGA validation variants.
No. COSMIC genes present in the prior release were removed in DR32. For this reason, the GDC considers the current release to be a higher-quality set of variants.
Although GENCODE v22 data cannot be browsed in the GDC Data Portal, it can still be downloaded using the GDC Data Transfer Tool or API. You will need to either have a previous manifest or use known UUIDs to download v22 files.
Whenever new parameters are introduced to a bioinformatics pipeline, such as a new gene model, there is a chance that the analysis could fail. A list of aliquots that currently do not appear in the v36 data can be found in the Data Release Notes.
The GDC RNA-Seq workflow generates STAR counts in three different modes: unstranded, stranded_first, and stranded_second. The GDC then uses the unstranded counts as the main input for the subsequent FPKM and TPM normalizations to facilitate cross-project comparisons of data with different strandedness.
If you are interested in using the stranded data in the STAR Count gene expression output, you can make a guess by comparing N_ambiguous: if one stranded type has a much lower N_ambiguous value than both the other stranded type and the unstranded count, it is a good indicator that a stranded library was used. Please note that knowing a library was prepared with a strand-specific RNA-Seq kit does not necessarily guarantee that the resulting library is stranded. In addition, data of different strandedness cannot be compared to each other directly.
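The N_ambiguous comparison described above could be automated roughly as follows. The text does not define how much lower counts as "much lower", so the `ratio` threshold of 0.5 is an arbitrary assumption, and the function name and return labels are hypothetical. The result is a heuristic guess, not a definitive call.

```python
def guess_strandedness(n_ambiguous: dict, ratio: float = 0.5) -> str:
    """Guess the library type from N_ambiguous values of the three STAR modes.

    n_ambiguous: dict with keys 'unstranded', 'stranded_first', 'stranded_second'.
    ratio: assumed cutoff; a mode counts as "much lower" when its value is
           below ratio times the smaller of the other two values.
    """
    unstranded = n_ambiguous["unstranded"]
    first = n_ambiguous["stranded_first"]
    second = n_ambiguous["stranded_second"]

    if first < ratio * min(second, unstranded):
        return "stranded_first"
    if second < ratio * min(first, unstranded):
        return "stranded_second"
    return "unstranded_or_unclear"
```

When neither stranded mode stands out, the sketch deliberately reports an unclear result rather than forcing a choice.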
The GDC does not perform batch effect corrections across samples for the following reasons:
As such, the GDC prefers that users perform their own batch effect removal.
A variety of target capture kits are used by different sequencing centers. Most of the whole exome capture kits share many common genomic regions, especially for cancer related genes; however, which exons are included depends entirely on the vendor's library preparation kit. There are often more differences among capture regions from different Targeted-Sequencing/Panel data.
The name of the capture kit used is available from the GDC read group properties. However, the GDC does not distribute the BED files for the read groups associated with these capture kits because some of them are proprietary.
For the reference genome, the GDC has been using an augmented version of GRCh38.p2 (with additional decoy sequences and virus sequences) since inception. The GDC does not use alternative contigs, and only derives high-level data from the major chromosomes, so the same reference genome is used for both gene model GENCODE v22 (from Data Release 1 to 31) and GENCODE v36 (from Data Release 32). As future versions of the reference genome are released, e.g., GRCh39, the GDC will evaluate the benefits of updating data to utilize the new version. By updating the reference genome, the GDC would expect to re-process all data sets. For information on the reference genome used by the GDC, please refer to the GDC Reference Files.
For workflow updates, the GDC prefers to keep the workflow stable, and will not update unless there are necessary updates such as updates of the reference genome or gene model, or major algorithm updates in the tools that could result in significant changes in the generated data. When workflow updates are needed, the GDC categorizes them as either major updates or minor updates depending on whether the update significantly affects the output data. The GDC will re-process all existing data sets in major workflow updates; examples include transitioning the RNA-Seq genomic BAM alignment workflow to a new version that generates three BAMs and STAR counts, and updating the MAF workflow to add additional functions to the MAF files. Minor updates mostly happen to resolve bugs, security issues, and/or compatibility issues. For example, the GDC DNA-Seq alignment workflow has been updated several times to address quality issues from various submitted data; however, because the main alignment algorithm remains almost the same, the GDC does not need to re-process all the data sets for these minor updates.
HTSeq had been the default RNA-Seq expression quantification tool since the first GDC data release. The GDC later updated the RNA-Seq alignment and quantification workflow to include STAR Count, which generates stranded counts by default in addition to the existing unstranded counts. During Data Release 32, as part of the gene model update, the GDC 1) augmented the existing STAR Count output to include FPKM and FPKM-UQ normalizations and 2) reprocessed all the TCGA data using the latest RNA-Seq workflow with STAR Count. Because both tools use very similar counting strategies, and STAR Count has advantages in both running time and the additional stranded counts, the GDC removed the HTSeq workflow in Data Release 32.
Germline SNP calls are not available for exploration in the GDC Data Portal. Instead, alignments for germline data are available under controlled access. Users with appropriate access may use the alignments to generate germline variants.
Some somatic variant callers, such as MuTect2, also output somatic calls with some level of germline possibility, such as those labelled as "germline_risk". Please note that these calls are by no means germline variants. They are somatic calls with a borderline probability of being germline.
The SomaticSniper whole exome variant caller was one of the first-generation somatic mutation callers developed by the scientific community. It works best with blood cancers that have a high level of tumor-in-normal contamination, but is often overly permissive for solid tumors. Since our first data release in 2016, the GDC has gradually adopted newer tools or new tool versions, and has shifted the focus of somatic variant calling from any single caller to a multi-caller ensemble.
After comparing ensemble calls with and without SomaticSniper and also receiving feedback from the authors of SomaticSniper, the GDC decided to remove this tool from production in Data Release 35. The GDC still maintains four other whole exome variant callers: MuSE, MuTect2, Pindel, and VarScan2.
Generally, any WGS data should have associated structural variant files (BEDPE), except when there are no tumor/normal matches or when variant calling has not been implemented yet.
Filtering in Cohort Builder results in a set of cases. Cases typically have available data for multiple Experimental Strategies (e.g., cases with RNA-Seq data can also have DNA-Seq data). As such, when navigating to Repository additional data besides RNA-Seq data is displayed. Repository filters are file based. When filtering Repository by Experimental Strategy (e.g., RNA-Seq) only files associated with the Experimental Strategy are displayed.
To view the most frequently mutated genes for a cohort, first build a cohort using the Cohort Builder and then select the Mutation Frequency tool in the Analysis Center.
Filtering in Cohort Builder results in a set of cases. A case can have multiple diagnoses, and a tissue or organ of origin may be related to a secondary diagnosis.