What is the NCI Genomic Data Commons (GDC)?

The NCI Genomic Data Commons (GDC) is the next-generation cancer knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs (e.g. TCGA, TARGET, CGCI), the harmonization of sequence data to the genome / transcriptome, and the application of state-of-the-art methods for derived data (e.g. mutation calls, structural variants, etc.).

The NCI, part of the National Institutes of Health (NIH) and the U.S. Department of Health and Human Services (DHHS), established the GDC to provide the cancer research community with a data service supporting the receipt, quality control, integration, storage, and redistribution of standardized cancer genomic data sets derived from cancer studies. Please visit About the GDC for additional information.

What are the goals of the GDC?

The primary goal of the GDC is to provide the cancer research community with a unified data repository supporting cancer genomic studies. The unified repository will provide a cancer knowledge network that enables the identification of low-frequency cancer drivers, assists in defining genomic determinants of response to therapy, and informs the composition of clinical trial cohorts sharing targeted genetic lesions. Working towards this goal, the GDC provides resources supporting the receipt, quality control, integration, storage, and redistribution of standardized cancer genomic data sets derived from cancer studies. Please visit the GDC Overview for additional information.

How can I collaborate with the GDC?

The GDC welcomes collaborations with organizations conducting research in or providing informatics supporting cancer genomics. Organizations interested in collaborating with the GDC should contact GDC Support.

Are there restrictions on the use of GDC data in publications?

No. All GDC data can be used in publications or presentations. For additional questions about the use of GDC data, or to explore opportunities for collaboration, please contact GDC Support.

How do I cite the NCI GDC?

Please credit the NCI Genomic Data Commons (GDC) in your manuscript by citing the following paper about the GDC:

Grossman, Robert L., Heath, Allison P., Ferretti, Vincent, Varmus, Harold E., Lowy, Douglas R., Kibbe, Warren A., Staudt, Louis M. (2016) Toward a Shared Vision for Cancer Genomic Data. New England Journal of Medicine 375:12, 1109-1112.

When citing individual projects, please refer to the attribution policies of the project when available.

How do I query and download data from the GDC?

The GDC provides several resources for querying and downloading data from the GDC including the GDC Data Portal for querying and downloading GDC data files, the GDC Data Transfer Tool for downloading large volumes of files, and the GDC Application Programming Interface (API) for performing programmatic queries and downloads.

How do I submit data into the GDC?

Information on the data submission processes and tools is available on the GDC Data Submission Processes and Tools page. Detailed instructions for submitting data into the GDC are provided in the GDC Data Submission Portal User's Guide. Per GDC Policy, organizations interested in submitting data into the GDC must first apply for data submitter access through the NIH database of Genotypes and Phenotypes (dbGaP).

What data types and file formats does the GDC support?

Please visit the GDC Data Types and File Formats page for a list of the standard data types supported by the GDC.

What reference genome is the GDC harmonized against?

The GDC is harmonized against GRCh38. Please see GDC Data Harmonization for additional information on the GDC pipelines for re-aligning genomic data.

How does the GDC generate high level data?

The GDC generates high level data for germline and somatic genotyping, RNA-Seq quantification and structural analysis, SNP Array Genotyping and CNV Calls, and variant annotations. Please visit GDC Data Harmonization for additional information on the GDC high level data generation pipelines.

How does the GDC maintain secure access to controlled access data?

GDC data is stored in a secure FISMA-compliant facility. Access to controlled data requires authorization via dbGaP. See Data Access Processes and Tools for more information.

How do I obtain an account to log in to the GDC?

Generally, browsing indexed GDC metadata (such as information about the cases and files contained in the GDC Data Portal) does not require a login.

eRA Commons authentication and dbGaP authorization are required before accessing controlled data, which generally includes individually identifiable information such as low level genomic sequencing data and germline variants.

Controlled-access data users log in to the GDC using their eRA Commons accounts. The GDC then verifies that the user has authorization in dbGaP to access specific controlled datasets.

See Obtaining Access to GDC Data and Resources for more information on data download, and Obtaining Access to Submit Data for information on data submission.

Where do I go to report an issue or submit an inquiry about the GDC?

The GDC provides helpdesk support for data submission and other issues. For information on the GDC helpdesk, please visit GDC Support.

How do I create an advanced search query?

Users can perform advanced SQL-like queries using the GDC Data Portal Search interface. Instructions on using the GDC Data Portal Search interface are available in the GDC Data Portal User's Guide.

What are the system requirements for using the GDC Data Transfer Tool?

System requirements for using the GDC Data Transfer Tool are available on the GDC Data Transfer Tool page.  Additional details are in the GDC Data Transfer Tool User's Guide.

How do I register my project with the GDC?

Once the project has been registered through dbGaP, please contact the GDC Helpdesk for assistance with setting up a new project.

Where do I go to find code examples for using the GDC API?

The GDC provides code examples in the GDC Application Programming Interface (API) User's Guide.

When is GDC maintenance performed?

The GDC maintenance window occurs semi-monthly on Saturdays, from 8:00 am to 4:00 pm CST (9:00 am to 5:00 pm EST).

What is the recommended tool and protocol for transferring large volumes of data to or from the GDC?

The GDC Data Transfer Tool is recommended for transferring large datasets to or from the GDC. For additional details, please visit the GDC Data Transfer Tool User's Guide.

When using the GDC Data Transfer Tool, is it possible to set a bandwidth limit?

The GDC Data Transfer Tool does not offer a setting to limit the bandwidth it uses.

Does the GDC Data Transfer Tool use random or sequential read/write? Does the choice of protocol make a difference?

The GDC Data Transfer Tool uses sequential read/write for each file segment that is being transferred. By default, the tool executes multipart transfers, which results in multiple parallel, sequential read or write operations. To turn off multipart transfers, users can set the number of processes to 1.

How long do GDC authentication tokens remain valid?

GDC authentication tokens remain valid for 30 days.

What steps must be taken in dbGaP before data can be submitted to the GDC?

The study and Subject IDs must be registered in dbGaP. For additional details, please visit: Obtaining Access to Submit Data.

How is validation performed on genomic data (BAM files) submitted to the GDC?

The GDC validates genomic data (BAM files) using FASTQC and Picard. For additional details, please visit: GDC Data Harmonization.

What is the process for uploading, submitting, and releasing data in the GDC?

Uploaded and validated data is put in a workspace until the user formally submits the data to the GDC. This allows users to interact with the data before submitting. Once the data is submitted, the GDC will process applicable datasets (e.g. harmonize molecular data and generate high level data). After processing has been completed, the data is made publicly available according to GDC Data Sharing Policies. The data becomes accessible through GDC tools (GDC Data Portal, GDC APIs) on open or controlled access basis according to the dbGaP authorization policies associated with the data set. For additional information, please visit: GDC Data Submission Processes and Tools.

Why does the GDC have data releases and how often do they happen?

Recurrent data releases allow the GDC to version the data and allow users to reference the GDC version number in publications. GDC currently generates releases as needed, with a monthly release as a goal.

Where can I find more information about the GDC data model?

The GDC employs a hierarchical data model which requires metadata and files to be attached only at particular nodes or points in the hierarchy.  If you have questions, please review the GDC Data Model or contact GDC Support.

How many bytes are there in a megabyte or gigabyte?

There has been long-standing debate about prefixes for multiples of bytes. We have chosen to use the standard supported by the International System of Units (SI), where 1 gigabyte (GB) = 10^9 bytes and 1 megabyte (MB) = 10^6 bytes. This convention is also supported by the IEEE, EU, NIST, and the International System of Quantities. Where appropriate, we use the IEEE 1541 recommendations for binary representation, where 1024^3 bytes = 1 gibibyte (GiB) and 1024^2 bytes = 1 mebibyte (MiB).
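
As a small illustration of the two conventions, the following Python sketch converts a byte count into SI gigabytes and binary gibibytes. The constant and function names are illustrative only and are not part of any GDC tool.

```python
# SI (decimal) prefixes are powers of 10.
MB = 10**6
GB = 10**9
# IEEE 1541 (binary) prefixes are powers of 1024.
MIB = 1024**2
GIB = 1024**3


def to_gigabytes(n_bytes):
    """Return the size in SI gigabytes (GB)."""
    return n_bytes / GB


def to_gibibytes(n_bytes):
    """Return the size in binary gibibytes (GiB)."""
    return n_bytes / GIB


if __name__ == "__main__":
    size = 5 * GIB  # a file of exactly 5 GiB, expressed in bytes
    print(f"{to_gigabytes(size):.2f} GB")
    print(f"{to_gibibytes(size):.2f} GiB")
```

The same file therefore reports a slightly larger number in GB than in GiB, which is worth keeping in mind when comparing sizes shown by different tools.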

Why are some harmonized data files missing?

The GDC processes data through several harmonization pipelines. If the process of harmonization reveals issues in the underlying data or if an error occurred during harmonization, the harmonized data files (e.g. BAMs or VCFs) will not appear in GDC data access tools.

What is the difference between TCGA data available in the GDC Data Portal and TCGA data available in the GDC Legacy Archive?

Data in the GDC Data Portal has been harmonized using GDC Bioinformatics Pipelines whereas data in the GDC Legacy Archive is an unmodified copy of data that was previously stored in CGHub and in the TCGA Data Portal hosted by the TCGA Data Coordinating Center (DCC).

Certain previously available data types and formats are not currently supported by the GDC Data Portal and are only distributed via the GDC Legacy Archive.

In the Most Frequent Mutations table for the VEP impact score, which algorithm in the VEP is the GDC using to determine "H" or "M"?

The IMPACT rating is determined by the Sequence Ontology (SO) type of the variant. VEP provides this rating for compatibility with other variant annotation tools (e.g. snpEff). Each category is associated with a set of SO terms:

  • HIGH: The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay: transcript_ablation, splice_acceptor_variant, splice_donor_variant, stop_gained, frameshift_variant, stop_lost, start_lost, transcript_amplification
  • MODERATE: A non-disruptive variant that might change protein effectiveness: inframe_insertion, inframe_deletion, missense_variant, protein_altering_variant, regulatory_region_ablation
  • LOW: Assumed to be mostly harmless or unlikely to change protein behavior: splice_region_variant, incomplete_terminal_codon_variant, stop_retained_variant, synonymous_variant
  • MODIFIER: Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact: coding_sequence_variant, mature_miRNA_variant, 5_prime_UTR_variant, 3_prime_UTR_variant, non_coding_transcript_exon_variant, intron_variant, NMD_transcript_variant, non_coding_transcript_variant, upstream_gene_variant, downstream_gene_variant, TFBS_ablation, TFBS_amplification, TF_binding_site_variant, regulatory_region_amplification, feature_elongation, regulatory_region_variant, feature_truncation, intergenic_variant

Details about predicted data in variations are available at ENSEMBL.
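
The category lists above can be expressed as a simple lookup table. The Python sketch below copies the SO terms from the list in this answer; the names IMPACT_TERMS and impact_of are illustrative and are not part of VEP or the GDC API.

```python
# SO terms grouped by IMPACT category, as listed in this FAQ answer.
IMPACT_TERMS = {
    "HIGH": [
        "transcript_ablation", "splice_acceptor_variant", "splice_donor_variant",
        "stop_gained", "frameshift_variant", "stop_lost", "start_lost",
        "transcript_amplification",
    ],
    "MODERATE": [
        "inframe_insertion", "inframe_deletion", "missense_variant",
        "protein_altering_variant", "regulatory_region_ablation",
    ],
    "LOW": [
        "splice_region_variant", "incomplete_terminal_codon_variant",
        "stop_retained_variant", "synonymous_variant",
    ],
    "MODIFIER": [
        "coding_sequence_variant", "mature_miRNA_variant", "5_prime_UTR_variant",
        "3_prime_UTR_variant", "non_coding_transcript_exon_variant",
        "intron_variant", "NMD_transcript_variant",
        "non_coding_transcript_variant", "upstream_gene_variant",
        "downstream_gene_variant", "TFBS_ablation", "TFBS_amplification",
        "TF_binding_site_variant", "regulatory_region_amplification",
        "feature_elongation", "regulatory_region_variant",
        "feature_truncation", "intergenic_variant",
    ],
}

# Invert the table so each SO term maps to its IMPACT category.
SO_TERM_TO_IMPACT = {
    term: impact for impact, terms in IMPACT_TERMS.items() for term in terms
}


def impact_of(so_term):
    """Return the IMPACT category for an SO term, or None if it is not listed."""
    return SO_TERM_TO_IMPACT.get(so_term)
```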

Where can I download TCGA clinical data in Biotab format?

TCGA has distributed clinical data in two formats: BCR XML and Biotab. The GDC Data Portal distributes TCGA Clinical Data in BCR XML only. Biotab data is available in the GDC Legacy Archive.

How is survival analysis calculated?

Survival analysis in the GDC uses a Kaplan‑Meier estimator:

S(t) = S(t - 1) × (1 - d / n), with S(0) = 1

  • S(t ) is the estimated survival probability for any particular one of the t time periods
  • n is the number of subjects at risk at the beginning of time period t
  • d is the number of subjects who die during time period t
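
The Kaplan-Meier estimate can be sketched in a few lines of Python using the quantities defined above: at each time period the running survival probability is multiplied by (1 - d / n). This is a minimal illustration of the standard estimator, not the GDC's own implementation; the function name and the example numbers are hypothetical.

```python
def kaplan_meier(at_risk, deaths):
    """Return the estimated survival probability S(t) for each time period.

    at_risk[i] is n, the number of subjects at risk at the start of period i.
    deaths[i] is d, the number of subjects who die during period i.
    """
    if len(at_risk) != len(deaths):
        raise ValueError("at_risk and deaths must have the same length")
    survival = []
    s = 1.0  # everyone is alive before the first period
    for n, d in zip(at_risk, deaths):
        if n <= 0:
            raise ValueError("each period needs at least one subject at risk")
        s *= 1 - d / n  # product-limit update for this period
        survival.append(s)
    return survival


if __name__ == "__main__":
    # Hypothetical cohort: 10 subjects at risk, then 8, then 5.
    curve = kaplan_meier(at_risk=[10, 8, 5], deaths=[2, 1, 0])
    print([round(s, 3) for s in curve])
```

Subjects who are censored (lost to follow-up) simply drop out of the at-risk count for later periods without being counted as deaths.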

Please refer to the GDC Data Portal User's Guide Projects for additional information.

Where can I download TCGA protein array (RPPA) data?

TCGA RPPA data has not been harmonized by the GDC. It is available in the GDC Legacy Archive.

Can I use the GDC Application Programming Interface (API) to retrieve data sets associated with visualizations?

Yes. The GDC provides additional analysis endpoints to retrieve data sets associated with visualizations. Analysis endpoints include: survival, top_cases_counts_by_genes, top_mutated_genes_by_project, top_mutated_cases_by_gene, top_mutated_cases_by_ssm, and mutated_cases_count_by_project.

Please refer to the GDC API User's Guide Analysis Section for additional information.

Where can I download TCGA DNA methylation data?

DNA methylation data collected by the TCGA has not been harmonized by the GDC. It is available in the GDC Legacy Archive.

Where can I find information on the format of GDC MAF Files?

Please refer to the GDC MAF File Specification to obtain detailed information on the format of GDC MAF files.

How can I download TCGA archive files?

Archive files that were previously available from the TCGA Data Portal (TCGA DCC) can be downloaded from the GDC API using the Archive UUID that can be obtained from the GDC Legacy Archive. To do this:

  1. Find a file associated with the archive in the GDC Legacy Archive
  2. On the file detail page, click the link found in the Archive property
  3. On the page that shows a list of files associated with the archive, locate the archive UUID at the top of the page.
  4. Use the UUID to download the archive from the GDC API's data endpoint: , where UUID is the archive UUID.

On the GDC Project summary page or Exploration/Gene tab, why are the # of Mutations sometimes less than the # Affected Cases?

The “# Mutations” column in the Project or Exploration/Gene tab displays the number of distinct (unique) mutations within the affected cases and not necessarily the total number of all mutations within the project or query filter.
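
The difference between the counts is easiest to see with a toy example. In the Python sketch below the case and mutation identifiers are made up: three cases share the same mutation, so there are more affected cases than distinct mutations, while the total number of occurrences is larger still.

```python
def summarize(observations):
    """Summarize (case_id, mutation_id) pairs into the three kinds of counts."""
    affected_cases = {case for case, _ in observations}
    distinct_mutations = {mutation for _, mutation in observations}
    return {
        "affected_cases": len(affected_cases),
        "distinct_mutations": len(distinct_mutations),
        "total_occurrences": len(observations),
    }


# Hypothetical data: one recurrent mutation seen in three cases,
# plus a second mutation seen in one of those cases.
observations = [
    ("case-1", "mutation-A"),
    ("case-2", "mutation-A"),
    ("case-3", "mutation-A"),
    ("case-3", "mutation-B"),
]

if __name__ == "__main__":
    print(summarize(observations))
```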

Where can I find the target and bait/probe files (BED files) that describe the capture kit used in an exome sequencing experiment?

Capture kit information is provided by the GDC API at the read group level, where available. In some cases, additional information may be available in SRA XML files.

The relevant read_group properties returned by the GDC API are:

  1. target_capture_kit_name
  2. target_capture_kit_catalog_number
  3. target_capture_kit_vendor
  4. target_capture_kit_target_region

The target_capture_kit_target_region field provides a URL for the capture kit target file, distributed by the kit manufacturer or by the research program. Bait/probe files can sometimes be found at the same URL; or a URL to the bait/probe file may be available in the SRA XML file.

Note: Some BAM files include information from multiple read groups, and sometimes read groups produced with different capture kits are included in the same BAM file. Tools are available for splitting BAM files into read groups, e.g. bamutil.

Note: Target and bait/probe files may use an older reference genome, so liftover may be required for certain applications.
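
Once read group records have been retrieved from the API, the four properties listed above can be collected to check whether a BAM file mixes capture kits. The Python sketch below assumes the records are already available as a list of dictionaries keyed by those property names; the function name and the example records are hypothetical, and the exact shape of the API response should be checked against the GDC API User's Guide.

```python
CAPTURE_KIT_FIELDS = (
    "target_capture_kit_name",
    "target_capture_kit_catalog_number",
    "target_capture_kit_vendor",
    "target_capture_kit_target_region",
)


def distinct_capture_kits(read_groups):
    """Return the distinct capture kit descriptions found in read group records.

    Missing properties are reported as None, since capture kit information
    is only provided where available.
    """
    kits = []
    for read_group in read_groups:
        kit = {field: read_group.get(field) for field in CAPTURE_KIT_FIELDS}
        if kit not in kits:
            kits.append(kit)
    return kits


if __name__ == "__main__":
    # Hypothetical read groups from a single BAM file.
    example = [
        {"target_capture_kit_name": "Kit A", "target_capture_kit_vendor": "Vendor 1"},
        {"target_capture_kit_name": "Kit A", "target_capture_kit_vendor": "Vendor 1"},
        {"target_capture_kit_name": "Kit B", "target_capture_kit_vendor": "Vendor 2"},
    ]
    kits = distinct_capture_kits(example)
    print(f"{len(kits)} distinct capture kit(s) found")
```

If more than one kit is reported, splitting the BAM file by read group before further analysis may be appropriate.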

Why are the number of analyzed cases in the MAF header not equal to the number of cases displayed in the GDC Data Portal?

Within the GDC data analysis workflow, both public (somatic) MAFs and protected MAFs are generated by the same pipeline and link back to the same cases. For example, for the TCGA-GBM project, the somatic MAF has the following header:

# in TCGA.GBM.muse.7e85de23-3855-4279-a3ac-a81827e4ccb6.DR6.0.somatic.maf.gz
#version gdc-1.0.0
#filedate 20170307
#n.analyzed.samples 393

In general, n.analyzed.samples is used as a denominator to calculate mutation frequencies. A case that does not have variants is still counted; however, a case that is in the blacklist (for example because of high contamination or duplicates) is not counted in the public MAF. In this particular project (TCGA-GBM), there were 396 cases with SNV data. Our analysis pipeline revealed that among them, a total of 5 GBM tumor aliquots had high contamination. Among these 5 patients, 2 had another good tumor aliquot, but 3 had only one aliquot. As a result, those 3 cases were removed from the public MAF, leaving 393 analyzed samples.
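
The header value can be read programmatically. The following Python sketch parses the leading comment lines of a MAF file into a dictionary, using the header lines shown above as input; the function name is illustrative, and the parsing assumes each header line has the form "#key value".

```python
def parse_maf_header(lines):
    """Parse leading '#key value' lines of a MAF file into a dict of strings."""
    header = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        parts = line[1:].strip().split(None, 1)
        if len(parts) == 2:
            header[parts[0]] = parts[1]
    return header


# Header lines copied from the TCGA-GBM example in this answer.
EXAMPLE_HEADER = [
    "#version gdc-1.0.0",
    "#filedate 20170307",
    "#n.analyzed.samples 393",
]

if __name__ == "__main__":
    header = parse_maf_header(EXAMPLE_HEADER)
    analyzed = int(header["n.analyzed.samples"])
    # 396 cases with SNV data minus 3 blacklisted cases.
    print(analyzed, 396 - 3)
```

For a compressed MAF, the lines could be read with Python's gzip module in text mode before being passed to the function.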

What are the best practices for downloading molecular data files from the GDC?

The GDC Data Transfer Tool is the preferred method for downloading data files from the GDC. For multiple file downloads you may create a manifest within the GDC Data Portal on the shopping cart page, which you will then provide to the tool. Alternatively, for single file downloads you may supply individual file UUIDs.

You may also download files directly from the Data Portal and Legacy Archive. In this case we recommend adding the desired files to your shopping cart and then downloading from the cart page. It is not recommended to select multiple files for simultaneous download from the Files page. Web Browsers have hardwired limitations on the number of simultaneous downloads that are allowed. Exceeding this threshold may lead to reduced performance within the Data Portal or Legacy Archive.

Utilizing the API is also an acceptable method for downloading multiple files.

Are unmapped reads available in the GDC Data Portal?

Harmonized BAM files from RNA-seq and DNA-seq experiments will contain both mapped and unmapped reads, if available. Unmapped reads are not distributed separately.

Why does the GDC display common genes such as TTN that are associated with every cancer in the most frequently mutated genes table?

The GDC is not normalizing frequency by gene length. This is currently under discussion. As such, these genes are appearing in the mutated genes table. Users can filter by the COSMIC Cancer Gene Census to display only genes for which mutations have been causally implicated in cancer.

When logging in using Internet Explorer as a browser, a pop-up window appears asking for credentials. However, once I enter my credentials, the pop-up window goes blank and stays open. Am I logged in?

The issue is caused by Internet Explorer running in "Compatibility View" for NIH users. To avoid this issue: 1) Go to the Internet Explorer Tools menu, 2) Select 'Compatibility View settings', 3) Uncheck 'Display intranet sites in Compatibility View', 4) Select Close, 5) Refresh and log in again. The window should now close automatically.

Where can I find TCGA RNASeqV2 data?

RNASeqV2 data is available in the GDC Legacy Archive. To find these files, add a "Tags" custom File filter, and set its value to v2.

In the OncoGrid, why are there fewer cases than there are cases listed as having mutations?

The cases in the OncoGrid are filtered by consequence type. Only cases with mutations of the following consequence types are displayed in the OncoGrid: {missense_variant, frameshift_variant, start_lost, stop_lost, initiator_codon_variant, stop_gained}.
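
The effect of this filter can be sketched in Python. The consequence types are the ones listed in this answer; the function name and the example cases are hypothetical.

```python
# Consequence types that qualify a case for display, as listed in this answer.
DISPLAYED_CONSEQUENCES = {
    "missense_variant",
    "frameshift_variant",
    "start_lost",
    "stop_lost",
    "initiator_codon_variant",
    "stop_gained",
}


def displayed_cases(mutations):
    """Return the set of case IDs that have at least one qualifying mutation.

    mutations is an iterable of (case_id, consequence_type) pairs.
    """
    return {
        case_id
        for case_id, consequence in mutations
        if consequence in DISPLAYED_CONSEQUENCES
    }


if __name__ == "__main__":
    # Hypothetical data: case-2 only has an intronic mutation, so it is dropped.
    mutations = [
        ("case-1", "missense_variant"),
        ("case-2", "intron_variant"),
        ("case-3", "synonymous_variant"),
        ("case-3", "stop_gained"),
    ]
    print(sorted(displayed_cases(mutations)))
```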

I only see patients with ages of 90 years or less in the GDC. Why is this?

HIPAA guidelines require that patients with ages greater than 89 years be aggregated into a single age category. This is to limit the ability to positively identify these individuals. In practice this will impact the values reported in several fields. We have chosen to accurately display the age at diagnosis, but fields that give dates or time periods after this benchmark may be compressed. This may include such fields as "Days to last follow up", "Days to last known disease status", "Days to recurrence", "Days to death", and "Year of death". Other fields found in clinical supplement files may also be impacted.

In the Data Portal you will only see ages of less than or equal to 90. Individuals over 89 will all appear as 90 years old.

How can I access GDC sequencing data in FASTQ format?

Raw sequencing files submitted to the GDC are processed using GDC Genomic Data Alignment pipelines. The processed data are made available in the GDC Data Portal as BAM files containing aligned reads and unmapped reads (if available). No reads are hard-clipped, but reads that were flagged as "failed" during an Illumina sequencing run are discarded.

Third-party tools such as biobambam2 or Samtools fastq can convert these files to FASTQ sequencing data. Note that DNA-Seq quality scores are modified during the score recalibration co-cleaning step, so third-party tool parameters must be set to retrieve the original scores (biobambam2: tryoq=1; samtools fastq: -O). Because GDC harmonized BAM files may contain multiple read groups, the conversion parameter should be set to retain read group IDs in the generated FASTQ files (biobambam2: outputperreadgroup=1; samtools: samtools split).
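
The parameters mentioned above can be assembled into command lines as in the Python sketch below. It only builds argument lists and does not run anything. The tryoq=1, outputperreadgroup=1, and -O options come from this answer; the filename= and outputdir= keys for biobambam2's bamtofastq, the -1/-2 output options for samtools fastq, and all file names are assumptions that should be checked against each tool's own documentation.

```python
def bamtofastq_command(bam_path, output_dir):
    """Build a biobambam2 bamtofastq command that restores original qualities
    and writes one set of FASTQ files per read group."""
    return [
        "bamtofastq",
        f"filename={bam_path}",      # assumed input key
        "tryoq=1",                   # use original quality scores if present
        "outputperreadgroup=1",      # keep read groups separate
        f"outputdir={output_dir}",   # assumed output key
    ]


def samtools_fastq_command(bam_path, read1_path, read2_path):
    """Build a samtools fastq command that prefers original quality scores."""
    return [
        "samtools", "fastq",
        "-O",               # use original quality scores if present
        "-1", read1_path,   # assumed: first-in-pair output file
        "-2", read2_path,   # assumed: second-in-pair output file
        bam_path,
    ]


if __name__ == "__main__":
    print(" ".join(bamtofastq_command("sample.bam", "fastq_out")))
    print(" ".join(samtools_fastq_command("rg1.bam", "rg1_R1.fq", "rg1_R2.fq")))
```

With Samtools, the BAM file would first be split by read group using samtools split, and each resulting file converted separately.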

Certain RNA-Seq FASTQ files are available in the GDC Legacy Archive as compressed TAR or TAR.GZ archives.

How do I search for a particular mutation?

To search for a mutation, use the Quick Search bar at the top right of the GDC Portal and enter either a dbSNP reference cluster ID (rs#) or the coordinates of the chromosomal change. For example, entering 'rs121912651' or 'chr17:g.7674221G>A' will bring the user to the mutation entity page for that mutation.
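
For users who prepare search terms in a script, the two accepted forms can be told apart with regular expressions. The Python sketch below is only an illustration based on the two example queries in this answer: the patterns cover single-nucleotide substitutions only, and they are not the validation rules used by the GDC Portal.

```python
import re

# dbSNP reference cluster ID, e.g. rs121912651
RSID_PATTERN = re.compile(r"^rs\d+$")
# Genomic single-nucleotide change, e.g. chr17:g.7674221G>A
SNV_PATTERN = re.compile(r"^(chr(?:\d{1,2}|X|Y|M)):g\.(\d+)([ACGT])>([ACGT])$")


def classify_query(query):
    """Classify a search term as a dbSNP ID, a genomic SNV, or unknown."""
    query = query.strip()
    if RSID_PATTERN.match(query):
        return {"type": "dbsnp_rsid", "rsid": query}
    match = SNV_PATTERN.match(query)
    if match:
        chromosome, position, reference, alternate = match.groups()
        return {
            "type": "genomic_snv",
            "chromosome": chromosome,
            "position": int(position),
            "reference": reference,
            "alternate": alternate,
        }
    return {"type": "unknown"}


if __name__ == "__main__":
    print(classify_query("rs121912651"))
    print(classify_query("chr17:g.7674221G>A"))
```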

What web browsers are recommended by the GDC?

The following web browsers are recommended for use with the GDC Data Portal, Submission Portal, Legacy Archive, Website, and Documentation site:

* Most recent supported stable version of Microsoft Internet Explorer on Microsoft Windows
* Most recent stable version of Google Chrome
* Most recent stable version of Mozilla Firefox

Where can I find hg18 and hg19 GAF files for legacy TCGA data?

These files are available on the GDC Reference Files page.

Why are there fewer cases in the 'Top Mutated Cancer Genes in Selected Projects' bar graph on the Project List Page than there are affected cases listed on each project page?

Fewer cases are displayed in the 'Top Mutated Cancer Genes in Selected Projects' graph on the Project List Page because it only includes cases that have mutations 1) in genes in the Cancer Gene Census and 2) with consequence types of {missense_variant, frameshift_variant, start_lost, stop_lost, initiator_codon_variant, stop_gained}.

How do I obtain access to a specific controlled dataset?

The GDC provides access to both open and controlled datasets. To access controlled datasets, users must obtain appropriate authorization through dbGaP. See Obtaining Access to Controlled Data for instructions on applying for access through dbGaP.

How do I search for files by CGHub analysis ID?

The CGHub analysis ID is stored in the submitter id field. Users can search by analysis ID using the "Search for File ID" search box in the Files tab on the left-hand side of the GDC Legacy Archive.

Why is the mutation frequency value higher (more mutations) in the OncoGrid than the # of Mutations listed on other pages for the same gene?

Mutation frequency in the context of the OncoGrid represents total mutation occurrences in the gene (total count), while the # of Mutations listed on other portions of the GDC Portal represents the number of unique mutations on a gene or within a particular cohort.

How do I avoid timeouts and transfer interruptions when downloading large datasets from the GDC Data Portal?

The GDC Data Portal is a web-based application that is limited by browser and network constraints. If a system timeout occurs when downloading files, please use the GDC Data Transfer Tool or contact the GDC Help Desk.

Why might variants found in TCGA-generated MAFs be missing from the GDC open access MAF files?

Some of the reasons particular mutations may have been removed include updates to third party databases, more conservative germline-masking rules by the GDC, and different mutation calling pipelines and versions. Despite these differences, the GDC recaptures over 97% of TCGA-validated variants in the controlled-access MAF files. The GDC suggests using controlled-access MAF files if important variants cannot be found in somatic MAF files.

Why do the metadata files I am trying to submit fail to validate?

The GDC Data Submission Portal checks XML, JSON, and TSV metadata files for validity at the time they are submitted. If your files fail to validate, please check the error report and review the GDC Data Dictionary for troubleshooting these errors. Additional information on supported files and formats can be found on the GDC Data Model and File Formats pages, and in the GDC Data Submission Portal User's Guide.

What is the difference between files that have the same filename?

The file detail page and the metadata files accessible from that page (if available) can be used to determine the difference between files that share the same filename. For example, the files may be associated with different aliquots, different patients, or (for files in the GDC Legacy Archive) different archives. For some projects, the GDC Legacy Archive may contain multiple BAM files from the same aliquot, with the later version containing additional read groups as described in the associated Run Metadata file.

Will the "cghub.key" authentication token I used for CGHub work for the GDC?

CGHub tokens cannot be used in the GDC. To perform functions for which authorization is required, users must authenticate using a token generated by the GDC Data Portal or the GDC Data Submission Portal. Users can obtain authentication tokens only after receiving appropriate authorization via dbGaP. The authorization credential is not required to query the metadata; it is only required for downloading protected data.

Note: Unlike CGHub, the GDC does not require a URL to the public key for open access data.

How can I download BAM index files (BAI files) using the API?

BAI files are included with the download when using the GDC Data Transfer Tool to download BAM files.

When using the API to download BAM files, BAI files will only be included if the related_files=true parameter is specified together with the BAM UUID.
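
A request of this kind can be sketched in Python as follows. The sketch only builds the URL; the base address of the data endpoint and the placeholder UUID are assumptions and should be checked against the GDC API User's Guide. Only the related_files=true parameter comes from this answer.

```python
from urllib.parse import urlencode

# Assumed base URL of the GDC API data endpoint; verify before use.
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def bam_download_url(bam_uuid, include_related=True):
    """Build a data endpoint URL for a BAM file, optionally requesting
    related files (such as the BAI index) in the same download."""
    url = f"{GDC_DATA_ENDPOINT}/{bam_uuid}"
    if include_related:
        url += "?" + urlencode({"related_files": "true"})
    return url


if __name__ == "__main__":
    # Placeholder UUID for illustration only.
    print(bam_download_url("00000000-0000-0000-0000-000000000000"))
```

Controlled-access BAM files additionally require a valid GDC authentication token to be supplied with the request.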

Alternatively, users can determine the BAI file UUID from the API files endpoint by supplying the BAM UUID. The BAI file UUID can then be used to download the BAI file from the data endpoint.

Note: BAI files are not available for sliced BAM files.

Does the GDC have a command line query tool analogous to CGHub’s cgquery?

The GDC does not have a standalone command line query tool like cgquery. The GDC supports search via the GDC REST API and the GDC Data Portal. Users can call the GDC API directly through a browser or through third-party tools such as curl, wget, HTTPie, Postman, and DHC REST Client, or they can develop custom GUIs or command line tools for communicating with the GDC API. In fact, the entire GDC Data Portal user interface runs on the underlying GDC API, which uses standard HTTP methods like GET, PUT, POST, and DELETE and uses JSON as its communication format.

Once you have queried data that you are interested in downloading, you may use the GDC UUIDs or generate a manifest and use the GDC API’s download endpoint or the GDC Data Transfer Tool to download the data. To learn more about the GDC API, please see The GDC Application Programming Interface (API): An Overview.
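
As a rough idea of what a custom command line query might look like, the Python sketch below builds a search URL for the files endpoint with a JSON filter. The endpoint address, the filter syntax, and the field names (cases.project.project_id, data_format, file_id, file_name) are assumptions based on common GDC API usage and should be confirmed in the GDC API User's Guide; the function name is hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed base URL of the GDC API files endpoint; verify before use.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query_url(project_id, data_format, size=10):
    """Build a files endpoint URL that filters by project and data format."""
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id",
                                     "value": [project_id]}},
            {"op": "in", "content": {"field": "data_format",
                                     "value": [data_format]}},
        ],
    }
    params = {
        "filters": json.dumps(filters),   # JSON filter, URL-encoded below
        "fields": "file_id,file_name",    # fields to return for each hit
        "size": str(size),                # maximum number of results
    }
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_files_query_url("TCGA-GBM", "BAM", size=5))
```

The resulting URL can be passed to a tool such as curl, and the file UUIDs in the response can then be used with the download endpoint or the GDC Data Transfer Tool.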

From what data are mutation-based visualization features derived?

The mutation-based visualization features are derived from open-access MAF files produced by GDC variant-calling pipelines.

What is the difference between CGHub and the GDC?

CGHub has served as the repository and distribution platform for molecular-level genomic data from The Cancer Genome Atlas (TCGA) and related projects.

The Genomic Data Commons (GDC) is an expandable knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs. The GDC harmonizes data across projects using a common set of bioinformatics pipelines, so that the data can be directly compared. The GDC takes over the responsibility for storing and distributing data that has previously been stored in CGHub. The GDC will distribute this data in conjunction with clinical and biospecimen metadata previously available in the TCGA Data Portal.

Why are there some projects without data analysis and visualization features?

Data analysis and visualization features are only available for projects which maintain open-access MAF files. Programs such as TARGET maintain controlled-access MAF files only. As such, data analysis and visualization cannot be applied to TARGET projects.