GDC FAQs

  • What is the NCI Genomic Data Commons (GDC)?

    The NCI Genomic Data Commons (GDC) is the next-generation repository and cancer knowledge base supporting the import and standardization of genomic and clinical data from cancer research programs (e.g., TCGA, TARGET, CGCI), the harmonization of sequence data to the genome / transcriptome, and the application of state-of-the-art methods for derived data (e.g., mutation calls and structural variants).

    The NCI, part of the National Institutes of Health (NIH) and the U.S. Department of Health and Human Services (DHHS), established the GDC to provide the cancer research community with a data service supporting the receipt, quality control, integration, storage, and redistribution of standardized cancer genomic data sets derived from cancer studies. Please visit About the GDC for additional information.

  • What are the goals of the GDC?

    The primary goal of the GDC is to provide the cancer research community with a unified repository and cancer knowledge base supporting cancer genomic studies. The cancer knowledge base enables the identification of low-frequency cancer drivers, assists in defining genomic determinants of response to therapy, and informs the composition of clinical trial cohorts sharing targeted genetic lesions. Working towards this goal, the GDC provides resources supporting the receipt, quality control, integration, storage, and redistribution of standardized cancer genomic data sets derived from cancer studies. Please visit the GDC Overview for additional information.

  • How can I collaborate with the GDC?

    The GDC welcomes collaborations with organizations conducting research in or providing informatics supporting cancer genomics. Organizations interested in collaborating with the GDC should contact GDC Support.

  • Are there restrictions on the use of GDC data in publications?

    No. All GDC data can be used in publications or presentations. For additional questions about the use of GDC data, or to explore opportunities for collaboration, please contact GDC Support.

  • How do I cite the NCI GDC?

    Please credit the NCI Genomic Data Commons (GDC) in your manuscript by citing the following paper about the GDC:

    Heath, A.P., Ferretti, V., Agrawal, S. et al. The NCI Genomic Data Commons. Nat Genet 53, 257-262 (2021). https://doi.org/10.1038/s41588-021-00791-5 

    When citing individual projects, please refer to the attribution policies of the project when available.

  • What reference genome is the GDC harmonized against?

    The GDC is harmonized against GRCh38. Please see GDC Data Harmonization for additional information on the GDC pipelines for re-aligning genomic data.

  • How does the GDC generate high level data?

    The GDC generates high level data for germline and somatic genotyping, RNA-Seq quantification and structural analysis, SNP Array Genotyping and CNV Calls, and variant annotations. Please visit GDC Data Harmonization for additional information on the GDC high level data generation pipelines.

  • How do I obtain an account to log in to the GDC?

    Generally, browsing indexed GDC metadata (such as information about the cases and files contained in the GDC Data Portal) does not require a login.

    eRA Commons authentication and dbGaP authorization are required before accessing controlled data, which generally includes individually identifiable information such as low level genomic sequencing data and germline variants.

    Controlled-access data users log in to the GDC using their eRA Commons accounts. The GDC then verifies that the user has authorization in dbGaP to access specific controlled datasets.

    See Obtaining Access to GDC Data and Resources for more information on data download, and Obtaining Access to Submit Data for information on data submission.

  • Where do I go to report an issue or submit an inquiry about the GDC?

    The GDC provides helpdesk support for data submission and other issues. For information on the GDC helpdesk, please visit GDC Support.

  • When is GDC maintenance performed?

    The GDC maintenance window occurs semi-monthly on Saturdays, from 8:00 am to 4:00 pm CST (9:00 am to 5:00 pm EST).

  • How is validation performed on genomic data (BAM files) submitted to the GDC?

    The GDC validates genomic data (BAM files) using FastQC and Picard. For additional details, please visit GDC Data Harmonization.

  • Why does the GDC have data releases and how often do they happen?

    Recurring data releases allow the GDC to version the data and allow users to reference the GDC version number in publications. The GDC currently generates releases as needed, with a goal of one release every 2-3 months.

  • I only see patients with ages of 90 years or less in the GDC. Why is this?

    HIPAA guidelines require that patients with ages greater than 89 years be aggregated into a single age category. This is to limit the ability to positively identify these individuals. In practice this will impact the values reported in several fields. We have chosen to accurately display the age at diagnosis, but fields that give dates or time periods after this benchmark may be compressed. This may include such fields as "Days to last follow up", "Days to last known disease status", "Days to recurrence", "Days to death", and "Year of death". Other fields found in clinical supplement files may also be impacted. In the Data Portal you will only see ages of less than or equal to 90. Individuals over 89 will all appear as 90 years old.
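
    The age aggregation described above can be sketched as a simple top-coding rule. This is only an illustration of the rule as stated in this answer, not the GDC's actual implementation; the function name `displayed_age` is hypothetical, and ages are assumed to be given in whole years.

```python
def displayed_age(age_years):
    """Return the age as it would appear after aggregation.

    Assumption (from the FAQ text): every individual older than 89
    is reported as 90; all other ages are shown unchanged.
    """
    return 90 if age_years > 89 else age_years


# Ages above 89 collapse into the single category "90".
ages = [45, 89, 90, 97]
print([displayed_age(a) for a in ages])
```

    Because all ages above 89 collapse into one value, the original age cannot be recovered from the displayed one.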

  • What web browsers are supported by the GDC?

    The following web browsers are supported for use with the GDC Data Portal, Submission Portal, Website, and Documentation site:

    • Most recent supported stable version of Microsoft Edge
    • Most recent stable version of Google Chrome
    • Most recent stable version of Mozilla Firefox

  • Why does the GDC use unstranded TPM when the library is stranded?

    The GDC RNA-Seq workflow generates STAR counts in three different modes: unstranded, stranded_first, and stranded_second. The GDC then uses the unstranded counts as the primary output and as the basis for the subsequent FPKM and TPM normalizations, which facilitates cross-project comparisons of data with different strandedness.
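
    As a rough illustration of what normalizing raw counts involves, the sketch below applies the common textbook definitions of TPM and FPKM to a set of counts. It is not the GDC pipeline itself, whose exact formulas and read-count denominators are described in GDC Data Harmonization. The gene names, counts, and gene lengths are made up, and the function names `tpm` and `fpkm` are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize, then scale to 1e6."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    if total_rate == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


def fpkm(counts, lengths_bp):
    """Fragments per kilobase of gene per million counted fragments.

    Assumption: the denominator is the sum of all counts passed in.
    """
    total_counts = sum(counts.values())
    if total_counts == 0:
        return {gene: 0.0 for gene in counts}
    return {
        gene: counts[gene] * 1e9 / (lengths_bp[gene] * total_counts)
        for gene in counts
    }


# Hypothetical unstranded counts and gene lengths (in base pairs).
unstranded_counts = {"GENE_A": 100, "GENE_B": 300}
gene_lengths = {"GENE_A": 1000, "GENE_B": 3000}

print(tpm(unstranded_counts, gene_lengths))
print(fpkm(unstranded_counts, gene_lengths))
```

    Both genes have the same count per base here, so they receive equal normalized values even though their raw counts differ.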

  • How can I know if the RNA-Seq data is stranded or unstranded?

    If you are interested in using the stranded data in the STAR Count gene expression output, you can make an educated guess by comparing the N_ambiguous values: if one stranded type has a much lower N_ambiguous value than both the other stranded type and the unstranded count, this is a good indicator that a stranded library was used. Please note that knowing a library was prepared with a strand-specific RNA-Seq kit does not guarantee that the resulting library is stranded. In addition, data of different strandedness cannot be compared to each other directly.
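
    The comparison described above can be sketched as a small heuristic. This is an assumption-laden illustration rather than a GDC tool: the function name `guess_strandedness` is hypothetical, and the `ratio` threshold of 0.5 is an arbitrary stand-in for "much lower" that you may want to adjust. The input is the N_ambiguous value for each of the three counting modes.

```python
def guess_strandedness(n_ambiguous, ratio=0.5):
    """Guess library strandedness from N_ambiguous values.

    n_ambiguous: dict with keys "unstranded", "stranded_first",
                 and "stranded_second".
    ratio: hypothetical threshold; a stranded mode counts as "much
           lower" if its value is below ratio times both the other
           stranded mode and the unstranded mode.
    """
    unstranded = n_ambiguous["unstranded"]
    first = n_ambiguous["stranded_first"]
    second = n_ambiguous["stranded_second"]

    candidates = (
        ("stranded_first", first, second),
        ("stranded_second", second, first),
    )
    for name, value, other in candidates:
        if value < ratio * other and value < ratio * unstranded:
            return name
    # No mode stands out: the library may be unstranded, or the
    # evidence is too weak to decide.
    return "unstranded_or_unclear"


# Made-up N_ambiguous values for one sample.
example = {
    "unstranded": 1_000_000,
    "stranded_first": 900_000,
    "stranded_second": 100_000,
}
print(guess_strandedness(example))
```

    A result from this heuristic is only a guess and should be checked against what is known about the library preparation.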