About the Data

Cancer is fundamentally a disease of the genome, caused by changes in the DNA, RNA, and proteins of a cell that push cell growth into overdrive. Identifying the genomic alterations that arise in cancer can help researchers decode how cancer develops and improve upon the diagnosis and treatment of cancers based on their distinct molecular abnormalities.

Data made available through the GDC are for research purposes only. The GDC provides researchers with access to standardized clinical, proteomic, epigenomic, and genomic data from cancer studies to enable exploratory analysis; the results of such analysis cannot be considered definitive for outcomes.

The GDC assists researchers in exploratory analysis by identifying changes in cancer cells that may play an important role in cancer development. Through the GDC knowledge base, researchers can use data maintained in the GDC to help identify both high- and low-frequency cancer drivers such as:

  • Mutations - The GDC provides access to DNA sequence data and generates associated Variant Call Format (VCF) and Mutation Annotation Format (MAF) files that identify somatic mutations such as point mutations, missense mutations, nonsense mutations, and insertions and deletions (indels) of nucleotides in the DNA.
  • Copy Number Variants - The GDC provides access to Copy Number Variation (CNV) data to identify amplified and attenuated gene expression due to chromosomal duplications, losses, insertions, and deletions.
  • Expression Quantification - The GDC provides access to mRNA and miRNA sequence data and quantifies gene and miRNA expression using standardized software pipelines; expression values are provided in simple tab-separated value format.
  • Post-transcriptional Modifications - The GDC provides access to mRNA sequence data to assist in identifying post-transcriptional splice modifications that are manifested as splice junction and isoform variants.
  • Structural Variants - The GDC provides access to Structural Variant data to identify genomic rearrangement events, such as fusions, duplications, truncations, large deletions, and others.
  • DNA Methylation - The GDC provides access to DNA CpG Methylation data to identify epigenomic modifications on the DNA.
  • Protein Expression - The GDC provides access to Protein Expression data to identify changes in protein expression and/or post-translational modifications.
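Because expression values are provided in a simple tab-separated format, they can be loaded with standard tooling. The following is a minimal sketch only: the column names `gene_id` and `expression`, the `#` comment convention, and the sample rows are assumptions made for illustration, not the layout of any particular GDC file.

```python
import csv
import io


def read_expression_tsv(handle, id_column="gene_id", value_column="expression"):
    """Return {identifier: value} from a tab-separated expression table.

    The column names are hypothetical defaults; pass the real header names
    of the file being read.
    """
    # Skip comment lines (assumed here to start with '#') before the header row.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    values = {}
    for row in reader:
        values[row[id_column]] = float(row[value_column])
    return values


# Made-up rows standing in for a downloaded expression file.
example = io.StringIO(
    "# hypothetical comment line\n"
    "gene_id\texpression\n"
    "GENE_A\t12.5\n"
    "GENE_B\t0.0\n"
)
expression = read_expression_tsv(example)
print(expression)
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")` and adjust the column names to match the file's header.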

Data and metadata are submitted to the GDC in standard data types and file formats through the GDC Data Submission Pipeline. Molecular data stored in the GDC are harmonized against a common reference genome.

The GDC provides access to high-quality datasets from NCI-supported programs and recommends guidelines for organizations providing datasets from non-NCI supported programs. High-quality datasets are achieved by:

Maintaining High-Quality Tissue Samples

The GDC obtains datasets from NCI programs that ensure high quality by accepting only tissues that have extensive information on their sources and have undergone stringent material processing quality control. The GDC obtains datasets from other NCI-supported programs that adhere to high quality standards for handling biospecimens and associated analytes, and recommends that non-NCI supported programs submitting data to the GDC follow Biospecimen Best Practices. Organizations wishing to submit data to the GDC should visit the NCI Biorepositories and Biospecimen Research Branch (BBRB) site for recommended tissue collection strategies for generating high-quality datasets.

The GDC maintains standardized biospecimen information on the tissues and samples used in each study. Information on the tissue sample, portion, analyte, and/or aliquot is extracted from submitted data, maintained in the GDC data model, and made accessible via the GDC Data Portal.

Implementing Data Validation Procedures

Data validation is performed on both data imported into the GDC from existing NCI programs and data submitted to the GDC through the GDC Data Submission Pipeline. Submitted data are not distributed by the GDC unless they pass GDC validation.

Validation of Data Imported from Existing NCI Programs

Validation of data imported into the GDC from existing NCI programs includes:

  • Verification of MD5 Checksum - Comparison of the provided MD5 Checksum with the MD5 Checksum of the file downloaded to the GDC.
  • Validation of Data References - Cross-check of existing Universally Unique Identifiers (UUIDs) or barcodes with the primary source.
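The MD5 comparison described above can be sketched with the Python standard library. This is an illustration of the general technique, not the GDC's own implementation; the function names `md5_of_file` and `checksum_matches` and the temporary demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never held in memory whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, provided_md5):
    """Compare the provided checksum with the one computed from the file."""
    return md5_of_file(path) == provided_md5.strip().lower()


# Demo on a throwaway file; the payload is made up for illustration.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
    demo_path = tmp.name

provided = hashlib.md5(b"example payload").hexdigest()
ok = checksum_matches(demo_path, provided)   # checksum of the same bytes
bad = checksum_matches(demo_path, "0" * 32)  # deliberately wrong checksum
os.remove(demo_path)
print(ok, bad)
```

A mismatch indicates that the received file differs from the file the provider checksummed, for example because of an incomplete transfer.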

Validation of Data Submitted to the GDC

For data submitted to the GDC through the GDC Data Submission Pipeline, the following data validation checks are performed:

  • Verification of MD5 Checksum - Comparison of the provided MD5 Checksum with the MD5 Checksum of the file submitted to the GDC.
  • Data Format Validation - Validation of biospecimen, clinical, and molecular data against the standard GDC Data Types and File Formats.
  • Validation of Data References - Cross-check of existing Universally Unique Identifiers (UUIDs) or barcodes across Biospecimen Data, Clinical Data, and Molecular Data.
  • Molecular Data Validation - Validation of molecular data using GDC Data Harmonization software and algorithms.
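As a rough illustration of the reference cross-check in the list above, the sketch below flags references whose UUID is malformed or does not match any known record. The record labels, the `find_unresolved_references` helper, and the in-memory data are assumptions for the example; they do not describe the GDC's actual validation code or data model.

```python
import uuid


def is_valid_uuid(text):
    """Return True if text is a UUID in canonical hyphenated form."""
    try:
        return str(uuid.UUID(text)) == text.lower()
    except ValueError:
        return False


def find_unresolved_references(known_ids, references):
    """Return (source, target, reason) for every reference that fails.

    known_ids:  set of UUID strings for records that exist.
    references: iterable of (source_label, target_uuid) pairs.
    """
    problems = []
    for source, target in references:
        if not is_valid_uuid(target):
            problems.append((source, target, "malformed UUID"))
        elif target.lower() not in known_ids:
            problems.append((source, target, "unknown UUID"))
    return problems


# Hypothetical records: one case and one sample that are known to exist.
case_id = str(uuid.uuid4())
sample_id = str(uuid.uuid4())
known = {case_id, sample_id}

references = [
    ("clinical_record", case_id),           # resolves
    ("molecular_file", sample_id),          # resolves
    ("molecular_file", "not-a-uuid"),       # malformed
    ("molecular_file", str(uuid.uuid4())),  # well-formed but unknown
]
problems = find_unresolved_references(known, references)
for source, target, reason in problems:
    print(source, target, reason)
```

The same pattern would extend to barcodes by swapping `is_valid_uuid` for a check suited to the barcode format in use.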

Ensuring Reliable Harmonized and Derived Data Production

The GDC uses submitted genomic sequence data to create derived data products such as somatic DNA mutations, tumor gene expression, and per-gene copy number variation. Bioinformatics pipelines described in GDC Data Harmonization are developed with ongoing input from recognized experts in the cancer genomics community. Pipelines are implemented using techniques to make them reproducible, interoperable across multiple platforms, and shareable with any interested member of the community. The GDC welcomes all input and suggestions on its bioinformatics pipelines, and is committed to continuously updating pipelines as the field advances, retiring old tools and incorporating new ones as the state of the art changes.