
GDC Data Quality

The GDC provides access to high-quality datasets from NCI-supported programs and recommends biospecimen best practices to organizations generating datasets from non-NCI-supported programs. Dataset quality is maintained through GDC processes and tools for validating submitted data and for data harmonization, which includes aligning data to a common reference genome and generating higher-level data.

High-Quality Tissue Samples

The GDC obtains datasets from NCI programs that ensure high quality by accepting only tissues that have extensive information on their sources and have undergone stringent material processing quality control. The GDC also obtains datasets from other NCI-supported programs that adhere to high-quality standards for handling biospecimens and associated analytes, and recommends that non-NCI-supported programs submitting data to the GDC follow Biospecimen Best Practices. Organizations wishing to submit data to the GDC should visit the NCI Biorepositories and Biospecimen Research Branch (BBRB) site for recommended tissue collection strategies for generating high-quality datasets.

The GDC maintains standardized biospecimen information on the tissues and samples used in each study. Information on the tissue sample, portion, analyte, and/or aliquot is extracted from submitted data, maintained in the GDC data model, and made accessible via the GDC Data Portal.

Submitted Data Validation 

Data validation is performed on data submitted to the GDC through the GDC Data Submission Processes and Tools. Submitted data is not distributed by the GDC unless it passes the following GDC data validation checks. 

  • Verification of MD5 Checksum - Comparison of the provided MD5 Checksum with the MD5 Checksum of the file submitted to the GDC. 
  • Data Type & Format Validation - Validation of biospecimen, clinical, and genomic data against the standard GDC Data Types and File Formats and the GDC Data Dictionary.
  • Validation of Data References - Cross check of existing Universally Unique Identifiers (UUIDs) or barcodes across biospecimen, clinical, and genomic data. 
  • Data Integrity Checks - Validation of clinical, biospecimen, and genomic data using 100+ GDC data integrity checks. Example checks for each data type are listed below.
Clinical
  • Primary site and disease type are consistent for a case 
  • Year of diagnosis is greater than the year of birth, and less than the year of death 
  • If the index represents diagnosis, the days to diagnosis should be zero 
  • Days to birth and days to death are calculated accurately from the index "days to" values 
  • Follow-up "days to" values are greater than the index "days to" values 
  • ...
Biospecimen
  • If associated with multiple read groups, molecular files should be associated with one aliquot 
  • Read groups should have the same experimental strategy as their associated molecular data 
  • The read group requires a read group name 
  • In addition to the required read group properties, the flow cell barcode, lane number, and multiplex barcode should be provided, as this information is used to construct a platform unit (PU), which is a universally unique identifier that can be used to model various sequencing technical artifacts (see the SAM specification)
  • For projects with library strategies of targeted sequencing or WXS, information on the target capture protocol (a target capture kit) is required 
  • ...
Genomic
  • If a read group is associated with a BAM file, the @RG ID should be present in the BAM header as the read group name. This is important for the harmonization process and will reduce the possibility of errors.
  • Submitted coverage must be consistent with coverage generated by the GDC 
  • Single cell RNA-Seq data files must follow a consistent filename format 
  • ...
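
Two of the checks listed above can be sketched in a few lines of Python. This is only an illustration of the ideas, not the GDC's validation code: the function names and the record keys (`year_of_birth`, `year_of_diagnosis`, `year_of_death`) are assumptions made for this example, and the year comparison is strict only because the check is worded that way above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, provided_md5):
    """Compare a submitter-provided MD5 string with the file's actual MD5."""
    return md5_of_file(path) == provided_md5.strip().lower()


def clinical_year_errors(record):
    """Check year ordering for one case.

    `record` is a hypothetical dict; missing or None values are skipped.
    """
    errors = []
    birth = record.get("year_of_birth")
    diagnosis = record.get("year_of_diagnosis")
    death = record.get("year_of_death")
    if birth is not None and diagnosis is not None and diagnosis <= birth:
        errors.append("year of diagnosis is not greater than year of birth")
    if death is not None and diagnosis is not None and diagnosis >= death:
        errors.append("year of diagnosis is not less than year of death")
    return errors
```

A submission tool could call `checksum_matches` once per uploaded file and `clinical_year_errors` once per case, collecting the returned messages into a validation report.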

GDC Data Harmonization

The GDC uses submitted genomic sequence data to create derived data products such as somatic DNA mutations, gene expression, and copy number variations. Validation of genomic data is performed using GDC Data Harmonization software and algorithms. Bioinformatics workflows are developed with ongoing input from recognized experts in the cancer genomics community. Workflows are implemented using techniques to make them reproducible, interoperable across multiple platforms, and shareable with any interested member of the community. GDC workflows are described in detail on the GDC Documentation Site and made available in the GDC GitHub Repository. Quality control checks are performed in GDC workflows and the GDC adds various summary metrics to the aligned reads for query by the user. For a complete list of the summary metrics as well as the tools used to generate them please visit the Data Dictionary Viewer.

For each workflow, the tools, quality control checks, and quality control metrics are listed below.
DNA Alignment & Somatic Variant Calling
Tools: BWA, Picard Tools, GATK, MuSE, MuTect2, VarScan2, Pindel, CaVEMan, Strelka2, SvABA
Quality Control Checks:
  • Mapping Quality
  • Base Quality Score Recalibration
  • Duplicate Marking
  • Strand Bias Filtering
  • Oxidation Damage Filtering
  • Germline Variant Filtering
  • Local-realignment Variant Normalization, Multi-caller Ensemble
  • Data Validation & Integrity Checks
Quality Control Metrics:
  • Total Reads
  • Mapped Reads
  • Duplicated Reads
  • Mismatched Bases
  • Average Insertion Size
  • Average Read Length
  • Average Base Quality
  • Mean Coverage
  • Proportion Coverage at 10X
  • Proportion Coverage at 30X
  • Proportion Target without Coverage
  • Read Pairs on Different Chromosomes
  • Cross-sample Contamination
  • Estimation Error of Cross-sample Contamination
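
To make the coverage metrics in the DNA workflow concrete, the sketch below computes them from per-base read depths. The GDC derives its metrics with the tools named above, so this is not its implementation. The input layout (one list of depths per target interval), the function name, the reading of "at 10X" as a depth of at least 10, and counting a target as uncovered only when every base has zero depth are all assumptions made for illustration.

```python
def coverage_metrics(targets):
    """Summarize coverage over target intervals.

    `targets` is a list of lists; each inner list holds the per-base read
    depths of one target interval (a hypothetical input layout).
    """
    if not targets or any(len(t) == 0 for t in targets):
        raise ValueError("every target must contain at least one base")

    depths = [d for target in targets for d in target]
    n_bases = len(depths)
    return {
        "mean_coverage": sum(depths) / n_bases,
        # Assumption: "at 10X" means a base is covered by at least 10 reads.
        "proportion_coverage_10x": sum(d >= 10 for d in depths) / n_bases,
        "proportion_coverage_30x": sum(d >= 30 for d in depths) / n_bases,
        # Assumption: a target lacks coverage if no base in it has any reads.
        "proportion_targets_no_coverage":
            sum(max(t) == 0 for t in targets) / len(targets),
    }


if __name__ == "__main__":
    example = [[0, 0, 0], [12, 35, 40], [5, 10, 30]]
    for name, value in coverage_metrics(example).items():
        print(f"{name}: {value:.3f}")
```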
RNA Alignment, Expression, and Gene Fusion Analysis
Tools:
  • RNA-Seq: STAR, STAR Fusion, Arriba
  • scRNA-Seq: Seurat, CellRanger, STAR Solo
Quality Control Checks:
  • Mapping Quality
  • Normalization
  • Differential Gene Expression Analysis
  • PCA & Embedding
  • Data Validation & Integrity Checks
Quality Control Metrics:
  • Total Reads
  • Mapped Reads
  • Duplicated Reads
  • Mismatched Bases
  • Average Insertion Size
  • Average Read Length
  • Average Base Quality
  • Read Pairs on Different Chromosomes
miRNA Alignment & Expression Analysis
Tools: BWA, BCGSC miRNA Profiling
Quality Control Checks:
  • Mapping Quality
  • Normalization
  • Data Validation & Integrity Checks
Quality Control Metrics:
  • Total Reads
  • Mapped Reads
  • Duplicated Reads
  • Mismatched Bases
  • Average Insertion Size
  • Average Read Length
  • Average Base Quality
  • Read Pairs on Different Chromosomes
Copy Number Variation Analysis
Tools: ASCAT, ABSOLUTE, GATK CNV, DNAcopy
Quality Control Checks:
  • B-allele Frequency
  • Tumor Purity Estimation
  • Tumor Ploidy Estimation
  • Allele-Specific Copy Number Estimation
  • Clonality Analysis
  • Data Validation & Integrity Checks
Quality Control Metrics:
  • Tumor Ploidy
  • Tumor Purity
  • Whole Genome Doubling
  • Cancer DNA Fraction
  • Sub-clonal Genome Fraction
Methylation Array Analysis
Tools: SeSAMe
Quality Control Checks:
  • Genotyping Probe Removal
  • Low Quality Probe Removal
  • Normalization
  • Data Validation & Integrity Checks
Quality Control Metrics: Included in SeSAMe
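
Several of the copy number metrics listed above are related to one another. As a rough illustration, the sketch below derives a cancer DNA fraction from tumor purity and tumor ploidy using a simple two-population mixture (tumor cells plus diploid normal cells). This relation is an assumption chosen for the example; the text above does not state how the GDC workflows compute the metric, and the function name and parameters are hypothetical.

```python
def cancer_dna_fraction(purity, tumor_ploidy, normal_ploidy=2.0):
    """Fraction of DNA in a sample that comes from tumor cells.

    Assumes a mixture of tumor cells (fraction `purity`, average ploidy
    `tumor_ploidy`) and normal cells with ploidy `normal_ploidy`.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    if tumor_ploidy <= 0 or normal_ploidy <= 0:
        raise ValueError("ploidy values must be positive")

    tumor_dna = purity * tumor_ploidy
    normal_dna = (1.0 - purity) * normal_ploidy
    return tumor_dna / (tumor_dna + normal_dna)


if __name__ == "__main__":
    # A 50% pure sample with a tetraploid tumor: tumor cells contribute
    # more than half of the DNA because each carries more of it.
    print(round(cancer_dna_fraction(0.5, 4.0), 3))
```

Under this model, purity and cancer DNA fraction coincide only when the tumor ploidy equals the normal ploidy, which is one reason the two are reported as separate metrics.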