GDC Data Quality

The GDC provides access to high-quality datasets from NCI-supported programs and recommends that organizations generating datasets from non-NCI-supported programs follow biospecimen best practices. Dataset quality is maintained through GDC processes and tools for submitted data validation and for data harmonization, which includes aligning data to a common reference genome and generating higher-level data.

High-Quality Tissue Samples

The GDC obtains datasets from NCI programs that ensure high quality by accepting only tissues that have extensive information on their sources and have undergone stringent material processing quality control. The GDC also obtains datasets from other NCI-supported programs that adhere to high quality standards for handling biospecimens and associated analytes, and it recommends that non-NCI-supported programs submitting data to the GDC follow Biospecimen Best Practices. Organizations wishing to submit data to the GDC should visit the NCI Biorepositories and Biospecimen Research Branch (BBRB) site for recommended tissue collection strategies for generating high-quality datasets.

The GDC maintains standardized biospecimen information on the tissues and samples used in each study. Information on the tissue sample, portion, analyte, and/or aliquot is extracted from submitted data, maintained in the GDC data model, and made accessible via the GDC Data Portal.
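As a rough illustration of the biospecimen entities named above, the sketch below models them as nested Python dataclasses. It assumes a strict sample, portion, analyte, aliquot nesting and a single `submitter_id` field per entity. Both are simplifications chosen for this example, and the class and function names are hypothetical rather than part of any GDC tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Aliquot:
    submitter_id: str


@dataclass
class Analyte:
    submitter_id: str
    aliquots: List[Aliquot] = field(default_factory=list)


@dataclass
class Portion:
    submitter_id: str
    analytes: List[Analyte] = field(default_factory=list)


@dataclass
class Sample:
    submitter_id: str
    portions: List[Portion] = field(default_factory=list)


def aliquot_ids(sample: Sample) -> List[str]:
    """Walk sample -> portion -> analyte -> aliquot and collect aliquot IDs."""
    return [
        aliquot.submitter_id
        for portion in sample.portions
        for analyte in portion.analytes
        for aliquot in analyte.aliquots
    ]


# Illustrative identifiers only; these are not real GDC records.
example = Sample(
    "sample-1",
    [Portion("portion-1", [Analyte("analyte-1", [Aliquot("aliquot-1"), Aliquot("aliquot-2")])])],
)
print(aliquot_ids(example))
```

Keeping each level as its own type makes it straightforward to trace any aliquot back to the tissue sample it came from.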

Submitted Data Validation

Data validation is performed on data submitted to the GDC through the GDC Data Submission Processes and Tools. Submitted data is not distributed by the GDC unless it passes the following GDC data validation checks:

Verification of MD5 Checksum - Comparison of the provided MD5 Checksum with the MD5 Checksum of the file submitted to the GDC.
Data Type & Format Validation - Validation of biospecimen, clinical, and genomic data against the standard GDC Data Types and File Formats and GDC Data Dictionary.
Validation of Data References - Cross check of existing Universally Unique Identifiers (UUIDs) or barcodes across biospecimen, clinical, and genomic data.
Data Integrity Checks - Validation of clinical, biospecimen, and genomic data using 100+ GDC data integrity checks. Examples include:

Data Type	Example Data Integrity Check
Clinical	Primary site and disease type are consistent for a case; Year of diagnosis is greater than the year of birth, and less than the year of death; If the index represents diagnosis, the days to diagnosis should be zero; Days to birth and days to death are calculated accurately off of the index days to values; Follow-up days to values are greater than the index days to values; ...
Biospecimen	If associated with multiple read groups, molecular files should be associated with one aliquot; Read groups should have the same experimental strategy as their associated molecular data; The read group requires a read group name; In addition to the required read group properties, the flow cell barcode, lane number, and multiplex barcode should be provided as this information is used to construct a platform unit (PU), which is a universally unique identifier that can be used to model various sequencing technical artifacts (see the SAM specification); For projects with library strategies of targeted sequencing or WXS information on the target capture protocol, a target capture kit is required; ...
Genomic	If a read group is associated with a BAM file, the @RG ID should be present in the BAM header as the read group name. This is important for the harmonization process and will reduce the possibility of errors; Submitted coverage must be consistent with coverage generated by the GDC; Single cell RNA-Seq data files must follow a consistent filename format; ...
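Two of the checks described in this section, MD5 checksum verification and a few of the clinical integrity rules, can be sketched in Python as follows. This is only an illustration of the idea, not the GDC's validation code. The function names and the dictionary keys (`year_of_birth`, `year_of_diagnosis`, `year_of_death`, `index_is_diagnosis`, `days_to_diagnosis`) are assumptions made for this example.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, provided_md5):
    """Compare the submitter-provided MD5 with the digest computed from the file."""
    return md5_of_file(path) == provided_md5.strip().lower()


def clinical_integrity_errors(record):
    """Return a list of problems found in one clinical record.

    `record` is a hypothetical dict; its keys are illustrative assumptions.
    Missing values (None or absent keys) are skipped rather than flagged.
    """
    errors = []
    birth = record.get("year_of_birth")
    diagnosis = record.get("year_of_diagnosis")
    death = record.get("year_of_death")

    if birth is not None and diagnosis is not None and diagnosis <= birth:
        errors.append("year of diagnosis must be greater than year of birth")
    if death is not None and diagnosis is not None and diagnosis >= death:
        errors.append("year of diagnosis must be less than year of death")
    if record.get("index_is_diagnosis") and record.get("days_to_diagnosis") != 0:
        errors.append("days to diagnosis must be zero when the index is diagnosis")
    return errors
```

A submission pipeline built along these lines would hold back any file for which `checksum_matches` returns `False` or any record for which `clinical_integrity_errors` returns a non-empty list.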

GDC Data Harmonization

The GDC uses submitted genomic sequence data to create derived data products such as somatic DNA mutations, gene expression, and copy number variations. Validation of genomic data is performed using GDC Data Harmonization software and algorithms. Bioinformatics workflows are developed with ongoing input from recognized experts in the cancer genomics community. Workflows are implemented using techniques that make them reproducible, interoperable across multiple platforms, and shareable with any interested member of the community. GDC workflows are described in detail on the GDC Documentation Site and made available in the GDC GitHub Repository. Quality control checks are performed within GDC workflows, and the GDC adds summary metrics to the aligned reads so that users can query them. For a complete list of the summary metrics, as well as the tools used to generate them, please visit the Data Dictionary Viewer.

Workflow	Tools	Quality Control Checks	Quality Control Metrics
DNA Alignment & Somatic Variant Calling	BWA, Picard Tools, GATK, MuSE, MuTect2, VarScan2, Pindel, CaVEMan, Strelka2, SvABA	Mapping Quality; Base Quality Score Recalibration; Duplicate Marking; Strand Bias Filtering; Oxidation Damage Filtering; Germline Variant Filtering; Local-realignment; Variant Normalization, Multi-caller Ensemble; Data Validation & Integrity Checks	Total Reads; Mapped Reads; Duplicated Reads; Mismatched Bases; Average Insertion Size; Average Read Length; Average Base Quality; Mean Coverage; Proportion Coverage at 10X; Proportion Coverage at 30X; Proportion Target without Coverage; Read Pairs on Different Chromosomes; Cross-sample Contamination; Estimation Error of Cross-sample Contamination
RNA Alignment, Expression, and Gene Fusion Analysis	RNA-Seq: STAR, STAR Fusion, Arriba; scRNA-Seq: Seurat, CellRanger, STAR Solo	Mapping Quality; Normalization; Differential Gene Expression Analysis; PCA & Embedding; Data Validation & Integrity Checks	Total Reads; Mapped Reads; Duplicated Reads; Mismatched Bases; Average Insertion Size; Average Read Length; Average Base Quality; Read Pairs on Different Chromosomes
miRNA Alignment & Expression Analysis	BWA, BCGSC miRNA Profiling	Mapping Quality; Normalization; Data Validation & Integrity Checks	Total Reads; Mapped Reads; Duplicated Reads; Mismatched Bases; Average Insertion Size; Average Read Length; Average Base Quality; Read Pairs on Different Chromosomes
Copy Number Variation Analysis	ASCAT, ABSOLUTE, GATK CNV, DNAcopy	B-allele Frequency; Tumor Purity Estimation; Tumor Ploidy Estimation; Allele-Specific Copy Number Estimation; Clonality Analysis; Data Validation & Integrity Checks	Tumor Ploidy; Tumor Purity; Whole Genome Doubling; Cancer DNA Fraction; Sub-clonal Genome Fraction
Methylation Array Analysis	SeSAMe	Genotyping Probe Removal; Low Quality Probe Removal; Normalization; Data Validation & Integrity Checks	Included in SeSAMe
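To make the coverage metrics in the DNA row more concrete, the sketch below computes them from a list of per-base read depths. It assumes that "Proportion Coverage at 10X" means the fraction of positions with depth of at least 10, and that "without coverage" means depth zero. These definitions, the function name, and the output keys are assumptions for illustration. The GDC workflows compute their metrics with the tools listed in the table, not with this code.

```python
def coverage_summary(depths, thresholds=(10, 30)):
    """Summarize per-base depths over a target region.

    depths: sequence of non-negative integers, one per target position.
    Returns mean coverage, the proportion of positions with zero depth,
    and the proportion of positions at or above each threshold.
    """
    if not depths:
        raise ValueError("depths must contain at least one position")

    total = len(depths)
    summary = {
        "mean_coverage": sum(depths) / total,
        "proportion_without_coverage": sum(1 for d in depths if d == 0) / total,
    }
    for threshold in thresholds:
        covered = sum(1 for d in depths if d >= threshold)
        summary[f"proportion_coverage_{threshold}x"] = covered / total
    return summary


# Toy depths for five positions; real inputs would span the whole target.
print(coverage_summary([0, 5, 10, 30, 55]))
```

Reporting proportions at fixed thresholds alongside the mean helps distinguish evenly covered samples from those where a high mean hides many poorly covered positions.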


