Genomic Data Harmonization

Genomic Harmonization Overview

The GDC uses submitted sequence data (in FASTQ or BAM format) and microarray data to generate derived analysis data. These analyses include tumor sequence variant calls, RNA-Seq gene expression quantification values, and copy-number segmentation values.

Sequence data is aligned (or realigned) to the latest human genome reference. The resulting alignments are then processed to produce derived data. Both the alignments and the derived data are available to users via the GDC Data Portal. Array data is processed using data-type-specific methods.

Each phase of processing is standardized into common pipelines that use open source sequence analysis tools. Because all sequence data submitted to the GDC is subjected to analysis through these standard pipelines, the analysis process is referred to as data harmonization. When data is successfully harmonized, it is made available through the GDC Data Portal and other access tools. If the process of harmonization reveals underlying issues in the data, the associated files will be recalled and will not be available through the GDC.

The genomic harmonization pipelines were developed in consultation with senior experts in the field of cancer genomics. They are regularly evaluated and updated as existing tools and parameter sets are improved and new ones are developed.

Reference Genome and Alignment Workflow

Reference genome alignment is the first step of the harmonization process for all sequencing-based workflows. Although the alignment algorithm varies by pipeline depending on read length and type, all alignments are performed against the same version of the GRCh38 reference genome. See the GDC Documentation site for details on the algorithm used for each pipeline. The reference also includes viral and decoy sequences, which draw reads that would not normally map to the human genome, provide information on the presence of oncoviruses, and allow for a more accurate alignment. The current virus decoy set contains 10 types of human viruses, including human cytomegalovirus (CMV), Epstein-Barr virus (EBV), hepatitis B (HBV), hepatitis C (HCV), human immunodeficiency virus (HIV), human herpes virus 8 (HHV-8), human T-lymphotropic virus 1 (HTLV-1), Merkel cell polyomavirus (MCV), simian vacuolating virus 40 (SV40) and human papillomavirus (HPV).

An initial alignment is performed separately on each read group, which is defined as a set of reads that originates from one Illumina sequencing lane. The resulting alignments that originate from a single aliquot are then merged. Pipeline-specific details about the alignment and downstream analyses can be found in the respective sections below or on the GDC Documentation website.
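The per-read-group alignment followed by a per-aliquot merge can be sketched as below. This is only an illustration of the grouping logic: the `ReadGroup` fields, the `align_read_group` stand-in, and the file names are hypothetical and do not correspond to actual GDC code or tools.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadGroup:
    read_group_id: str  # hypothetical identifier
    aliquot_id: str     # aliquot the reads were sequenced from
    lane: int           # sequencing lane the reads came from


def align_read_group(rg: ReadGroup) -> str:
    """Stand-in for a real aligner call; returns a hypothetical BAM file name."""
    return f"{rg.read_group_id}.aligned.bam"


def plan_merges(read_groups):
    """Align each read group separately, then group the alignments by aliquot."""
    per_aliquot = defaultdict(list)
    for rg in read_groups:
        per_aliquot[rg.aliquot_id].append(align_read_group(rg))
    return dict(per_aliquot)


read_groups = [
    ReadGroup("rg1", "aliquot_A", 1),
    ReadGroup("rg2", "aliquot_A", 2),
    ReadGroup("rg3", "aliquot_B", 1),
]
merge_plan = plan_merges(read_groups)
for aliquot, bams in merge_plan.items():
    print(aliquot, "<-", bams)
```

Each key in `merge_plan` is one aliquot, and its value lists the per-read-group alignments that would be merged into a single file for that aliquot.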

GDC Data Harmonization Pipeline

GDC Pipeline Overviews

Brief summaries of the workflows used by the GDC are listed below. Each summary has a link to its corresponding section of the GDC Documentation website. The GDC Documentation website contains details about each step of the pipeline, the command-line parameters used to run each step, and information about the corresponding files available at the GDC Data Portal.

DNA-Seq Somatic Variant Analysis

The DNA-Seq Somatic Variant Analysis pipeline identifies and characterizes somatic mutations by comparing reference alignments from tumor and normal samples from the same case. The validity of these mutations is assessed using internal algorithms and external variant databases. A co-cleaning step recalibrates base quality scores and realigns indels to improve alignment accuracy. Four separate algorithms (MuSE, Mutect2, SomaticSniper, VarScan2) are then used to perform variant calling on paired tumor/normal samples to identify somatic mutations. Variants are annotated independently and with information from external databases such as dbSNP and OMIM. All annotated variant calls from one project are then aggregated into one MAF file per variant calling pipeline. MAF files are filtered to remove any potentially erroneous or germline variant calls. The filtered, open-access Somatic MAFs are available to the general public, whereas the unfiltered MAFs are available only to dbGaP-authorized investigators.
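The aggregation and filtering steps can be pictured with a minimal sketch. The record fields (`caller`, `gene`, `passes_filters`) and the example calls are hypothetical placeholders; the actual MAF columns and filtering criteria are described on the GDC Documentation website and are not reproduced here.

```python
from collections import defaultdict


def aggregate_by_caller(annotated_calls):
    """Group a project's annotated calls so each caller gets its own table."""
    per_caller = defaultdict(list)
    for call in annotated_calls:
        per_caller[call["caller"]].append(call)
    return dict(per_caller)


def open_access_subset(calls):
    """Keep only calls that passed the (hypothetical) filtering flag."""
    return [c for c in calls if c["passes_filters"]]


# Made-up calls for one project; real records carry many more fields.
project_calls = [
    {"caller": "MuSE", "gene": "GENE_A", "passes_filters": True},
    {"caller": "MuSE", "gene": "GENE_B", "passes_filters": False},
    {"caller": "Mutect2", "gene": "GENE_A", "passes_filters": True},
]

unfiltered = aggregate_by_caller(project_calls)
filtered = {caller: open_access_subset(calls) for caller, calls in unfiltered.items()}
for caller in unfiltered:
    print(caller, len(unfiltered[caller]), "calls,", len(filtered[caller]), "after filtering")
```

In this picture, `unfiltered` stands for the controlled-access tables and `filtered` for the open-access ones, with one table per variant caller in both cases.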


See the GDC Documentation website for a detailed overview of this pipeline.

RNA-Seq Gene Expression Analysis

The RNA-Seq Analysis pipeline quantifies protein-coding gene expression based on the number of reads aligned to each gene. A "two-pass" method is used in which RNA-Seq reads are first aligned to the reference genome to detect splice junctions. A second alignment is then performed using the information from splice junctions to increase the quality of the alignment. Read counts are measured on a gene level using HTSeq and normalized using the Fragments Per Kilobase of transcript per Million mapped reads (FPKM) and FPKM Upper Quartile (FPKM-UQ) methods with custom scripts.
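The two normalizations can be illustrated with a short sketch. It assumes the commonly used definitions FPKM = count × 10^9 / (total protein-coding count × gene length) and FPKM-UQ = count × 10^9 / (75th-percentile count × gene length); the gene names, counts, and lengths are made up, and the quartile interpolation method is an assumption rather than the one used by the GDC scripts.

```python
from statistics import quantiles


def fpkm(counts, lengths):
    """FPKM per gene; `counts` is assumed to hold protein-coding genes only."""
    total = sum(counts.values())
    return {g: c * 1e9 / (total * lengths[g]) for g, c in counts.items()}


def fpkm_uq(counts, lengths):
    """FPKM-UQ per gene, scaling by the 75th-percentile read count instead."""
    upper_quartile = quantiles(list(counts.values()), n=4, method="inclusive")[2]
    return {g: c * 1e9 / (upper_quartile * lengths[g]) for g, c in counts.items()}


# Made-up read counts and gene lengths (in base pairs) for one sample.
counts = {"GENE_A": 100, "GENE_B": 200, "GENE_C": 300, "GENE_D": 400, "GENE_E": 500}
lengths = {"GENE_A": 1000, "GENE_B": 2000, "GENE_C": 1000, "GENE_D": 4000, "GENE_E": 5000}

fpkm_values = fpkm(counts, lengths)
fpkm_uq_values = fpkm_uq(counts, lengths)
for gene in counts:
    print(gene, round(fpkm_values[gene], 2), round(fpkm_uq_values[gene], 2))
```

The only difference between the two functions is the denominator: FPKM-UQ replaces the total read count with the upper-quartile count, which makes the value less sensitive to a few very highly expressed genes.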

See the GDC Documentation website for a detailed overview of this pipeline.

miRNA-Seq Analysis

The miRNA-Seq pipeline quantifies micro-RNA gene expression. The name and genomic location of each miRNA are retrieved from miRBase, and the expression levels are measured and normalized post-alignment. Normalization is performed using the Reads Per Million (RPM) method. Expression levels for known miRNAs and observed miRNA isoforms are generated for each sample.
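RPM normalization can be sketched in a few lines. The example assumes that the denominator is the total number of reads mapped to miRNAs in the sample; the read counts shown are made up for illustration.

```python
def reads_per_million(counts):
    """Scale each miRNA read count to reads per million mapped miRNA reads."""
    total = sum(counts.values())
    return {mirna: count * 1e6 / total for mirna, count in counts.items()}


# Made-up read counts for one sample.
mirna_counts = {"hsa-mir-21": 5000, "hsa-let-7a-1": 3000, "hsa-mir-155": 2000}
rpm = reads_per_million(mirna_counts)
for mirna, value in rpm.items():
    print(mirna, value)
```

Because every count is divided by the same sample total, RPM values from samples with different sequencing depths become directly comparable.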


See the GDC Documentation website for a detailed overview of this pipeline.

Copy Number Variation Analysis

The Copy Number Variation Analysis pipeline detects duplications or deletions of contiguous chromosomal regions by measuring array intensity levels. Level 2 normalized array data is used to perform circular binary segmentation (CBS) analysis with the R package DNAcopy. CBS analysis divides each chromosome into contiguous segments of equal copy number and quantifies each segment.
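The output of a segmentation step can be illustrated with a small sketch. It does not implement CBS itself: the breakpoints are assumed to be already known, and the function only shows how each contiguous segment is summarized. The field names and the probe values are hypothetical.

```python
from statistics import mean


def summarize_segments(chrom, positions, log2_ratios, breakpoints):
    """Summarize contiguous segments of probes.

    `breakpoints` holds the sorted probe indices at which a new segment
    starts (index 0 excluded); finding them is the job of CBS, not this code.
    """
    bounds = [0, *breakpoints, len(positions)]
    segments = []
    for lo, hi in zip(bounds, bounds[1:]):
        segments.append({
            "chromosome": chrom,
            "start": positions[lo],
            "end": positions[hi - 1],
            "num_probes": hi - lo,
            "segment_mean": mean(log2_ratios[lo:hi]),
        })
    return segments


# Made-up probe positions and log2 intensity ratios on one chromosome.
positions = [100, 200, 300, 400, 500, 600]
log2_ratios = [0.0, 0.1, -0.1, 1.0, 0.9, 1.1]
segments = summarize_segments("chr1", positions, log2_ratios, breakpoints=[3])
for seg in segments:
    print(seg)
```

Here the first three probes form a segment with a mean near zero (no change), while the last three form a segment with an elevated mean, which is the pattern a duplication would produce.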


See the GDC Documentation website for a detailed overview of this pipeline.

Methylation Array Analysis

At the GDC, beta values calculated from methylation array analyses are harmonized by converting coordinates from an older reference genome to the newer GRCh38 reference genome. Harmonized methylation array probe sets are analyzed to determine the proximity of transcripts and CpG islands (CGIs) to each associated CpG site. This metadata is associated with each methylation beta value and its probe, which is matched to a specific CpG site.


See the GDC Documentation website for a detailed overview of this pipeline.

Pipeline Implementation

GDC pipelines are packaged as a series of Docker containers. Docker can wrap up a complete environment that contains everything a bioinformatics pipeline needs to run, including code, runtime, tools, and libraries. This approach significantly improves the reproducibility and portability of bioinformatics software on Linux systems. Realignment annotations, such as the Docker ID, time cost, and exact command used in the Docker container, are stored as properties of the workflow for each file created. All other QC metrics and realignment tool logs are saved as individual files in the object store. For data remediation, the GDC examines all QC results manually for problem detection. The GDC will establish criteria and implement automatic remediation steps in the workflow.
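The kind of workflow annotation described above might be captured along the following lines. This is a sketch only: the `WorkflowRecord` class, its field names, the image identifier, and the command string are hypothetical, and the step being timed is an empty placeholder rather than a real container run.

```python
import time
from dataclasses import asdict, dataclass


@dataclass
class WorkflowRecord:
    docker_image_id: str    # identifier of the container image that was used
    command: str            # exact command run inside the container
    elapsed_seconds: float  # time cost of the step


def run_and_record(docker_image_id, command, step):
    """Run `step` (a callable standing in for the container run) and time it."""
    start = time.perf_counter()
    step()
    elapsed = time.perf_counter() - start
    return WorkflowRecord(docker_image_id, command, elapsed)


record = run_and_record(
    docker_image_id="example-image-id",            # placeholder, not a real image
    command="example-tool --input sample.fastq",   # placeholder command
    step=lambda: None,                             # placeholder for the real work
)
print(asdict(record))
```

Storing the image identifier and exact command next to each output file is what allows a result to be traced back to, and reproduced with, the same environment.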