GDC Reference Files

The reference files used by the GDC data harmonization and generation pipelines are listed below. MD5 checksums are provided so that file integrity can be verified after download. Additional files are included so that GDC pipeline analyses can be reproduced.

GRCh38.d1.vd1 Reference Sequence


  • md5: 3ffbcfe2d05d43206f57f81ebb251dc9
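
As a minimal sketch of the integrity check, the following Python computes the MD5 digest of a downloaded file in chunks and compares it with a published checksum. The function names `md5_of_file` and `verify_download` are hypothetical helpers introduced here for illustration; only the checksum value is taken from this page.

```python
import hashlib

# Checksum published on this page for the GRCh38.d1.vd1 reference sequence.
EXPECTED_MD5 = "3ffbcfe2d05d43206f57f81ebb251dc9"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large reference files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True when the file's MD5 digest matches the expected value
    (comparison ignores case and surrounding whitespace)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Calling `verify_download(path, EXPECTED_MD5)` on the downloaded file, where `path` is wherever the file was saved locally, returns `True` only if the download matches the published checksum.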

This reference genome is used by the GDC for all sequencing and array-based analyses. This file is composed of the following sequences:

Index Files

Index files are built from the GDC reference genome and are used with the software listed below.

GDC.h38.d1.vd1 BWA Index Files

GDC.h38.d1.vd1 GATK Index Files

GDC.h38.d1.vd1 STAR2 Index Files

Annotation Files

Annotation files contain information about the position and identity of regions in the reference genome. Software uses them to assign aligned reads to annotated features, such as genes and exons, when calculating expression values.

GDC.h38 miRNA database files

GDC.h38 GENCODE v22 GTF (used in RNA-Seq alignment and by HTSeq)

GDC.h38 Flattened GENCODE v22 GFF (used by DEXSeq for exon quantification)
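
To illustrate how position and identity are encoded in an annotation file, here is a rough sketch that parses one line of a GTF file, assuming the standard nine tab-separated GTF columns. The function name `parse_gtf_line` is hypothetical, and the attribute parsing is deliberately simplified: it keeps only the last value of a repeated key and does not handle semicolons inside quoted values.

```python
def parse_gtf_line(line):
    """Parse one GTF line into a dict, or return None for comments and blanks.

    Assumes the standard nine tab-separated GTF columns:
    seqname, source, feature, start, end, score, strand, frame, attributes.
    """
    if line.startswith("#") or not line.strip():
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields

    # Attributes look like: key "value"; key "value";
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }
```

A quantification tool works from the same fields: the coordinates and strand say where a feature lies, and attributes such as `gene_id` say which feature the overlapping reads should be counted toward.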

Miscellaneous Files

Genome Annotation Files for Legacy TCGA Data

SNP6 GRCh38 Remapped Probeset File for Copy Number Variation Analysis

If you are using Masked Copy Number Segment files for GISTIC analysis, keep only probesets with freqcnv = FALSE.
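
The filtering step can be sketched as follows. This assumes the probeset file is tab-delimited with a header row that contains a `freqcnv` column holding `TRUE` or `FALSE`; that layout is an assumption, so check it against the downloaded file. The function name `keep_non_frequent_cnv_probesets` is hypothetical.

```python
import csv


def keep_non_frequent_cnv_probesets(in_path, out_path, column="freqcnv"):
    """Copy rows whose `column` value is FALSE from in_path to out_path.

    Assumes a tab-delimited file with a header row. Returns the number
    of probeset rows that were kept.
    """
    kept = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in header")
        writer = csv.DictWriter(
            dst, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in reader:
            # Keep only probesets that are not flagged as frequent CNV sites.
            if row[column].strip().upper() == "FALSE":
                writer.writerow(row)
                kept += 1
    return kept
```

The filtered output keeps the original header and column order, so it can stand in for the full probeset file wherever a tool expects that format.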

SNP6 GRCh38 Liftover Probeset File for Copy Number Variation Analysis