Reference files used by the GDC data harmonization and generation pipelines are provided below. MD5 checksums are provided for verifying file integrity after download. Additional files are also included to allow for reproduction of GDC pipeline analyses.
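The checksum comparison can be scripted. The following is a minimal sketch, not part of the GDC tooling; the function names `md5_of_file` and `matches_checksum` are hypothetical and use only the Python standard library.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`.

    The file is read in chunks so that multi-gigabyte archives
    do not have to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare the computed digest with a listed md5 value.

    The comparison ignores letter case and surrounding whitespace.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, after downloading gencode.v36.annotation.gtf.gz, `matches_checksum("gencode.v36.annotation.gtf.gz", "c03931958d4572148650d62eb6dec41a")` should return True if the file is intact.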
GRCh38.d1.vd1 Reference Sequence
- md5: 3ffbcfe2d05d43206f57f81ebb251dc9
- file size: 875.3 MB
This reference genome is used by the GDC for all sequencing and array-based analyses. This file is composed of the following sequences:
- GCA_000001405.15_GRCh38_no_alt_analysis_set
- Sequence Decoys (GenBank Accession GCA_000786075)
- Virus Sequences
Index Files
Index files are built from the GDC reference genome and are used with the software listed below.
GDC.h38.d1.vd1 BWA Index Files
- GRCh38.d1.vd1_BWA.tar.gz
- md5: 015f5223bddd93b6e8f7a038c171f7be
- file size: 3.2 GB
GDC.h38.d1.vd1 GATK Index Files
- GRCh38.d1.vd1_GATK_indices.tar.gz
- md5: f64be73587a7f376c0d8353f1636dca7
- file size: 104 KB
GDC.h38.d1.vd1 STAR2 Index Files (v36)
- star-2.7.5c_GRCh38.d1.vd1_gencode.v36.tgz
- md5: acafb76bba5e3e80eb028dc05f002ffc
- file size: 25 GB
GDC.h38.d1.vd1 STAR2 Index Files (v22)
- star.index.genome.d1.vd1.gtfv22.tar.gz
- md5: 7c2e6bd5767239c7c9eb618cd03bcadb
- file size: 24.9 GB
Annotation Files
Annotation files contain information about the position and identity of regions in the reference genome. They allow software to calculate expression values.
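As an illustration of how software reads position and identity from an annotation file, the sketch below parses a single line in the standard nine-column, tab-separated GTF layout. It is not the GDC pipeline's parser; `parse_gtf_line` is a hypothetical helper, and the attribute handling is simplified (it splits on semicolons and keeps only the first value of a repeated key such as `tag`).

```python
def parse_gtf_line(line):
    """Parse one GTF feature line into a dictionary.

    GTF coordinates are 1-based and inclusive, so the feature
    length is end - start + 1.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields

    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        # Keep the first value when a key is repeated.
        attributes.setdefault(key, value.strip('"'))

    start, end = int(start), int(end)
    return {
        "seqname": seqname,
        "feature": feature,
        "start": start,
        "end": end,
        "strand": strand,
        "length": end - start + 1,
        "attributes": attributes,
    }
```

Comment lines beginning with `#` at the top of a GTF file are headers and should be skipped before calling this function.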
GDC.h38 miRNA database files
- mirna_database.tar.gz
- md5: d078aec8561d72b52e475e3f932865e4
- file size: 185 MB
GDC.h38 GENCODE v36 GTF
- gencode.v36.annotation.gtf.gz
- md5: c03931958d4572148650d62eb6dec41a
- file size: 44.5 MB
GDC.h38 GENCODE v22 GTF
- gencode.v22.annotation.gtf.gz
- md5: 291330bdcff1094bc4d5645de35e0871
- file size: 39.0 MB
GDC.h38 GENCODE TSV (v22)
- gencode.gene.info.v22.tsv
- md5: 0a3f1d9b0a679e2a426de36d8d74fbf9
- file size: 6 MB
Miscellaneous Files
Methylation Array Gene Annotation File (v36)
- EPIC.hg38.manifest.gencode.v36.tsv.gz
- md5: 071d925096dce531739cfb955605217b
- file size: 29.9 MB
- HM27.hg38.manifest.gencode.v36.tsv.gz
- md5: 9d4e032a9bd13127ffb9782f66450fd6
- file size: 1.7 MB
- HM450.hg38.manifest.gencode.v36.tsv.gz
- md5: e163fc110043abb5a7ef623816383bb9
- file size: 17.6 MB
Antibody Description Files for TCGA RPPA Data (v36)
- TCGA_antibodies_descriptions.gencode.v36.tsv
- md5: 1e3bed697ed431b16ba2a4dae8f52fd1
- file size: 35 KB
Antibody Description Files for TCGA RPPA Data (v22)
- TCGA_antibodies_descriptions.gencode.v22.tsv
- md5: b5d84afabed98a034121372df01d726f
- file size: 35 KB
Genome Annotation Files for Legacy TCGA Data
- TCGA.hg19.June2011.gaf
- md5: b9e0c2b81736d82d62bb6ab8cc517644
- file size: 629 MB
- TCGA.hg18.Feb2011.gaf
- md5: 9a5c05c5b836ec19517871f30f2bccba
- file size: 558 MB
SNP6 GRCh38 Remapped Probeset File for Copy Number Variation Analysis
- snp6.na35.remap.hg38.subset.txt.gz
- md5: 051457f33d264d74825a41d6b0378ac4
- file size: 14.1 MB
- Used for Data Release 12 and after
If you are using Masked Copy Number Segment files for GISTIC analysis, keep only probesets with freqcnv = FALSE.
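The freqcnv filter can be applied with a few lines of Python. This is a sketch under stated assumptions: the probeset file is assumed to be tab-delimited with a header row that includes a `freqcnv` column holding TRUE or FALSE, and the helper names `read_probesets` and `keep_non_freqcnv` are hypothetical. Check the header of the downloaded file before relying on it.

```python
import csv
import gzip


def read_probesets(handle):
    """Read tab-delimited probeset rows from an open text handle.

    Assumes the first row is a header that names the columns.
    """
    return csv.DictReader(handle, delimiter="\t")


def keep_non_freqcnv(rows):
    """Yield only the rows whose freqcnv value is FALSE."""
    for row in rows:
        if row["freqcnv"].strip().upper() == "FALSE":
            yield row


def load_filtered_probesets(path):
    """Open a gzip-compressed probeset file and return the kept rows."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(keep_non_freqcnv(read_probesets(handle)))
```

For example, `load_filtered_probesets("snp6.na35.remap.hg38.subset.txt.gz")` would return the probesets to retain for GISTIC, provided the column assumptions hold.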
SNP6 GRCh38 Liftover Probeset File for Copy Number Variation Analysis
- snp6.na35.liftoverhg38.txt.zip
- md5: 0f982112bc81f31f1ad49a785a10305f
- file size: 14.4 MB
- Used before Data Release 12
GDC VEP Cache File
- homo_sapiens.tar.gz
- md5: 57064d0b081f0b99b2663977121f23c5
- file size: 4.6 GB
GDC Panel of Normals (PON) Files used for Variant Calling
These files are controlled and require dbGaP access to download. You will need to use the gdc-client to download them.
For Tumor-Only Variant Calling Pipeline
gatk4_mutect2_4136_pon.vcf.tar
- uuid: 6c4c4a48-3589-4fc0-b1fd-ce56e88c06e4
- md5: 725d891e02ca93edaabac8b09322439e
- file size: 92 MB
For Tumor / Normal Variant Calling Pipeline
MuTect2.PON.4136.vcf.tar
- uuid: 6b45b9f7-893e-4947-83b6-db0402471e23
- md5: d13a138dcf4e9f1ec8a69ac3a4f64ca9
- file size: 121 MB
MuTect2.PON.5210.vcf.tar
- uuid: 726e24c0-d2f2-41a8-9435-f85f22e1c832
- md5: 5b5c1c3e208aa9a403cc4a8ff39e7f1f
- file size: 146 MB
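Because the PON files are downloaded by UUID with the gdc-client, the command line can be assembled programmatically. The sketch below only builds the argument list; it assumes `gdc-client` is installed and on the PATH, that a token file has been exported from the GDC Data Portal, and that the installed version accepts `-t` (token file) and `-d` (output directory), which can be confirmed with `gdc-client download --help`. The names `PON_UUIDS` and `gdc_download_command` are hypothetical.

```python
import shlex

# UUIDs copied from the PON listing above.
PON_UUIDS = {
    "gatk4_mutect2_4136_pon.vcf.tar": "6c4c4a48-3589-4fc0-b1fd-ce56e88c06e4",
    "MuTect2.PON.4136.vcf.tar": "6b45b9f7-893e-4947-83b6-db0402471e23",
    "MuTect2.PON.5210.vcf.tar": "726e24c0-d2f2-41a8-9435-f85f22e1c832",
}


def gdc_download_command(uuid, token_file, out_dir="."):
    """Build the argument list for a controlled-access download.

    The command is returned rather than executed so it can be
    inspected or logged before running.
    """
    return ["gdc-client", "download", "-t", token_file, "-d", out_dir, uuid]


def as_shell_string(command):
    """Render the argument list as a shell-quoted string for display."""
    return shlex.join(command)
```

To run a download, pass the list to `subprocess.run(command, check=True)`. After the download completes, compare the file against the md5 value listed for it.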