Data Types and File Formats

The GDC provides access to data submitted by new programs, data generated by the GDC through its alignment and high-level data generation pipelines, and data imported from existing programs.

Submitted Data

The GDC currently accepts DNA and RNA sequencing data in both FASTQ and BAM formats. Sequencing data is submitted with accompanying metadata in simple tab-separated values (TSV) format, in JavaScript Object Notation (JSON) format, or in the latest version (currently 1.5) of the SRA XML format. Clinical and biospecimen data can be submitted in either TSV or JSON format, or as XML that is validated against the latest version of the NCI Biospecimen Core Resource XML Schema documents.
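
Because the same metadata can be expressed in TSV or JSON, it may help to see how the two relate. The following is a minimal sketch using only the Python standard library. The column names (submitter_id, file_format) and the sample values are hypothetical placeholders chosen for illustration and are not taken from the GDC data dictionary.

```python
import csv
import io
import json


def tsv_to_json_records(tsv_text):
    """Convert tab-separated text (header row plus data rows) to a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


# Hypothetical metadata: these column names and values are illustrative only.
example_tsv = (
    "submitter_id\tfile_format\n"
    "sample-001\tBAM\n"
    "sample-002\tFASTQ\n"
)

records = tsv_to_json_records(example_tsv)

# Each TSV row becomes one JSON object keyed by the header names.
print(json.dumps(records, indent=2))
```

Each row of the TSV maps to one JSON object, so a flat metadata table carries the same information in either form. All values are read as strings; any type conversion would have to be added separately.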

Generated Data

For all submitted sequence data, including BAM alignment files, the GDC generates new alignments in BAM format against the latest human reference genome, GRCh38, using standard alignment pipelines. From these standard alignments, the GDC generates high-level derived data, including normal and tumor variant and mutation calls in VCF and MAF formats, and gene expression, miRNA expression, and splice junction quantification data in TSV format.
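
As an illustration of working with derived variant calls, the sketch below reads the eight fixed columns of an uncompressed VCF file using only the Python standard library. It relies on the general VCF layout (header lines begin with "#", data lines are tab-separated) and is not a description of the GDC pipelines. The function name read_vcf_records and the example variants are hypothetical.

```python
import io

# The eight fixed, tab-separated columns defined by the VCF format.
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def read_vcf_records(handle):
    """Return the fixed columns of each VCF data line as a dict.

    Lines starting with '#' (meta-information and the column header)
    and blank lines are skipped.
    """
    records = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])  # positions are 1-based integers
        records.append(record)
    return records


# Hypothetical example content, not real GDC output.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20\n"
    "chr2\t67890\t.\tC\tT\t.\tLowQual\tDP=5\n"
)

records = read_vcf_records(io.StringIO(example_vcf))
passing = [r for r in records if r["FILTER"] == "PASS"]
print(len(records), "records,", len(passing), "passing filters")
```

For real files, which are often compressed and carry per-sample genotype columns, a dedicated VCF library is usually a better choice than hand-written parsing.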

Imported Data

The GDC hosts and distributes previously generated data from The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and other programs. Original sequence alignments are stored in BAM format, and derived data files are stored and provided in their original formats.