Data Types and File Formats

The GDC provides access to three kinds of data: data submitted by new programs, data generated by the GDC through its alignment and high-level data generation pipelines, and data imported from existing programs.

Submitted Data

The GDC currently accepts DNA and RNA sequencing data in both FASTQ and BAM formats. Sequencing data is submitted with accompanying metadata in simple tab-separated values (TSV) format, in JavaScript Object Notation (JSON) format, or in the latest version (currently 1.5) of the SRA XML format. Clinical and biospecimen data can be submitted in either TSV or JSON format, or as XML that is validated against the latest version of the NCI Biospecimen Core Resource XML Schema documents.
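
As a minimal sketch of how the two simple metadata encodings relate, the Python snippet below reads one TSV record and re-serializes it as JSON. The column names (type, submitter_id, sample_type) and their values are hypothetical placeholders chosen for illustration, not the GDC's actual schema; the GDC Data Model remains the authority on real field names.

```python
import csv
import io
import json

# Hypothetical metadata record in TSV form: a header row plus one data row.
# Column names and values are illustrative assumptions, not the GDC schema.
TSV_TEXT = "type\tsubmitter_id\tsample_type\nsample\tSAMPLE-001\tPrimary Tumor\n"


def tsv_to_records(tsv_text):
    """Parse TSV text into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


def records_to_json(records):
    """Serialize the records as a JSON array."""
    return json.dumps(records, indent=2)


records = tsv_to_records(TSV_TEXT)
json_text = records_to_json(records)
print(json_text)
```

Both encodings carry the same key/value pairs; TSV suits flat, spreadsheet-like records, while JSON can also express nesting.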

Submitted Data Types

Generated Data

For all submitted sequence data, including BAM alignment files, the GDC generates new alignments in BAM format using the latest human reference genome, GRCh38, with standard alignment pipelines. From these standard alignments, the GDC generates high-level derived data, including normal and tumor variant and mutation calls in VCF and MAF formats, and gene expression, miRNA expression, and splice junction quantification data in TSV format.
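
The generated formats named above can be told apart roughly by file name. The sketch below assumes conventional extensions (.bam, .vcf, .maf, .tsv, optionally followed by .gz); that naming convention and the guess_format helper are assumptions made for illustration, not documented GDC behavior.

```python
from pathlib import PurePath

# Assumed extension-to-format mapping. The format names come from the text
# above; the extensions are conventional guesses, not a GDC specification.
EXTENSION_TO_FORMAT = {
    ".bam": "BAM",
    ".vcf": "VCF",
    ".maf": "MAF",
    ".tsv": "TSV",
}


def guess_format(file_name):
    """Return the format label for a file name, or None if unrecognized."""
    suffixes = [s.lower() for s in PurePath(file_name).suffixes]
    # Ignore a trailing compression suffix such as ".gz".
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes:
        return None
    return EXTENSION_TO_FORMAT.get(suffixes[-1])
```

For example, guess_format("calls.vcf.gz") looks past the compression suffix and classifies the file by its .vcf extension, while an unrecognized name yields None rather than a guess.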

Generated Data Types

See our Data Model

Want to know more about how our data is organized?

Visit the GDC Data Model Page

National Cancer Institute at the National Institutes of Health