GDC FAQs

How do I query and download data from the GDC?

The GDC provides several resources for querying and downloading data: the GDC Data Portal for querying and downloading GDC data files, the GDC Data Transfer Tool for downloading large volumes of files, and the GDC Application Programming Interface (API) for performing programmatic queries and downloads.

How does the GDC maintain secure access to controlled access data?

GDC data is stored in a secure FISMA-compliant facility. Access to controlled data requires authorization via dbGaP. See Data Access Processes and Tools for more information.

How do I obtain an account to log in to the GDC?

Generally, browsing indexed GDC metadata (such as information about the cases and files contained in the GDC Data Portal) does not require a login.

eRA Commons authentication and dbGaP authorization are required before accessing controlled data, which generally includes individually identifiable information such as low-level genomic sequencing data and germline variants.

Controlled-access data users log in to the GDC using their eRA Commons accounts. The GDC then verifies that the user has authorization in dbGaP to access specific controlled datasets.

See Obtaining Access to GDC Data and Resources for more information on data download, and Obtaining Access to Submit Data for information on data submission.

Where do I go to report an issue or submit an inquiry about the GDC?

The GDC provides helpdesk support for data submission and other issues. For information on the GDC helpdesk, please visit GDC Support.

What are the system requirements for using the GDC Data Transfer Tool?

System requirements for using the GDC Data Transfer Tool are available on the GDC Data Transfer Tool page. Additional details are in the GDC Data Transfer Tool User's Guide.

What is the recommended tool and protocol for transferring large volumes of data to or from the GDC?

The GDC Data Transfer Tool is recommended for transferring large datasets to or from the GDC. For additional details, please visit the GDC Data Transfer Tool User’s Guide.

When using the GDC Data Transfer Tool, is it possible to set a bandwidth limit?

The GDC Data Transfer Tool does not offer a setting to limit the bandwidth it uses.

Does the GDC Data Transfer Tool use random or sequential read/write? Does the choice of protocol make a difference?

The GDC Data Transfer Tool uses sequential read/write for each file segment that is being transferred. By default, the tool executes multipart transfers, which results in multiple parallel, sequential read or write operations. To turn off multipart transfers, users can set the number of processes to 1.
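As a rough illustration of turning off multipart transfers, the sketch below assembles a Data Transfer Tool command line with the number of processes set to 1. It assumes the tool is installed as gdc-client and that its download subcommand accepts -m (manifest), -n (number of processes), and -t (token file); the file names are hypothetical, so check the GDC Data Transfer Tool User's Guide for the options available in your version.

```python
import shlex


def gdc_download_cmd(manifest, n_processes=None, token_file=None):
    """Build an argument list for a gdc-client download (assumed flags)."""
    cmd = ["gdc-client", "download", "-m", manifest]
    if n_processes is not None:
        # -n 1 is assumed to disable parallel, multipart transfers.
        cmd += ["-n", str(n_processes)]
    if token_file is not None:
        # A token file is only needed for controlled-access data.
        cmd += ["-t", token_file]
    return cmd


# Hypothetical manifest name; prints the command instead of running it.
print(shlex.join(gdc_download_cmd("gdc_manifest.txt", n_processes=1)))
# gdc-client download -m gdc_manifest.txt -n 1
```

The argument list can be passed to subprocess.run once the flags have been confirmed against the installed tool.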

How long do GDC authentication tokens remain valid?

GDC authentication tokens remain valid for 30 days.

How is validation performed on genomic data (BAM files) submitted to the GDC?

Submitted BAM files are validated at the GDC for file integrity and format using md5sum checks, automated QC checks, and the Picard ValidateSamFile tool. Sequencing quality is assessed using FastQC, and additional quality metrics are gathered using tools like Picard and Samtools. Severe issues, such as high cross-sample contamination, may prevent the data from being released, but minor issues typically do not result in rejection. Instead, the GDC exposes many of the quality metrics so users may review them and do further filtering. For more details, please visit: GDC Data Harmonization.

Why does the GDC have data releases and how often do they happen?

Recurrent data releases allow the GDC to version the data and allow users to reference the GDC version number in publications. The GDC currently generates releases as needed, with a goal of one release every 2-3 months.

How many bytes are there in a megabyte or gigabyte?

There has been long-standing debate about prefixes for multiples of bytes. We have chosen to use the standard supported by the International System of Units (SI), where 1 gigabyte (GB) = 10⁹ bytes and 1 megabyte (MB) = 10⁶ bytes. This convention is also supported by the IEEE, EU, NIST, and the International System of Quantities. Where appropriate, we use the IEEE 1541 recommendations for binary representation, where 1024³ bytes = 1 gibibyte (GiB) and 1024² bytes = 1 mebibyte (MiB).
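The difference between the two conventions is easiest to see with a worked conversion. The following is a minimal sketch; the function name and the 5 GB file size are made up for illustration.

```python
# Decimal (SI) and binary (IEEE 1541) unit sizes, in bytes.
SI_UNITS = {"MB": 10**6, "GB": 10**9}
BINARY_UNITS = {"MiB": 1024**2, "GiB": 1024**3}


def convert_bytes(n_bytes, unit):
    """Convert a byte count to the given SI or binary unit."""
    units = {**SI_UNITS, **BINARY_UNITS}
    if unit not in units:
        raise ValueError(f"unknown unit: {unit}")
    return n_bytes / units[unit]


size = 5 * 10**9  # a hypothetical 5 GB file
print(f"{convert_bytes(size, 'GB'):.2f} GB")    # 5.00 GB
print(f"{convert_bytes(size, 'GiB'):.2f} GiB")  # 4.66 GiB
```

The same file is therefore reported as a smaller number when a tool uses binary units.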

Why are some harmonized data files missing?

The GDC processes data through several harmonization pipelines. If the process of harmonization reveals issues in the underlying data or if an error occurred during harmonization, the harmonized data files (e.g. BAMs or VCFs) will not appear in GDC data access tools.

I only see patients with ages of 90 years or less in the GDC. Why is this?

HIPAA guidelines require that patients with ages greater than 89 years be aggregated into a single age category. This is to limit the ability to positively identify these individuals. In practice this will impact the values reported in several fields. We have chosen to accurately display the age at diagnosis, but fields that give dates or time periods after this benchmark may be compressed. This may include such fields as "Days to last follow up", "Days to last known disease status", "Days to recurrence", "Days to death", and "Year of death". Other fields found in clinical supplement files may also be impacted. In the Data Portal you will only see ages of less than or equal to 90. Individuals over 89 will all appear as 90 years old.

What web browsers are supported by the GDC?

The following web browsers are supported for use with the GDC Data Portal, Submission Portal, Website, and Documentation site.

Most recent supported stable version of Microsoft Edge
Most recent stable version of Google Chrome
Most recent stable version of Mozilla Firefox

How do I obtain access to a specific controlled dataset?

The GDC provides access to both open and controlled datasets. To access controlled datasets, users must obtain appropriate authorization through dbGaP. See Obtaining Access to Controlled Data for instructions on applying for access through dbGaP.

How do I avoid timeouts and transfer interruptions when downloading large datasets from the GDC Data Portal?

The GDC Data Portal is a web-based application that is limited by browser and network constraints. If a system timeout occurs when downloading files, please use the GDC Data Transfer Tool or contact the GDC Help Desk.

Where can I find the target and bait/probe files (BED files) that describe the capture kit used in an exome sequencing experiment?

Capture kit information is provided by the GDC API at the read group level, where available. In some cases, additional information may be available in SRA XML files.

The relevant read_group properties returned by the GDC API are:

target_capture_kit_name
target_capture_kit_catalog_number
target_capture_kit_vendor
target_capture_kit_target_region

The target_capture_kit_target_region field provides a URL for the capture kit target file, distributed by the kit manufacturer or by the research program. Bait/probe files can sometimes be found at the same URL; or a URL to the bait/probe file may be available in the SRA XML file.

Note: Some BAM files include information from multiple read groups, and sometimes read groups produced with different capture kits are included in the same BAM file. Tools are available for splitting BAM files into read groups, e.g. bamUtil.

Note: Target and bait/probe files may use an older reference genome, so liftover may be required for certain applications.
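The read group properties listed above could be requested and summarized along the following lines. This is a sketch under assumptions: the nesting path analysis.metadata.read_groups and the use of the fields parameter on the files endpoint should be checked against the GDC API documentation, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

KIT_PROPERTIES = (
    "target_capture_kit_name",
    "target_capture_kit_catalog_number",
    "target_capture_kit_vendor",
    "target_capture_kit_target_region",
)

# Assumed location of read groups under a file record.
READ_GROUP_PATH = "analysis.metadata.read_groups"


def capture_kit_query_url(file_uuid):
    """Build a files-endpoint URL that asks only for capture kit fields."""
    fields = ",".join(f"{READ_GROUP_PATH}.{prop}" for prop in KIT_PROPERTIES)
    query = urlencode({"fields": fields})
    return f"https://api.gdc.cancer.gov/files/{file_uuid}?{query}"


def distinct_capture_kits(read_groups):
    """Return the distinct capture kits seen across a file's read groups."""
    return {tuple(rg.get(prop) for prop in KIT_PROPERTIES) for rg in read_groups}
```

If distinct_capture_kits returns more than one entry for a BAM file, its read groups were produced with different capture kits and may need to be split before analysis.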

Are unmapped reads available in the GDC Data Portal?

Harmonized BAM files from RNA-seq and DNA-seq experiments will contain both mapped and unmapped reads, if available. Unmapped reads are not distributed separately.

How can I access GDC sequencing data in FASTQ format?

Raw sequencing files submitted to the GDC are processed using GDC Genomic Data Alignment pipelines. The processed data are made available in the GDC Data Portal as BAM files containing aligned reads and unmapped reads (if available). No reads are hard-clipped, but reads that were flagged as "failed" during an Illumina sequencing run are discarded.

Third-party tools such as biobambam2 or Samtools fastq can convert these files to FASTQ sequencing data. Note that DNA-Seq quality scores are modified during the score recalibration co-cleaning step, so third-party tool parameters must be set to retrieve the original scores (biobambam2: tryoq=1; samtools fastq: -O). Because GDC harmonized BAM files may contain multiple read groups, the conversion parameter should be set to retain read group IDs in the generated FASTQ files (biobambam2: outputperreadgroup=1; samtools: samtools split).
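The parameters named above can be combined as in the following sketch, which only assembles the command lines and does not run them. The input and output file names are hypothetical, and the remaining options (the bamtofastq filename and outputdir keys, the samtools -1/-2 outputs, and the samtools split -f format string) are assumptions to confirm against the documentation of the installed tool versions.

```python
import shlex


def bamtofastq_cmd(bam, out_dir):
    """biobambam2 bamtofastq: original qualities, one output set per read group."""
    return [
        "bamtofastq",
        f"filename={bam}",
        "tryoq=1",               # prefer original (pre-recalibration) scores
        "outputperreadgroup=1",  # keep read groups in separate FASTQ files
        f"outputdir={out_dir}",
    ]


def samtools_split_cmd(bam, out_dir):
    """samtools split: write one BAM per read group (named by read group ID)."""
    return ["samtools", "split", "-f", f"{out_dir}/%!.%.", bam]


def samtools_fastq_cmd(bam, out_prefix):
    """samtools fastq with -O to use original qualities where present.

    Paired output assumes the input reads are grouped by name.
    """
    return [
        "samtools", "fastq", "-O",
        "-1", f"{out_prefix}_1.fq.gz",
        "-2", f"{out_prefix}_2.fq.gz",
        bam,
    ]


for cmd in (
    bamtofastq_cmd("sample.bam", "fastq_out"),
    samtools_split_cmd("sample.bam", "split_out"),
    samtools_fastq_cmd("split_out/RG1.bam", "fastq_out/RG1"),
):
    print(shlex.join(cmd))
```

With samtools, splitting by read group first and then converting each per-read-group BAM keeps the read group identity in the output file names.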

What is the difference between files that have the same filename?

The file detail page and the metadata files accessible from that page (if available) can be used to determine the difference between files that share the same filename. For example, the files may be associated with different aliquots, or different patients.

How can I download BAM index files (BAI files) using the API?

BAI files are included with the download when using the GDC Data Transfer Tool to download BAM files.

When using the API to download BAM files, BAI files will only be included if the related_files=true parameter is specified together with the BAM UUID, for example:

https://api.gdc.cancer.gov/data/b0986067-67c0-40a9-afeb-264fcddebe96?related_files=true

Alternatively, users can determine the BAI file UUID from the API files endpoint by supplying the BAM UUID. The BAI file UUID can then be used to download the BAI file from the data endpoint.

https://api.gdc.cancer.gov/files/b0986067-67c0-40a9-afeb-264fcddebe96?pretty=true&expand=index_files

https://api.gdc.cancer.gov/data/ff095510-032e-4d44-9310-d02d1f7d7597

Note: BAI files are not available for sliced BAM files.
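The two approaches can be sketched as follows. The URL patterns are taken from the examples above; the helper names are hypothetical, and the response shape assumed in bai_uuids (a data object containing an index_files list with file_id entries) should be verified against an actual API response. The sketch only builds URLs and parses an already-decoded response, so downloading controlled-access files would additionally require sending a GDC authentication token with the request.

```python
from urllib.parse import urlencode

GDC_API = "https://api.gdc.cancer.gov"


def data_url(file_uuid, related_files=False):
    """URL for downloading a file; related_files=True also requests the BAI."""
    url = f"{GDC_API}/data/{file_uuid}"
    if related_files:
        url += "?" + urlencode({"related_files": "true"})
    return url


def index_lookup_url(bam_uuid):
    """URL for the files endpoint, expanded to list the BAM's index files."""
    query = urlencode({"pretty": "true", "expand": "index_files"})
    return f"{GDC_API}/files/{bam_uuid}?{query}"


def bai_uuids(files_response):
    """Extract index file UUIDs from a parsed files-endpoint response.

    Assumes the shape {"data": {"index_files": [{"file_id": ...}, ...]}}.
    """
    index_files = files_response.get("data", {}).get("index_files", [])
    return [entry["file_id"] for entry in index_files]


bam = "b0986067-67c0-40a9-afeb-264fcddebe96"
print(data_url(bam, related_files=True))
print(index_lookup_url(bam))
```

Each UUID returned by bai_uuids can then be passed to data_url to download the index file on its own.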

Does the GDC provide access to follow-up (i.e. longitudinal) data?

The availability of follow-up data is specific to the project and associated study.

For the Multiple Myeloma Research Foundation (MMRF) Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile (CoMMpass) study, longitudinal information was generated to track patients over the course of their disease. This data is available in the GDC by viewing the clinical follow-up data available for download on each case page in the GDC Data Portal or by querying the GDC API.

For TCGA, follow-up data is available for specific TCGA studies and made available for download in associated clinical supplement files (i.e. clinical XML, biotabs). The follow-up data provided can differ among TCGA studies.

Why does the treatment data appear to be incomplete and what treatment data is available in the GDC?

Submitting treatment data is optional as not all projects are associated with treatment studies. For TCGA projects, for example, not all projects and cases have treatment data. For TCGA projects with treatment data, information is available in applicable clinical supplement files (i.e. clinical XML, biotabs). For other projects associated with treatment studies in which the treatment data has been submitted to the GDC, treatment data is available for download in JSON and TSV format. These studies may also contain clinical supplement files.

How do I access data from TCGA marker or other landmark cancer genomics papers?

The TCGA marker and other landmark cancer genomics papers, as well as associated supplemental files, are available on the GDC Publication Pages. The Publication Pages provide access to publication information and supplementary files.

Why is the data maintained in cBioPortal, Broad Firehose, or the Seven Bridges Cancer Genomics Cloud different from the GDC data?

The GDC harmonizes data across projects. This includes aligning the genomic data to a common reference genome (GRCh38) and generating higher level data using GDC bioinformatics pipelines. Other repositories may process the data differently.

For example, TCGA data in cBioPortal uses the original mutation data generated by the individual TCGA sequencing centers. The source of the data is the Broad Firehose (or the publication pages for data that matches a specific manuscript). These data are usually a combination of two mutation callers, but they differ by center (typically a variant caller like MuTect plus an indel caller), and sequencing centers have modified their mutation calling pipelines over time. TCGA data in the GDC is harmonized with the latest reference genome (GRCh38). Mutations are called using four variant callers: MuTect, VarScan2, MuSE, and Pindel.

What is the difference between tissue "collection" and tissue "procurement" in TCGA data?

TCGA “collection” represents the collection of the sample for TCGA, whereas “procurement” represents the removal of tissue from the patient.

Where can I find clinical data elements specific to my cancer research of interest?

The GDC supports the submission of clinical and biospecimen supplements. Supplemental files can be downloaded from the GDC by searching for the Data Type "Clinical Supplement" or "Biospecimen Supplement" from the facet search in the GDC Data Portal Repository. For TCGA data, the supplement data is provided in XML documents and tab delimited files (biotabs). These files, in varying degrees, provide information on marker status (e.g. EBV status), treatment regimen, slide magnification, histology distinctions, and staging questions.

What data types were updated in DR 32 (GENCODE Update Release)?

Replaced all RNA-Seq data including: Alignments, Gene Expression (STAR) + New Normalization, Transcript Fusion
    Removed HTSeq Files
    Re-harmonized TCGA data to use the newer pipeline
Generated and versioned new annotated somatic mutations and Ensemble MAFs
    Re-harmonized TCGA data to use the newer pipeline (alignments + mutation calls)
Generated and versioned structural variant and gene level copy number data
Re-harmonized TCGA methylation data to use the new SeSAMe pipeline
Generated and versioned CPTAC-3 scRNA-Seq data
Replaced gene level copy number files for TCGA with those harmonized using ASCAT
Replaced somatic mutation files for FM-AD and transitioned to aliquot-level MAFs
Replaced all GENIE files

Why are there fewer open access TCGA mutations in DR 32 (GENCODE Update Release)?

There are fewer open-access mutations primarily because of two strategies that improve quality: 1) TCGA now uses a 2-caller ensemble instead of a single caller; 2) variants are now removed if they fall outside of the target capture region, instead of outside of a combined “target capture + GAF exonic region”. Additionally, TCGA was the original project for which GDC open-access variants were produced, and it used variant rescue steps that applied only to TCGA. To keep the variant-calling pipeline consistent across projects, the GDC no longer rescues MC3 and TCGA validation variants.

Can GENCODE v22 data still be downloaded from the GDC Data Portal?

Although GENCODE v22 data cannot be browsed in the GDC Data Portal, it can still be downloaded using the GDC Data Transfer Tool or API. You will need to either have a previous manifest or use known UUIDs to download v22 files.

How often does the GDC update the workflow/reference genome? If the GDC updates the workflow/reference genome, does the GDC re-process all data sets?

For the reference genome, the GDC has been using an augmented version of GRCh38.p2 (with additional decoy sequences and virus sequences) since inception. The GDC does not use alternative contigs, and only derives high-level data from the major chromosomes, so the same reference genome is used for both gene model GENCODE v22 (from Data Release 1 to 31) and GENCODE v36 (from Data Release 32). As future versions of the reference genome are released, e.g., GRCh39, the GDC will evaluate the benefits of updating data to utilize the new version. If the reference genome is updated, the GDC would expect to re-process all data sets. For information on the reference genome used by the GDC, please refer to the GDC Reference Files.

For workflow updates, the GDC prefers to keep the workflow stable and will not update it unless the update is necessary, such as an update of the reference genome or gene model, or a major algorithm update in the tools that could result in significant changes in the generated data. When workflow updates are needed, the GDC categorizes them as either major or minor, depending on whether the update significantly affects the output data. The GDC re-processes all existing data sets for major workflow updates. Examples include transitioning the RNA-Seq genomic BAM alignment workflow to a new version that generates three BAMs and STAR counts, and updating the MAF workflow to add functions to the MAF files. Minor updates mostly resolve bugs, security issues, and/or compatibility issues. For example, the GDC DNA-Seq alignment workflow has been updated several times to address quality issues from various submitted data; however, because the main alignment algorithm remains almost the same, the GDC does not need to re-process all the data sets for these minor updates.

Does the GDC provide access to germline variants?

Germline SNP calls are not available for exploration in the GDC Data Portal. Instead, alignments for germline data are available under controlled access. Users with appropriate access may use the alignments to generate germline variants.

Some somatic variant callers, such as MuTect2, also output somatic calls that carry some possibility of being germline, such as those labelled "germline_risk". Please note that these calls are by no means germline variants. They are somatic calls with a borderline probability of germline risk.

Why did the GDC remove SomaticSniper?

The SomaticSniper whole exome variant caller was one of the first-generation somatic mutation callers developed by the scientific community. It works best with blood cancers that have a high level of tumor-in-normal contamination, but is often overly permissive for solid tumors. Since our first data release in 2016, the GDC has gradually adopted newer tools or new tool versions, and has shifted the focus of somatic variant calling from any single caller to a multi-caller ensemble.

After comparing ensemble calls with and without SomaticSniper and also receiving feedback from the authors of SomaticSniper, the GDC decided to remove this tool from our production in Data Release 35. The GDC still maintains four other whole exome variant callers: MuSE, MuTect2, Pindel, and VarScan2.

Why do some projects with WGS structural variant data have BEDPE files and some projects do not?

Generally, any WGS data should have associated structural variant files (BEDPE), except when there are no tumor/normal matches or when variant calling has not been implemented yet.

In the GDC Data Portal, when I filter by Experimental Strategy (e.g., RNA-Seq) in the Cohort Builder Available Data filter, why is this filter not working when I navigate to Repository?

Filtering in Cohort Builder results in a set of cases. Cases typically have available data for multiple Experimental Strategies (e.g., cases with RNA-Seq data can also have DNA-Seq data). As such, when navigating to Repository, additional data besides RNA-Seq data is displayed. Repository filters are file-based. When filtering Repository by Experimental Strategy (e.g., RNA-Seq), only files associated with that Experimental Strategy are displayed.
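The distinction between case-based and file-based filtering can be illustrated with a toy model. The data and function names below are invented for illustration and do not reflect how the Data Portal is implemented.

```python
# Toy data: each case lists the experimental strategies of its files.
CASES = {
    "case-1": ["RNA-Seq", "WXS"],
    "case-2": ["WXS"],
    "case-3": ["RNA-Seq", "WGS"],
}


def cohort_by_strategy(cases, strategy):
    """Case-based filter: keep cases that have at least one matching file."""
    return sorted(case for case, files in cases.items() if strategy in files)


def files_for_cohort(cases, cohort):
    """All files belonging to the cohort's cases, whatever their strategy."""
    return [(case, s) for case in cohort for s in cases[case]]


def filter_files(files, strategy):
    """File-based filter: keep only files with the matching strategy."""
    return [(case, s) for case, s in files if s == strategy]


cohort = cohort_by_strategy(CASES, "RNA-Seq")
all_files = files_for_cohort(CASES, cohort)
print(cohort)                              # ['case-1', 'case-3']
print(all_files)                           # includes the WXS and WGS files
print(filter_files(all_files, "RNA-Seq"))  # RNA-Seq files only
```

The cohort filter selects two cases, but those cases bring four files with them; only the second, file-level filter narrows the list to RNA-Seq.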

In the GDC Data Portal, where is the histogram of top frequently mutated genes for a cohort?

To view the top frequently mutated genes for a cohort, first build a cohort using the Cohort Builder and then select the Mutation Frequency tool in the Analysis Center.

In Cohort Builder, why are there multiple tissue or organ of origin values that appear not to be associated with the primary diagnosis?

Filtering in Cohort Builder results in a set of cases. A case can have multiple diagnoses, and a tissue or organ of origin may be related to a secondary diagnosis.

Why do some genes show no expression in STAR results across all samples, even though I can see mapped reads in the raw RNA-Seq data?

STAR gene expression quantification excludes reads that are mapped to multiple different genes. This can cause some genes to appear with zero expression in the final counts, even if mapped reads are present in the raw data.

One common reason for this is gene overlap. These genes often have their exons entirely encompassed within other genes, and in such cases, STAR cannot assign reads to them because they are ambiguous. To check if a gene falls into this category, you can refer to the following lists: Stranded Counting Overlap Gene List: overlap.gene.stranded.tsv, and Strandless Counting Overlap Gene List: overlap.gene.strandless.tsv.
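A toy model shows how a gene nested inside another gene can end up with zero counts. This is only an illustration of the ambiguity rule described above, using whole-gene intervals and made-up coordinates; it is not STAR's actual counting procedure, which works on annotated features and strand information.

```python
from collections import Counter


def overlapping_genes(read, genes):
    """Names of genes whose half-open interval overlaps the read."""
    start, end = read
    return [
        name
        for name, (g_start, g_end) in genes.items()
        if start < g_end and end > g_start
    ]


def count_unambiguous(reads, genes):
    """Count a read only when it overlaps exactly one gene."""
    counts = Counter({name: 0 for name in genes})
    for read in reads:
        hits = overlapping_genes(read, genes)
        if len(hits) == 1:
            counts[hits[0]] += 1
    return dict(counts)


# INNER lies entirely within OUTER, so every read on INNER is ambiguous.
genes = {"OUTER": (100, 1000), "INNER": (400, 600)}
reads = [(150, 250), (450, 550), (500, 580), (700, 800)]
print(count_unambiguous(reads, genes))  # {'OUTER': 2, 'INNER': 0}
```

Two reads map inside INNER, yet its count stays at zero because each of them also overlaps OUTER.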

Why can’t I download data immediately after receiving dbGaP access to a study?

It may take up to 24 hours for the GDC to sync with dbGaP. Please allow this time after receiving dbGaP authorization before attempting to access the study’s data on the GDC.

Why does TCGAbiolinks no longer work when retrieving diagnosis?

TCGA clinical data was expanded in GDC Data Releases 42 and 43. TCGA clinical data used to have one diagnosis per case. With the clinical data expansion, it is possible that a TCGA case has multiple diagnoses. This could be due to pre-enrollment diagnoses or other reasons. To properly query for the diagnosis information associated with the molecular data, the primary disease flag should be set to true (i.e., diagnosis_is_primary_disease = true).
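For clinical data that has already been retrieved, the primary diagnosis can be selected as in the sketch below. The case record shown is invented, and the assumption that each case carries a diagnoses list whose entries include diagnosis_is_primary_disease (as a boolean or the string "true") should be checked against the data actually returned.

```python
def is_primary(diagnosis):
    """True when the diagnosis is flagged as the primary disease."""
    flag = diagnosis.get("diagnosis_is_primary_disease")
    return str(flag).lower() == "true"


def primary_diagnoses(case):
    """Return the diagnoses of a case that are flagged as primary disease."""
    return [d for d in case.get("diagnoses", []) if is_primary(d)]


# Hypothetical case record with a pre-enrollment diagnosis and a primary one.
case = {
    "submitter_id": "EXAMPLE-CASE",
    "diagnoses": [
        {"submitter_id": "dx-prior", "diagnosis_is_primary_disease": False},
        {"submitter_id": "dx-main", "diagnosis_is_primary_disease": True},
    ],
}
print([d["submitter_id"] for d in primary_diagnoses(case)])  # ['dx-main']
```

Code that previously took the first or only diagnosis of a case can apply this filter before joining diagnosis information to molecular data.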
