GDC FAQs

What is the NCI Genomic Data Commons (GDC)?

The NCI Genomic Data Commons (GDC) is the next-generation repository and cancer knowledge base supporting the import and standardization of genomic and clinical data from cancer research programs (e.g. TCGA, TARGET, CGCI), the harmonization of sequence data to the genome / transcriptome, and the application of state-of-the-art methods for derived data (e.g. mutation calls, structural variants).

The NCI, part of the National Institutes of Health (NIH) and the U.S. Department of Health and Human Services (DHHS), established the GDC to provide the cancer research community with a data service supporting the receipt, quality control, integration, storage, and redistribution of standardized cancer genomic data sets derived from cancer studies. Please visit About the GDC for additional information.

What are the goals of the GDC?

The primary goal of the GDC is to provide the cancer research community with a unified repository and cancer knowledge base supporting cancer genomic studies. The cancer knowledge base enables the identification of low-frequency cancer drivers, assists in defining genomic determinants of response to therapy, and informs the composition of clinical trial cohorts sharing targeted genetic lesions. Working towards this goal, the GDC provides resources supporting the receipt, quality control, integration, storage, and redistribution of standardized cancer genomic data sets derived from cancer studies. Please visit the GDC Overview for additional information.

How can I collaborate with the GDC?

The GDC welcomes collaborations with organizations conducting research in or providing informatics supporting cancer genomics. Organizations interested in collaborating with the GDC should contact GDC Support.

Are there restrictions on the use of GDC data in publications?

No. All GDC data can be used in publications or presentations. For additional questions about the use of GDC data, or to explore opportunities for collaboration, please contact GDC Support.

How do I cite the NCI GDC?

Please credit the NCI Genomic Data Commons (GDC) in your manuscript by citing the following paper about the GDC:

Heath, A.P., Ferretti, V., Agrawal, S. et al. The NCI Genomic Data Commons. Nat Genet 53, 257-262 (2021). https://doi.org/10.1038/s41588-021-00791-5

When citing individual projects, please refer to the attribution policies of the project when available.

How do I query and download data from the GDC?

The GDC provides several resources for querying and downloading data: the GDC Data Portal for querying and downloading GDC data files, the GDC Data Transfer Tool for downloading large volumes of files, and the GDC Application Programming Interface (API) for performing programmatic queries and downloads.

How do I submit data into the GDC?

Information on the data submission processes and tools is available on the GDC Data Submission Processes and Tools page. Detailed instructions for submitting data into the GDC are provided in the GDC Data Submission Portal User's Guide. Per GDC Policy, organizations interested in submitting data into the GDC must first apply for data submitter access through the NIH database of Genotypes and Phenotypes (dbGaP).

What data types and file formats does the GDC support?

Please visit the GDC Data Types and File Formats for a list of the standard data types supported by the GDC.

What reference genome is the GDC harmonized against?

The GDC is harmonized against GRCh38. Please see GDC Data Harmonization for additional information on the GDC pipelines for re-aligning genomic data.

How does the GDC generate high level data?

The GDC generates high level data for germline and somatic genotyping, RNA-Seq quantification and structural analysis, SNP Array Genotyping and CNV Calls, and variant annotations. Please visit GDC Data Harmonization for additional information on the GDC high level data generation pipelines.

How does the GDC maintain secure access to controlled access data?

GDC data is stored in a secure FISMA-compliant facility. Access to controlled data requires authorization via dbGaP. See Data Access Processes and Tools for more information.

How do I obtain an account to log in to the GDC?

Generally, browsing indexed GDC metadata (such as information about the cases and files contained in the GDC Data Portal) does not require a login.

eRA Commons authentication and dbGaP authorization are required before accessing controlled data, which generally includes individually identifiable information such as low level genomic sequencing data and germline variants.

Controlled-access data users log in to the GDC using their eRA Commons accounts. The GDC then verifies that the user has authorization in dbGaP to access specific controlled datasets.

See Obtaining Access to GDC Data and Resources for more information on data download, and Obtaining Access to Submit Data for information on data submission.

Where do I go to report an issue or submit an inquiry about the GDC?

The GDC provides helpdesk support for data submission and other issues. For information on the GDC helpdesk, please visit GDC Support.

How do I create an advanced search query?

Users can perform advanced SQL-like queries using the GDC Data Portal Search interface. Instructions on using the GDC Data Portal Search interface are available in the GDC Data Portal User's Guide.

What are the system requirements for using the GDC Data Transfer Tool?

System requirements for using the GDC Data Transfer Tool are available on the GDC Data Transfer Tool page. Additional details are in the GDC Data Transfer Tool User's Guide.

How do I register my project with the GDC?

Once the project has been registered through dbGaP please contact the GDC Helpdesk for assistance with setting up a new project.

Where do I go to find code examples for using the GDC API?

The GDC provides code examples in the GDC Application Programming Interface (API) User's Guide.

When is GDC maintenance performed?

The GDC maintenance window occurs semi-monthly on a Saturday, from 8:00 am to 4:00 pm CST / 9:00 am to 5:00 pm EST.

What is the recommended tool and protocol for transferring large volumes of data to or from the GDC?

The GDC Data Transfer Tool is recommended for transferring large datasets to or from GDC. For additional details, please visit the GDC Data Transfer Tool User’s Guide.

When using the GDC Data Transfer Tool, is it possible to set a bandwidth limit?

The GDC Data Transfer Tool does not offer a setting to limit the bandwidth it uses.

Does the GDC Data Transfer Tool use random or sequential read/write? Does the choice of protocol make a difference?

The GDC Data Transfer Tool uses sequential read/write for each file segment that is being transferred. By default, the tool executes multipart transfers, which results in multiple parallel, sequential read or write operations. To turn off multipart transfers, users can set the number of processes to 1.

How long do GDC authentication tokens remain valid?

GDC authentication tokens remain valid for 30 days.

What steps must be taken in dbGaP before data can be submitted to the GDC?

The study and Subject IDs must be registered in dbGaP. For additional details, please visit: Obtaining Access to Submit Data.

How is validation performed on genomic data (BAM files) submitted to the GDC?

The GDC validates genomic data (BAM files) using FastQC and Picard. For additional details, please visit: GDC Data Harmonization.

What is the process for uploading, submitting, and releasing data in the GDC?

Uploaded and validated data is held in a workspace until the user formally submits the data to the GDC. This allows users to interact with the data before submitting. Once the data is submitted, the GDC will process applicable datasets (e.g. harmonize molecular data and generate high level data). After processing has been completed, the data is released according to GDC Data Sharing Policies. The data becomes accessible through GDC tools (GDC Data Portal, GDC APIs) on an open or controlled access basis according to the dbGaP authorization policies associated with the data set. For additional information, please visit: GDC Data Submission Processes and Tools.

Why does the GDC have data releases and how often do they happen?

Recurrent data releases allow the GDC to version the data and allow users to reference the GDC version number in publications. GDC currently generates releases as needed, with a release every 2-3 months as a goal.

Where can I find more information about the GDC data model?

The GDC employs a hierarchical data model which requires metadata and files to be attached only at particular nodes or points in the hierarchy. If you have questions, please review the GDC Data Model or contact GDC Support.

How many bytes are there in a megabyte or gigabyte?

There has been long standing debate about prefixes for multiples of bytes. We have chosen to utilize the standard supported by the International System of Units (SI) where 1 gigabyte (GB) = 10⁹ bytes or 1 megabyte (MB) = 10⁶ bytes. This convention is also supported by the IEEE, EU, NIST, and the International System of Quantities. Where appropriate, we utilize the IEEE 1541 recommendations for binary representation where 1024³ bytes = 1 gibibyte (GiB) or 1024² bytes = 1 mebibyte (MiB).
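The difference between the two conventions can be shown with a small conversion sketch. The helper name `format_size` is hypothetical and is not part of any GDC tool; only the unit definitions come from the text above.

```python
# Decimal (SI) and binary (IEEE 1541) byte multiples, as defined above.
MB = 10**6      # megabyte
GB = 10**9      # gigabyte
MiB = 1024**2   # mebibyte
GiB = 1024**3   # gibibyte


def format_size(num_bytes: int, binary: bool = False) -> str:
    """Format a byte count in GB (SI) or GiB (binary) with two decimals.

    Hypothetical helper for illustration only.
    """
    if binary:
        return f"{num_bytes / GiB:.2f} GiB"
    return f"{num_bytes / GB:.2f} GB"


size = 5 * GiB                         # 5,368,709,120 bytes
print(format_size(size))               # reported in SI units
print(format_size(size, binary=True))  # reported in binary units
```

The same file therefore shows a slightly larger number in GB than in GiB, which is worth keeping in mind when comparing sizes reported by different tools.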

Why are some harmonized data files missing?

The GDC processes data through several harmonization pipelines. If the process of harmonization reveals issues in the underlying data or if an error occurred during harmonization, the harmonized data files (e.g. BAMs or VCFs) will not appear in GDC data access tools.

I only see patients with ages of 90 years or less in the GDC. Why is this?

HIPAA guidelines require that patients with ages greater than 89 years be aggregated into a single age category. This is to limit the ability to positively identify these individuals. In practice this will impact the values reported in several fields. We have chosen to accurately display the age at diagnosis, but fields that give dates or time periods after this benchmark may be compressed. This may include such fields as "Days to last follow up", "Days to last known disease status", "Days to recurrence", "Days to death", and "Year of death". Other fields found in clinical supplement files may also be impacted. In the Data Portal you will only see ages of less than or equal to 90. Individuals over 89 will all appear as 90 years old.

What web browsers are supported by the GDC?

The following web browsers are supported for use with the GDC Data Portal, Submission Portal, Website, and Documentation site.

Most recent supported stable version of Microsoft Edge
Most recent stable version of Google Chrome
Most recent stable version of Mozilla Firefox

How do I obtain access to a specific controlled dataset?

The GDC provides access to both open and controlled datasets. To access controlled datasets, users must obtain appropriate authorization through dbGaP. See Obtaining Access to Controlled Data for instructions on applying for access through dbGaP.

How do I avoid timeouts and transfer interruptions when downloading large datasets from the GDC Data Portal?

The GDC Data Portal is a web-based application that is limited by browser and network constraints. If a system timeout occurs when downloading files, please use the GDC Data Transfer Tool or contact the GDC Help Desk.

Why do the metadata files I am trying to submit fail to validate?

The GDC Data Submission Portal checks XML, JSON, and TSV metadata files for validity at the time they are submitted. If your files fail to validate, please check the error report and review the GDC Data Dictionary for troubleshooting these errors. Additional information on supported files and formats can be found on the GDC Data Model and File Formats pages, and in the GDC Data Submission Portal User's Guide.

Where can I download TCGA protein array (RPPA) data?

TCGA RPPA data is available in the GDC Data Portal.

Where can I download TCGA DNA methylation data?

DNA methylation data collected by TCGA has been harmonized using the SeSAMe pipeline and is available at the GDC Data Portal.

Where can I find the target and bait/probe files (BED files) that describe the capture kit used in an exome sequencing experiment?

Capture kit information is provided by the GDC API at the read group level, where available. In some cases, additional information may be available in SRA XML files.

The relevant read_group properties returned by the GDC API are:

target_capture_kit_name
target_capture_kit_catalog_number
target_capture_kit_vendor
target_capture_kit_target_region

The target_capture_kit_target_region field provides a URL for the capture kit target file, distributed by the kit manufacturer or by the research program. Bait/probe files can sometimes be found at the same URL; or a URL to the bait/probe file may be available in the SRA XML file.
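As a rough sketch, a request URL for these properties could be assembled as follows. It assumes that the files endpoint accepts a comma-separated `fields` parameter and that the read group properties sit under the path `analysis.metadata.read_groups`; the function name is hypothetical, and the UUID shown is only a placeholder for a BAM file of interest.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.gdc.cancer.gov"

# Read group properties listed above.
CAPTURE_KIT_FIELDS = [
    "target_capture_kit_name",
    "target_capture_kit_catalog_number",
    "target_capture_kit_vendor",
    "target_capture_kit_target_region",
]


def capture_kit_query_url(file_uuid: str) -> str:
    """Build a files-endpoint URL requesting the capture kit properties.

    Assumption: the properties are exposed under
    "analysis.metadata.read_groups" and selected with "fields".
    """
    prefix = "analysis.metadata.read_groups."
    fields = ",".join(prefix + name for name in CAPTURE_KIT_FIELDS)
    query = urlencode({"fields": fields, "pretty": "true"})
    return f"{API_ROOT}/files/{file_uuid}?{query}"


# Placeholder UUID; substitute the UUID of the file being inspected.
print(capture_kit_query_url("53f4ad60-0777-409c-a34d-ca4442dc9c44"))
```

The printed URL can be opened in a browser or passed to any HTTP client; the sketch itself makes no network request.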

Note: Some BAM files include information from multiple read groups, and sometimes read groups produced with different capture kits are included in the same BAM file. Tools are available for splitting BAM files into read groups, e.g. bamutil.

Note: Target and bait/probe files may use an older reference genome, so liftover may be required for certain applications.

Are unmapped reads available in the GDC Data Portal?

Harmonized BAM files from RNA-seq and DNA-seq experiments will contain both mapped and unmapped reads, if available. Unmapped reads are not distributed separately.

How can I access GDC sequencing data in FASTQ format?

Raw sequencing files submitted to the GDC are processed using GDC Genomic Data Alignment pipelines. The processed data are made available in the GDC Data Portal as BAM files containing aligned reads and unmapped reads (if available). No reads are hard-clipped, but reads that were flagged as "failed" during an Illumina sequencing run are discarded.

Third-party tools such as biobambam2 or Samtools fastq can convert these files to FASTQ sequencing data. Note that DNA-Seq quality scores are modified during the score recalibration co-cleaning step, so third-party tool parameters must be set to retrieve the original scores (biobambam2: tryoq=1; samtools fastq: -O). Because GDC harmonized BAM files may contain multiple read groups, the conversion parameter should be set to retain read group IDs in the generated FASTQ files (biobambam2: outputperreadgroup=1; samtools: samtools split).

Why might variants found in TCGA-generated MAFs be missing from the GDC open access MAF files?

Some of the reasons particular mutations may have been removed include updates to third party databases, more conservative germline-masking rules by the GDC, and different mutation calling pipelines and versions. Despite these differences, the GDC recaptures over 97% of TCGA-validated variants in the controlled-access MAF files. The GDC suggests using controlled-access MAF files if important variants cannot be found in somatic MAF files.

What is the difference between files that have the same filename?

The file detail page and the metadata files accessible from that page (if available) can be used to determine the difference between files that share the same filename. For example, the files may be associated with different aliquots, or different patients.

How can I download BAM index files (BAI files) using the API?

BAI files are included with the download when using the GDC Data Transfer Tool to download BAM files.

When using the API to download BAM files, BAI files will only be included if the related_files=true parameter is specified together with the BAM UUID, for example:

https://api.gdc.cancer.gov/data/53f4ad60-0777-409c-a34d-ca4442dc9c44?related_files=true

Alternatively, users can determine the BAI file UUID from the API files endpoint by supplying the BAM UUID. The BAI file UUID can then be used to download the BAI file from the data endpoint.

https://api.gdc.cancer.gov/files/53f4ad60-0777-409c-a34d-ca4442dc9c44?pretty=true&expand=index_files

https://api.gdc.cancer.gov/data/60cefd89-b428-46b7-b5b0-3b6e2743ab20
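The two approaches above can be sketched in Python. The helper names are hypothetical, and the response layout assumed in `index_file_ids` (a `data` object holding an `index_files` list whose entries carry a `file_id`) should be checked against an actual API response before relying on it.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.gdc.cancer.gov"


def data_url(file_uuid: str, related_files: bool = False) -> str:
    """URL for the data endpoint, optionally requesting related files (e.g. BAI)."""
    url = f"{API_ROOT}/data/{file_uuid}"
    if related_files:
        url += "?" + urlencode({"related_files": "true"})
    return url


def index_lookup_url(bam_uuid: str) -> str:
    """URL for the files endpoint that expands the index file records."""
    query = urlencode({"pretty": "true", "expand": "index_files"})
    return f"{API_ROOT}/files/{bam_uuid}?{query}"


def index_file_ids(response: dict) -> list:
    """Extract index file UUIDs from a parsed JSON response.

    Assumed layout: {"data": {"index_files": [{"file_id": "..."}]}}
    """
    entries = response.get("data", {}).get("index_files", [])
    return [entry["file_id"] for entry in entries]


bam_uuid = "53f4ad60-0777-409c-a34d-ca4442dc9c44"
print(data_url(bam_uuid, related_files=True))  # BAM together with its BAI
print(index_lookup_url(bam_uuid))              # look up the BAI UUID first
```

Once the BAI UUID is known, `data_url(bai_uuid)` gives the download URL for the index file alone. Downloading controlled-access files additionally requires an authentication token, which this sketch does not handle.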

Note: BAI files are not available for sliced BAM files.

From what data are mutation-based visualization features derived?

The mutation-based visualization features are derived from open-access MAF files produced by GDC variant-calling pipelines.

Why are there some projects without data analysis and visualization features?

Data analysis and visualization features are only available for projects which maintain open-access MAF files. Programs such as TARGET maintain controlled-access MAF files only. As such, data analysis and visualization cannot be applied to TARGET projects.

In the Most Frequent Mutations table, which algorithm in the VEP does the GDC use to determine an impact score of “H” or “M”?

The IMPACT rating is assigned according to the Sequence Ontology (SO) term of the variant's consequence. It is a separate rating given for compatibility with other variant annotation tools (e.g. snpEff). Each category is associated with a set of SO terms:

HIGH: The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay: transcript_ablation, splice_acceptor_variant, splice_donor_variant, stop_gained, frameshift_variant, stop_lost, start_lost, transcript_amplification
MODERATE: A non-disruptive variant that might change protein effectiveness: inframe_insertion, inframe_deletion, missense_variant, protein_altering_variant, regulatory_region_ablation
LOW: Assumed to be mostly harmless or unlikely to change protein behavior: splice_region_variant, incomplete_terminal_codon_variant, stop_retained_variant, synonymous_variant
MODIFIER: Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact: coding_sequence_variant, mature_miRNA_variant, 5_prime_UTR_variant, 3_prime_UTR_variant, non_coding_transcript_exon_variant, intron_variant, NMD_transcript_variant, non_coding_transcript_variant, upstream_gene_variant, downstream_gene_variant, TFBS_ablation, TFBS_amplification, TF_binding_site_variant, regulatory_region_amplification, feature_elongation, regulatory_region_variant, feature_truncation, intergenic_variant
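The categories above amount to a lookup from SO term to IMPACT rating. A minimal sketch of that lookup follows; the names `IMPACT_CATEGORIES` and `vep_impact` are hypothetical, and the term lists are copied from the categories above rather than taken from VEP itself.

```python
# SO terms grouped by IMPACT category, as listed above.
IMPACT_CATEGORIES = {
    "HIGH": [
        "transcript_ablation", "splice_acceptor_variant", "splice_donor_variant",
        "stop_gained", "frameshift_variant", "stop_lost", "start_lost",
        "transcript_amplification",
    ],
    "MODERATE": [
        "inframe_insertion", "inframe_deletion", "missense_variant",
        "protein_altering_variant", "regulatory_region_ablation",
    ],
    "LOW": [
        "splice_region_variant", "incomplete_terminal_codon_variant",
        "stop_retained_variant", "synonymous_variant",
    ],
    "MODIFIER": [
        "coding_sequence_variant", "mature_miRNA_variant", "5_prime_UTR_variant",
        "3_prime_UTR_variant", "non_coding_transcript_exon_variant",
        "intron_variant", "NMD_transcript_variant",
        "non_coding_transcript_variant", "upstream_gene_variant",
        "downstream_gene_variant", "TFBS_ablation", "TFBS_amplification",
        "TF_binding_site_variant", "regulatory_region_amplification",
        "feature_elongation", "regulatory_region_variant",
        "feature_truncation", "intergenic_variant",
    ],
}

# Invert the grouping into a term -> category lookup.
IMPACT_BY_SO_TERM = {
    term: impact
    for impact, terms in IMPACT_CATEGORIES.items()
    for term in terms
}


def vep_impact(so_term: str):
    """Return the IMPACT category for an SO term, or None if it is not listed."""
    return IMPACT_BY_SO_TERM.get(so_term)


print(vep_impact("stop_gained"))       # a HIGH-impact term
print(vep_impact("missense_variant"))  # a MODERATE-impact term
```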

Details about predicted data in variations are available at Ensembl.

How is survival analysis calculated?

Survival analysis in the GDC uses a Kaplan‑Meier estimator:

S(t_i) = S(t_{i-1}) × (1 − d_i / n_i)

S(t_i) is the estimated survival probability at time period t_i
n_i is the number of subjects at risk at the beginning of time period t_i
d_i is the number of subjects who die during time period t_i
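As an illustration of the estimator, the sketch below computes the standard Kaplan-Meier product from the number at risk and the number of deaths in each time period, as defined above. It is not the GDC's implementation; the function name and the example numbers are made up for demonstration.

```python
def kaplan_meier(periods):
    """Return the estimated survival probability after each time period.

    `periods` is a list of (n_at_risk, deaths) tuples ordered by time,
    where n_at_risk is counted at the beginning of the period.
    """
    survival = 1.0  # everyone is alive before the first period
    curve = []
    for n_at_risk, deaths in periods:
        if n_at_risk <= 0:
            raise ValueError("number at risk must be positive")
        # Multiply by the fraction of at-risk subjects who survive this period.
        survival *= 1 - deaths / n_at_risk
        curve.append(survival)
    return curve


# Hypothetical cohort: 10 at risk with 1 death, then 8 at risk with
# 2 deaths, then 5 at risk with no deaths.
for step, s in enumerate(kaplan_meier([(10, 1), (8, 2), (5, 0)]), start=1):
    print(f"period {step}: S = {s:.3f}")
```

The number at risk can drop between periods by more than the number of deaths because censored subjects (for example, those lost to follow-up) leave the risk set without counting as events.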

Please refer to the GDC Data Portal User's Guide Projects for additional information.

Can I use the GDC Application Programming Interface (API) to retrieve data sets associated with visualizations?

Yes. The GDC provides additional analysis endpoints to retrieve data sets associated with visualizations. Analysis endpoints include: survival, top_cases_counts_by_genes, top_mutated_genes_by_project, top_mutated_cases_by_gene, top_mutated_cases_by_ssm, and mutated_cases_count_by_project.

Please refer to the GDC API User's Guide Analysis Section for additional information.

Where can I find information on the format of GDC MAF Files?

Please refer to the GDC MAF File Specification to obtain detailed information on the format of GDC MAF files.

On the GDC Project summary page or Exploration/Gene tab, why are the # of Mutations sometimes less than the # Affected Cases?

The “# Mutations” column in the Project or Exploration/Gene tab displays the number of distinct (unique) mutations within the affected cases and not necessarily the total number of all mutations within the project or query filter.

Why are the number of analyzed cases in the MAF header not equal to the number of cases displayed in the GDC Data Portal?

Within the GDC data analysis workflow, both public (somatic) MAFs and protected MAFs are generated from the same pipeline and link back to the same cases. For example, for the TCGA-GBM project, the somatic MAF has the following header:

# in TCGA.GBM.muse.7e85de23-3855-4279-a3ac-a81827e4ccb6.DR6.0.somatic.maf.gz
#version gdc-1.0.0
#filedate 20170307
#n.analyzed.samples 393

In general, n.analyzed.samples is used as the denominator to calculate mutation frequencies. If no variants for a case passed our filters, the case is still counted; however, if the case was determined to have poor quality (for example, high contamination or duplicates), it is not counted in the public MAF. In this particular project (TCGA-GBM), there were 396 cases with SNV data. Our analysis pipeline revealed that, among them, a total of 5 GBM tumor aliquots had high contamination. Among these 5 patients, 2 had another good tumor aliquot, but 3 had only one aliquot. As a result, those 3 cases were removed from the public MAF, leaving 393 analyzed samples.
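A small sketch of how the header value might be read and used as a denominator follows. It assumes each header field sits on its own line in the form `#key value`; the helper names, the example column line, and the count of 131 mutated cases are illustrative and do not come from a real file.

```python
def parse_maf_header(lines):
    """Collect leading '#key value' header lines into a dictionary.

    Parsing stops at the first line that does not start with '#'.
    """
    header = {}
    for line in lines:
        if not line.startswith("#"):
            break
        parts = line[1:].strip().split(None, 1)
        if len(parts) == 2:
            header[parts[0]] = parts[1]
    return header


def mutation_frequency(mutated_cases: int, header: dict) -> float:
    """Fraction of analyzed samples that carry a mutation."""
    return mutated_cases / int(header["n.analyzed.samples"])


example_lines = [
    "#version gdc-1.0.0",
    "#filedate 20170307",
    "#n.analyzed.samples 393",
    "Hugo_Symbol\tChromosome",  # illustrative start of the column header row
]
header = parse_maf_header(example_lines)
print(header["n.analyzed.samples"])
print(f"{mutation_frequency(131, header):.3f}")  # 131 is a made-up case count
```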

Why does the GDC display common genes such as TTN that are associated with every cancer in the most frequently mutated genes table?

The GDC does not currently normalize mutation frequency by gene length; this is under discussion. As a result, long genes such as TTN appear in the most frequently mutated genes table. Users can filter by the COSMIC Cancer Gene Census to display only genes for which mutations have been causally implicated in cancer.

In the OncoGrid, why are there fewer cases than there are cases listed as having mutations?

The cases in the OncoGrid are filtered by consequence type. Only cases with mutations of the following consequence types are displayed in the OncoGrid: {missense_variant, frameshift_variant, start_lost, stop_lost, initiator_codon_variant, stop_gained}.
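The effect of this filter on case counts can be sketched as follows. The input layout (pairs of case ID and consequence type) and the function name are hypothetical; only the set of consequence types comes from the answer above.

```python
# Consequence types that qualify a case for display, as listed above.
ONCOGRID_CONSEQUENCE_TYPES = {
    "missense_variant",
    "frameshift_variant",
    "start_lost",
    "stop_lost",
    "initiator_codon_variant",
    "stop_gained",
}


def cases_shown_in_oncogrid(mutations):
    """Return the case IDs that have at least one qualifying mutation.

    `mutations` is an iterable of (case_id, consequence_type) pairs.
    """
    return {
        case_id
        for case_id, consequence in mutations
        if consequence in ONCOGRID_CONSEQUENCE_TYPES
    }


# Made-up example: case-2 only has an intronic mutation, so it drops out.
mutations = [
    ("case-1", "missense_variant"),
    ("case-2", "intron_variant"),
    ("case-3", "stop_gained"),
    ("case-3", "synonymous_variant"),
]
print(sorted(cases_shown_in_oncogrid(mutations)))
```

Here three cases have mutations, but only two would appear in the grid, which is the kind of discrepancy the question describes.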

How do I search for a particular mutation?

To search for a mutation, use the Quick Search bar at the top right of the GDC Portal and enter either a dbSNP reference cluster ID (rs#) or the coordinates of the chromosomal change. For example, entering 'rs121912651' or 'chr17:g.7674221G>A' will bring the user to the mutation entity page for that mutation.

Why are there fewer cases in the 'Top Mutated Cancer Genes in Selected Projects' bar graph on the Project List Page, than there are affected cases listed on each project page?

Fewer cases are displayed with mutations in the 'Top Mutated Cancer Genes in Selected Projects' bar graph on the Project List Page because cases are filtered to those that have mutations 1) in genes in the Cancer Gene Census and 2) with consequence types of {missense_variant, frameshift_variant, start_lost, stop_lost, initiator_codon_variant, stop_gained}.

Why is the mutation frequency value higher (more mutations) in the OncoGrid than the # of Mutations listed on other pages for the same gene?

Mutation frequency in the context of the OncoGrid represents total mutation occurrences in the gene (total count), while the # of Mutations listed on other portions of the GDC Portal represents the number of unique mutations on a gene or within a particular cohort.
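The distinction between total occurrences and unique mutations can be made concrete with a short sketch. The input layout, the function name, and the identifiers are made up for illustration.

```python
def mutation_counts(occurrences):
    """Summarize mutation occurrences for one gene.

    `occurrences` is a list of (case_id, mutation_id) pairs, one pair
    per observed mutation in a case.
    """
    return {
        "total_occurrences": len(occurrences),
        "unique_mutations": len({mutation for _, mutation in occurrences}),
        "affected_cases": len({case for case, _ in occurrences}),
    }


# The same mutation observed in three different cases.
occurrences = [
    ("case-1", "mut-A"),
    ("case-2", "mut-A"),
    ("case-3", "mut-A"),
]
print(mutation_counts(occurrences))
```

In this example there are three occurrences across three affected cases but only one unique mutation, which is also why a count of unique mutations can be lower than the number of affected cases.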

How can I tell if a file is from a tumor or normal sample?

Selecting the files of interest in the GDC Portal and adding them to the cart will give access to the "Sample Sheet" file. This file contains many fields like "File Name", "File ID" (UUID), and "Sample Type" for each file. The "Sample Type" field will denote whether the sample is from tumor or normal tissue and the other fields can be used to locate the appropriate files.
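A downloaded Sample Sheet can be read with a few lines of Python. This sketch assumes the file is tab-delimited with a header row containing the "File ID" and "Sample Type" columns named above; the function name, the file IDs, and the sample type values in the example are illustrative.

```python
import csv
import io


def sample_types_by_file(sample_sheet_text: str) -> dict:
    """Map each "File ID" to its "Sample Type" from Sample Sheet contents.

    Assumption: the sheet is tab-delimited with a header row.
    """
    reader = csv.DictReader(io.StringIO(sample_sheet_text), delimiter="\t")
    return {row["File ID"]: row["Sample Type"] for row in reader}


# Illustrative contents; real sheets contain more columns.
example_sheet = (
    "File ID\tFile Name\tSample Type\n"
    "uuid-1\ttumor.bam\tPrimary Tumor\n"
    "uuid-2\tnormal.bam\tSolid Tissue Normal\n"
)
for file_id, sample_type in sample_types_by_file(example_sheet).items():
    print(file_id, sample_type)
```

For a file on disk, the text can be obtained with `open(path, newline="").read()` before calling the function.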

I logged into the GDC Portal yesterday, why can I not login to the GDC Portal today?

The GDC uses eRA Commons login credentials to determine access to controlled data. eRA Commons passwords expire, and an expired password prevents users from logging in. In that case, the user will need to visit the eRA Commons login site. After logging in, the user will be prompted to create a new password.

Does the GDC provide access to follow-up (i.e. longitudinal) data?

The availability of follow-up data is specific to the project and associated study.

For the Multiple Myeloma Research Foundation (MMRF) Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile (CoMMpass) study, longitudinal information was generated to track patients over the course of their disease. This data is available in the GDC by viewing the clinical follow-up data available for download on each case page in the GDC Data Portal or by querying the GDC API.

For TCGA, follow-up data is available for specific TCGA studies and made available for download in associated clinical supplement files (i.e. clinical XML, biotabs). Follow-up data can be different for the different TCGA studies.

Why does the treatment data appear to be incomplete and what treatment data is available in the GDC?

Submitting treatment data is optional, as not all projects are associated with treatment studies. For TCGA projects, for example, not all projects and cases have treatment data. For TCGA projects with treatment data, information is available in applicable clinical supplement files (i.e. clinical XML, biotabs). For other projects associated with treatment studies in which the treatment data has been submitted to the GDC, treatment data is available for download in JSON and TSV format. These studies may also contain clinical supplement files.

How do I access data from TCGA marker or other landmark cancer genomics papers?

The TCGA marker and other landmark cancer genomics papers, as well as associated supplemental files, are available on the GDC Publication Pages. The Publication Pages provide access to publication information and supplementary files.

Why is the data maintained in cBioPortal, Broad Firehose, or the Seven Bridges Cancer Genomics Cloud different from the GDC data?

The GDC harmonizes data across projects. This includes aligning the genomic data to a common reference genome (GRCh38) and generating higher level data using GDC bioinformatics pipelines. Other repositories may process the data differently.

For example, TCGA data in cBioPortal uses the original mutation data generated by the individual TCGA sequencing centers. The source of the data is the Broad Firehose (or the publication pages for data that matches a specific manuscript). These data are usually a combination of two mutation callers, but they differ by center (typically a variant caller like MuTect plus an indel caller), and sequencing centers have modified their mutation calling pipelines over time. TCGA data in the GDC is harmonized with the latest reference genome (GRCh38). Mutations are called using four variant callers: MuTect, VarScan2, MuSE, and Pindel.

What is the difference between tissue "collection" and tissue "procurement" in TCGA data?

TCGA “collection” represents the collection of the sample for TCGA, whereas “procurement” represents the removal of tissue from the patient.

Where can I find clinical data elements specific to my cancer research of interest?

The GDC supports the submission of clinical and biospecimen supplements. Supplemental files can be downloaded from the GDC by searching for the Data Type "Clinical Supplement" or "Biospecimen Supplement" from the facet search in the GDC Data Portal Repository. For TCGA data, the supplement data is provided in XML documents and tab delimited files (biotabs). These files, in varying degrees, provide information on marker status (e.g. EBV status), treatment regimen, slide magnification, histology distinctions, and staging questions.

What considerations should be taken when comparing samples with different target capture kits?

Target capture kits are used to "target" specific regions of a given genome for the Whole Exome Sequencing (WXS) and Targeted Sequencing experimental strategies. Users should therefore take care when comparing data from different target capture kits for the WXS and Targeted Sequencing experimental strategies because of potential differences in genomic regions targeted, variant filtering, and subsequent variants recovered. Additionally, users should also consider that Whole Genome Sequencing (WGS) does not use target capture kits and may thus recover variants that are excluded in the WXS and Targeted Sequencing experimental strategies.

From the GDC Data Portal "Repository" page, the names of the target capture kits can be accessed by clicking the "Adding a File Filter" link and choosing "analysis.metadata.read_groups.target_capture_kit" from the menu. The GDC does not distribute target or bait files because they are intellectual property. Users should contact each individual project directly if detailed target capture region information is needed.

What are the benefits of updating from GENCODE v22 to v36?

GENCODE gene sets are continuously updated to improve the coverage and accuracy. GENCODE 36, which was released in October of 2020, includes many updates to definitions of genes, transcripts, long non-coding RNAs, and other types of annotations. The previous version used by the GDC (GENCODE 22) was released in March 2015. Both versions were built on Ensembl genome assembly GRCh38.

What data types were updated in DR 32 (GENCODE Update Release)?

Replaced all RNA-Seq data including: Alignments, Gene Expression (STAR) + New Normalization, Transcript Fusion
Removed HTSeq Files
Re-harmonized TCGA data to use the newer pipeline

Generated and versioned new annotated somatic mutations and Ensemble MAFs
Re-harmonized TCGA data to use the newer pipeline (alignments + mutation calls)

Generated and versioned structural variant and gene level copy number data

Re-harmonized TCGA methylation data to use the new SeSAMe pipeline

Generated and versioned CPTAC-3 scRNA-Seq data

Replaced gene level copy number files for TCGA with those harmonized using ASCAT
Replaced somatic mutation files for FM-AD and transitioned to aliquot-level MAFs
Replaced all GENIE files

Why are there fewer open access TCGA mutations in DR 32 (GENCODE Update Release)?

The reduction in open-access mutations results primarily from two strategies that improve quality: 1) TCGA now uses a 2-caller ensemble instead of a single caller; 2) variants outside of the target capture region are removed, instead of those outside a combined “target capture + GAF exonic region”. Additionally, TCGA was the original project in which GDC open-access variants were produced and used variant rescue steps that only applied to TCGA. To keep the variant-calling pipeline consistent across projects, the GDC is no longer rescuing MC3 and TCGA validation variants.

Were any COSMIC genes removed in DR 32 (GENCODE Update Release)?

No. No COSMIC genes present in the prior release were removed in DR 32. For this reason, the GDC considers the current release to be a higher-quality set of variants.

Can GENCODE v22 data still be downloaded from the GDC Data Portal?

Although GENCODE v22 data cannot be browsed in the GDC Data Portal, it can still be downloaded using the GDC Data Transfer Tool or API. You will need to either have a previous manifest or use known UUIDs to download v22 files.

Why are certain aliquots that were previously available in the Data Portal unavailable as GENCODE v36 data?

Whenever new parameters are introduced to a bioinformatics pipeline, such as a new gene model, there is a chance that the analysis could fail. A list of aliquots that do not currently appear in the v36 data can be found in the Data Release Notes.

Why does the GDC use unstranded TPM when the library is stranded?

The GDC RNA-Seq workflow generates STAR counts in three different modes: unstranded, stranded_first, and stranded_second. The GDC then uses the unstranded counts as the main input for the subsequent FPKM and TPM normalizations, to facilitate comparisons across projects with different strandedness.

How can I know if the RNA-Seq data is stranded or unstranded?

If you are interested in using the stranded data in the STAR Count gene expression output, you can make an educated guess by comparing N_ambiguous across the three count modes: if one stranded mode has a much lower N_ambiguous than both the other stranded mode and the unstranded count, this is a good indicator that a stranded library was used. Please note that a library prepared with a strand-specific RNA-Seq kit is not guaranteed to be stranded. In addition, data of different strandedness cannot be compared to each other directly.
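The comparison described above can be sketched as a small heuristic. The function name `guess_strandedness` and the `ratio` threshold (what counts as "much lower") are assumptions for illustration, and the counts are invented.

```python
def guess_strandedness(n_ambiguous, ratio=0.5):
    """Guess library strandedness from N_ambiguous counts per counting mode.

    `ratio` is an assumed cutoff: the lowest stranded mode must be below
    ratio * each of the other two modes to be reported.
    """
    modes = ("unstranded", "stranded_first", "stranded_second")
    lowest = min(("stranded_first", "stranded_second"), key=lambda m: n_ambiguous[m])
    others = [n_ambiguous[m] for m in modes if m != lowest]
    if all(n_ambiguous[lowest] < ratio * other for other in others):
        return lowest
    return "unstranded or undetermined"

# Hypothetical N_ambiguous values taken from one STAR counts file.
sample = {"unstranded": 900000, "stranded_first": 850000, "stranded_second": 120000}
guess = guess_strandedness(sample)
```

A result of "unstranded or undetermined" only means the heuristic found no clear signal; it is a guess, not a definitive call.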

Does the GDC correct for batch effects across samples?

The GDC does not perform batch effect corrections across samples for the following reasons:

The GDC accepts new data on an ongoing basis and has continuous data processing and frequent release plans. It is operationally difficult to perform cross-GDC batch effect corrections in every release.
Many of the batch effect correction processes require manual considerations and project-specific knowledge.
Automatic batch effect correction might remove real, biologically meaningful signals, especially when batch effects are confounded with real biological effects.

As such, the GDC prefers that users perform their own batch effect removal.

Does the GDC use common exomes across all whole exome platforms?

Different sequencing centers use a variety of target capture kits. Most whole exome capture kits share many common genomic regions, especially for cancer-related genes; however, which exons are included depends entirely on the vendor's library preparation kit. Capture regions often differ even more among Targeted Sequencing/Panel data sets.

The name of the capture kit used is available from the GDC read group properties. However, the GDC does not distribute the BED files for the read groups associated with these capture kits because some of them are proprietary.

How often does the GDC update the workflow/reference genome? If the GDC updates the workflow/reference genome, does the GDC re-process all data sets?

For the reference genome, the GDC has been using an augmented version of GRCh38.p2 (with additional decoy sequences and virus sequences) since inception. The GDC does not use alternative contigs and derives high-level data only from the major chromosomes, so the same reference genome is used for both gene models: GENCODE v22 (Data Releases 1 to 31) and GENCODE v36 (from Data Release 32 onward). As future versions of the reference genome are released, e.g., GRCh39, the GDC will evaluate the benefits of updating data to use the new version. If the reference genome is updated, the GDC would expect to re-process all data sets. For information on the reference genome used by the GDC, please refer to the GDC Reference Files.

For workflow updates, the GDC prefers to keep workflows stable and will update them only when necessary, for example when the reference genome or gene model changes, or when major algorithm updates in the tools could result in significant changes to the generated data. When a workflow update is needed, the GDC categorizes it as either major or minor depending on whether it significantly affects the output data. The GDC re-processes all existing data sets for major workflow updates. Examples include transitioning the RNA-Seq genomic BAM alignment workflow to a new version that generates three BAMs and STAR counts, and updating the MAF workflow to add functions to the MAF files. Minor updates mostly resolve bugs, security issues, and/or compatibility issues. For example, the GDC DNA-Seq alignment workflow has been updated several times to address quality issues in various submitted data; however, because the main alignment algorithm remains almost the same, the GDC does not need to re-process all data sets for these minor updates.

Why did the GDC remove HTSeq for gene expression quantification?

HTSeq had been the default RNA-Seq expression quantification tool since the first GDC data release. The GDC later updated the RNA-Seq alignment and quantification workflow to include STAR Count, which generates stranded counts by default in addition to the existing unstranded counts. In Data Release 32, as part of the gene model update, the GDC 1) augmented the existing STAR Count output to include FPKM and FPKM-UQ normalizations; and 2) reprocessed all the TCGA data using the latest RNA-Seq workflow with STAR Count. Because both tools use very similar counting strategies, and STAR Count has advantages in both running time and the additional stranded counts, the GDC removed the HTSeq workflow in Data Release 32.

Does the GDC provide access to germline variants?

Germline SNP calls are not available for exploration in the GDC Data Portal. Instead, alignments for germline data are available under controlled access. Users with appropriate access may use the alignments to generate germline variants.

Some somatic variant callers, such as MuTect2, also output somatic calls that carry some probability of being germline, such as those labelled "germline_risk". Please note that these calls are by no means germline variants; they are somatic calls with a borderline probability of being germline.

Why did the GDC remove SomaticSniper?

The SomaticSniper whole exome variant caller was one of the first-generation somatic mutation callers developed by the scientific community. It works best for blood cancers with high levels of tumor-in-normal contamination, but is often overly permissive for solid tumors. Since its first data release in 2016, the GDC has gradually adopted newer tools and tool versions, and has shifted the focus of somatic variant calling from any single caller to a multi-caller ensemble.

After comparing ensemble calls with and without SomaticSniper and receiving feedback from the authors of SomaticSniper, the GDC decided to remove this tool from production in Data Release 35. The GDC still maintains four other whole exome variant callers: MuSE, MuTect2, Pindel, and VarScan2.

Why do some projects with WGS structural variant data have BEDPE files and some projects do not?

Generally, any WGS data should have associated structural variant files (BEDPE), except when there are no tumor/normal matches or when variant calling has not yet been implemented.

In the GDC Data Portal, when I filter by Experimental Strategy (e.g., RNA-Seq) in the Cohort Builder Available Data filter, why is this filter not working when I navigate to Repository?

Filtering in Cohort Builder results in a set of cases. Cases typically have available data for multiple Experimental Strategies (e.g., cases with RNA-Seq data can also have DNA-Seq data). As such, when navigating to Repository, files from other Experimental Strategies are displayed in addition to the RNA-Seq files. Repository filters are file-based: when filtering Repository by Experimental Strategy (e.g., RNA-Seq), only files associated with that Experimental Strategy are displayed.
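The difference between case-based and file-based filtering can be shown with a toy data set. The records, field names, and helper functions below are hypothetical and only mimic the behavior described; they are not the Data Portal's data model.

```python
# Hypothetical file records; each file belongs to one case.
files = [
    {"file": "f1.bam", "case": "CASE-1", "strategy": "RNA-Seq"},
    {"file": "f2.bam", "case": "CASE-1", "strategy": "WXS"},
    {"file": "f3.bam", "case": "CASE-2", "strategy": "WXS"},
]

def cohort_cases(files, strategy):
    """Case-based filter: cases that have at least one file of this strategy."""
    return {f["case"] for f in files if f["strategy"] == strategy}

def repository_files(files, cases, strategy=None):
    """File-based filter: files of the cohort, optionally limited to a strategy."""
    return [
        f["file"]
        for f in files
        if f["case"] in cases and (strategy is None or f["strategy"] == strategy)
    ]

cohort = cohort_cases(files, "RNA-Seq")
all_cohort_files = repository_files(files, cohort)
rna_only_files = repository_files(files, cohort, "RNA-Seq")
```

Here the cohort contains only CASE-1, yet its Repository view lists both the RNA-Seq and WXS files until the Repository filter is applied as well.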

In the GDC Data Portal, where is the histogram of top frequently mutated genes for a cohort?

To view the most frequently mutated genes for a cohort, first build a cohort using the Cohort Builder and then select the Mutation Frequency tool in the Analysis Center.

In Cohort Builder, why are there multiple tissues or organs of origin that appear not to be associated with the primary diagnosis?

Filtering in Cohort Builder results in a set of cases. A case can have multiple diagnoses, and a tissue or organ of origin may be related to a secondary diagnosis.

Why do some genes show no expression in STAR results across all samples, even though I can see mapped reads in the raw RNA-Seq data?

STAR gene expression quantification excludes reads that are mapped to multiple different genes. This can cause some genes to appear with zero expression in the final counts, even if mapped reads are present in the raw data.

One common reason for this is gene overlap. These genes often have their exons entirely encompassed within other genes, and in such cases, STAR cannot assign reads to them because they are ambiguous. To check if a gene falls into this category, you can refer to the following lists: Stranded Counting Overlap Gene List: overlap.gene.stranded.tsv, and Strandless Counting Overlap Gene List: overlap.gene.strandless.tsv.

Why can’t I download data immediately after receiving dbGaP access to a study?

It may take up to 24 hours for the GDC to sync with dbGaP. Please allow this time after receiving dbGaP authorization before attempting to access the study’s data on the GDC.

How does the GDC choose the default transcript for each variant?

When a mutation overlaps multiple transcripts or genes, the GDC annotates all consequences in the all_effects column of the MAF file and in the CONSEQUENCE table on the Mutation Summary Page. One transcript is then selected as the default for detailed annotation and visualization where a single consequence is shown.

The default is chosen based on annotations from the Variant Effect Predictor (VEP), prioritizing the most severe consequence on the most impactful transcript biotype (See: selected annotation for the 'OneEffect'). The GDC also applies a curated transcript override file from MSKCC, which defines preferred transcripts for key genes, along with consideration for canonical and longest transcripts.

Following the GENCODE v36 update in GDC Data Release 32 (DR 32), some hotspot mutations may display a different default consequence annotation. For example, BRAF V600E is shown as BRAF V640E after DR 32 because GENCODE updated the curated BRAF transcript ENST00000288602 by adding 40 amino acids at the N-terminus. Changes like this affect the default consequence annotations for all BRAF mutations and may similarly affect other genes with updated transcript models. Although the default annotation changed, users can still find V600E listed alongside other transcript annotations in the all_effects column of MAF files or in the CONSEQUENCE table on the Mutation Summary Page.
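The sketch below illustrates the 40-residue shift and a lookup of the older notation among the other transcript annotations. The layout assumed for all_effects (semicolon-separated entries with comma-separated fields, one of which is a short protein change such as p.V600E) is an assumption, and the example string, including the placeholder transcript name TRANSCRIPT_B, is invented.

```python
import re

def shift_protein_change(change, offset):
    """Shift the residue position of a simple substitution such as 'V600E'."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z*])", change)
    if match is None:
        raise ValueError(f"unsupported protein change: {change!r}")
    ref, pos, alt = match.groups()
    return f"{ref}{int(pos) + offset}{alt}"

def find_in_all_effects(all_effects, change):
    """Return entries whose fields include the given protein change.

    Assumes ';' separates entries and ',' separates fields (an assumption).
    """
    return [e for e in all_effects.split(";") if f"p.{change}" in e.split(",")]

# 40 amino acids were added at the N-terminus of ENST00000288602.
new_default = shift_protein_change("V600E", 40)

# Invented all_effects value; TRANSCRIPT_B is a placeholder, not a real ID.
effects = (
    "BRAF,missense_variant,p.V640E,ENST00000288602;"
    "BRAF,missense_variant,p.V600E,TRANSCRIPT_B"
)
matches = find_in_all_effects(effects, "V600E")
```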

How are the five categories of copy number changes determined?

The GDC begins with integer-level estimates of absolute copy number generated by either the ASCAT or ABSOLUTE pipeline. To establish a baseline, an integer-valued sample ploidy is computed as follows:

For gene-level CNV, the mode of copy number values is used across all autosomal protein-coding genes.
For segment-level CNV, a length-weighted mode of copy number values is computed across all autosomal segments.
In cases of a tie, the mode is rounded up.
Please note that the integer-valued sample ploidy used here differs from the floating-point ploidy estimates produced directly by the ASCAT or ABSOLUTE pipelines. The latter should be considered the more precise representation and is recommended for use in most other bioinformatics analyses.
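The integer sample ploidy described above can be sketched as follows. This is an illustration rather than the GDC's code: it assumes that "rounded up" means choosing the larger copy number among tied modes, that segment length is `end - start`, and that inputs are already restricted to autosomes. All function names are hypothetical.

```python
from collections import defaultdict

def mode_round_up(weights):
    """Return the copy number with the largest weight; ties go to the larger value."""
    best = max(weights.values())
    return max(cn for cn, weight in weights.items() if weight == best)

def gene_level_ploidy(gene_copy_numbers):
    """Mode of copy number across autosomal protein-coding genes."""
    weights = defaultdict(int)
    for cn in gene_copy_numbers:
        weights[cn] += 1
    return mode_round_up(weights)

def segment_level_ploidy(segments):
    """Length-weighted mode across autosomal (start, end, copy_number) segments."""
    weights = defaultdict(int)
    for start, end, cn in segments:
        weights[cn] += end - start
    return mode_round_up(weights)

# Hypothetical inputs.
gene_ploidy = gene_level_ploidy([2, 2, 3, 3, 1])  # tie between 2 and 3
segment_ploidy = segment_level_ploidy([(0, 100, 2), (100, 150, 3), (150, 400, 4)])
```

Both functions raise a ValueError on empty input, since no mode exists in that case.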

Based on this sample ploidy value, the GDC assigns copy number categories as:

Homozygous deletion: copy number = 0
Loss: 0 < copy number < sample ploidy
Neutral: copy number = sample ploidy
Gain: sample ploidy < copy number < 2 × sample ploidy
Amplification: copy number ≥ 2 × sample ploidy
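The five rules above translate directly into a small function. This is a sketch assuming non-negative integer copy numbers and a sample ploidy of at least 1; the function name `cnv_category` is hypothetical.

```python
def cnv_category(copy_number, ploidy):
    """Assign one of the five copy number categories relative to sample ploidy."""
    if copy_number == 0:
        return "Homozygous deletion"
    if copy_number < ploidy:
        return "Loss"
    if copy_number == ploidy:
        return "Neutral"
    if copy_number < 2 * ploidy:
        return "Gain"
    return "Amplification"

# Categories for copy numbers 0 through 4 in a sample with ploidy 2.
diploid_example = [cnv_category(cn, 2) for cn in range(5)]
```

Because the thresholds scale with ploidy, a copy number of 4 is an Amplification in a ploidy-2 sample but only a Gain in a ploidy-3 sample.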

Survival Analysis Formula
