The following are some helpful resources for general information about cancer:
Helpful Cancer Genomics resource:
Information on the data submission processes and tools is available on the GDC Data Submission Processes and Tools page. Detailed instructions for submitting data into the GDC are provided in the GDC Data Submission Portal User's Guide. Per GDC Policy, organizations interested in submitting data into the GDC must first apply for data submitter access through the NIH database of Genotypes and Phenotypes (dbGaP).
Please visit the GDC Data Types and File Formats for a list of the standard data types supported by the GDC.
The GDC is harmonized against GRCh38. Please see GDC Data Harmonization for additional information on the GDC pipelines for re-aligning genomic data.
The GDC generates high level data for germline and somatic genotyping, RNA-Seq quantification and structural analysis, SNP Array Genotyping and CNV Calls, and variant annotations. Please visit GDC Data Harmonization for additional information on the GDC high level data generation pipelines.
Generally, browsing indexed GDC metadata (such as information about the cases and files contained in the GDC Data Portal) does not require a login.
eRA Commons authentication and dbGaP authorization are required before accessing controlled data, which generally includes individually identifiable information such as low level genomic sequencing data and germline variants.
Controlled-access data users log in to the GDC using their eRA Commons accounts. The GDC then verifies that the user has authorization in dbGaP to access specific controlled datasets.
See Obtaining Access to GDC Data and Resources for more information on data download, and Obtaining Access to Submit Data for information on data submission.
The GDC provides helpdesk support for data submission and other issues. For information on the GDC helpdesk, please visit GDC Support.
Once the project has been registered through dbGaP, please contact the GDC Helpdesk for assistance with setting up a new project.
The GDC Data Transfer Tool is recommended for transferring large datasets to or from the GDC. For additional details, please visit the GDC Data Transfer Tool User’s Guide.
The GDC Data Transfer Tool does not offer a setting to limit the bandwidth it uses.
The GDC Data Transfer Tool uses sequential read/write for each file segment that is being transferred. By default, the tool executes multipart transfers, which results in multiple parallel, sequential read or write operations. To turn off multipart transfers, users can set the number of processes to 1.
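The multipart behavior described above can be illustrated with a minimal sketch. It is not the GDC Data Transfer Tool's implementation: the function names, the segment size, and the use of threads rather than processes are all assumptions made for illustration. Each worker reads one segment sequentially, and setting the worker count to 1 reduces the transfer to a single sequential pass.

```python
from concurrent.futures import ThreadPoolExecutor


def segment_ranges(file_size, segment_size):
    """Split [0, file_size) into consecutive (start, end) byte ranges."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [
        (start, min(start + segment_size, file_size))
        for start in range(0, file_size, segment_size)
    ]


def read_segment(path, start, end):
    """Sequentially read one segment of the file."""
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start)


def read_multipart(path, file_size, segment_size, n_workers):
    """Read all segments with up to n_workers parallel readers (hypothetical helper).

    With n_workers=1 the segments are read one after another.
    """
    ranges = segment_ranges(file_size, segment_size)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input ranges.
        parts = pool.map(lambda r: read_segment(path, r[0], r[1]), ranges)
        return b"".join(parts)
```

With several workers, the storage device sees multiple sequential streams at once, which is why reducing the number of processes to 1 can help on disks that handle concurrent access poorly.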
GDC authentication tokens remain valid for 30 days.
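As a rough illustration, a script that stores a token could track when it needs to be replaced. The sketch below assumes the 30-day period is counted from the moment the token was generated; that assumption and the helper names are not taken from GDC documentation.

```python
from datetime import datetime, timedelta, timezone

# Validity period stated above; assumed to start when the token is generated.
TOKEN_LIFETIME = timedelta(days=30)


def token_expiry(generated_at):
    """Return the time at which a token generated at `generated_at` expires."""
    return generated_at + TOKEN_LIFETIME


def is_token_valid(generated_at, now=None):
    """Return True while the token is still inside its validity period."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now < token_expiry(generated_at)
```

For example, a token generated on 1 January would be treated as expired from 31 January onward under this assumption.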
The study and Subject IDs must be registered in dbGaP. For additional details, please visit: Obtaining Access to Submit Data.
The GDC validates genomic data (BAM files) using FastQC and Picard. For additional details, please visit: GDC Data Harmonization.
Uploaded and validated data is put in a workspace until the user formally submits the data to the GDC. This allows users to interact with the data before submitting. Once the data is submitted, the GDC will process applicable datasets (e.g. harmonize molecular data and generate high level data). After processing has been completed, the data is made publicly available according to GDC Data Sharing Policies. The data becomes accessible through GDC tools (GDC Data Portal, GDC APIs) on an open or controlled access basis according to the dbGaP authorization policies associated with the data set. For additional information, please visit: GDC Data Submission Processes and Tools.
Recurrent data releases allow the GDC to version the data and allow users to reference the GDC version number in publications. The GDC currently generates releases as needed, with a release every 2-3 months as a goal.
The GDC employs a hierarchical data model which requires metadata and files to be attached only at particular nodes or points in the hierarchy. If you have questions, please review the GDC Data Model or contact GDC Support.
There has been a long-standing debate about prefixes for multiples of bytes. We have chosen to use the standard supported by the International System of Units (SI), where 1 gigabyte (GB) = 10^9 bytes and 1 megabyte (MB) = 10^6 bytes. This convention is also supported by the IEEE, EU, NIST, and the International System of Quantities. Where appropriate, we use the IEEE 1541 recommendations for binary representation, where 1024^3 bytes = 1 gibibyte (GiB) and 1024^2 bytes = 1 mebibyte (MiB).
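The difference between the two conventions is easy to see in a short sketch. The helper name `format_size` is hypothetical and is not part of any GDC tool; only the unit definitions come from the text above.

```python
# SI (decimal) units
GB = 10**9
MB = 10**6

# IEEE 1541 (binary) units
GiB = 1024**3
MiB = 1024**2


def format_size(n_bytes, binary=False):
    """Format a byte count in SI units, or in binary units if binary=True."""
    units = [("GiB", GiB), ("MiB", MiB)] if binary else [("GB", GB), ("MB", MB)]
    for name, size in units:
        if n_bytes >= size:
            return f"{n_bytes / size:.2f} {name}"
    return f"{n_bytes} bytes"


print(format_size(5 * 10**9))               # 5.00 GB
print(format_size(5 * 10**9, binary=True))  # 4.66 GiB
```

The same file therefore appears about 7% smaller when reported in GiB than in GB, which is worth keeping in mind when comparing sizes reported by different tools.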
The following web browsers are supported for use with the GDC Data Portal, Submission Portal, Website, and Documentation site.
The GDC Data Submission Portal checks XML, JSON, and TSV metadata files for validity at the time they are submitted. If your files fail to validate, please check the error report and review the GDC Data Dictionary for troubleshooting these errors. Additional information on supported files and formats can be found on the GDC Data Model and File Formats pages, and in the GDC Data Submission Portal User's Guide.
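Before uploading, it can save a round trip to confirm locally that a metadata file is at least well formed. The sketch below is only a syntactic pre-check under stated assumptions: it does not reproduce the portal's validation against the GDC Data Dictionary, and the function names are hypothetical.

```python
import csv
import io
import json


def precheck_json(text):
    """Return a list of problems found when parsing `text` as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"line {exc.lineno}: {exc.msg}"]
    return []


def precheck_tsv(text):
    """Return a list of rows whose column count differs from the header row."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["file is empty"]
    width = len(rows[0])
    return [
        f"row {i}: expected {width} columns, found {len(row)}"
        for i, row in enumerate(rows[1:], start=2)
        if len(row) != width
    ]
```

An empty list from either function means only that the file parses cleanly; the portal's error report remains the authority on whether the content is valid.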