Access Data

How can I access GDC sequencing data in FASTQ format?

Submitted by Anonymous on

Raw sequencing files submitted to the GDC are processed using GDC Genomic Data Alignment pipelines. The processed data are made available in the GDC Data Portal as BAM files containing aligned reads and unmapped reads (if available). No reads are hard-clipped, but reads that were flagged as "failed" during an Illumina sequencing run are discarded.
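Because the reads are distributed as BAM, one possible route back to FASTQ is to convert a downloaded file locally. The sketch below is an illustration rather than a GDC-documented procedure. It assumes samtools is installed and the data are paired-end. The file names, output naming scheme, and helper functions are hypothetical. It name-collates the BAM and then writes paired reads to two FASTQ files, discarding singletons and unpaired reads.

```python
import shlex
import subprocess


def build_bam_to_fastq(bam_path, out_prefix, threads=4):
    """Return the two samtools commands of a collate | fastq pipeline."""
    # Group mates together and stream uncompressed BAM to stdout.
    collate = ["samtools", "collate", "-u", "-O", "-@", str(threads), bam_path]
    # Read BAM from stdin ("-") and split paired reads into R1/R2 files.
    # Assumption: singletons (-s) and unpaired reads (-0) are not wanted.
    fastq = [
        "samtools", "fastq", "-@", str(threads),
        "-1", f"{out_prefix}_R1.fastq.gz",
        "-2", f"{out_prefix}_R2.fastq.gz",
        "-0", "/dev/null", "-s", "/dev/null", "-n", "-",
    ]
    return collate, fastq


def run_bam_to_fastq(bam_path, out_prefix, threads=4):
    """Run the pipeline; requires samtools on PATH."""
    collate, fastq = build_bam_to_fastq(bam_path, out_prefix, threads)
    p1 = subprocess.Popen(collate, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(fastq, stdin=p1.stdout)
    p1.stdout.close()  # let collate see a broken pipe if fastq exits early
    rc2 = p2.wait()
    rc1 = p1.wait()
    if rc1 != 0 or rc2 != 0:
        raise RuntimeError(f"samtools failed (collate={rc1}, fastq={rc2})")


# Show the equivalent shell pipeline without executing anything.
collate_cmd, fastq_cmd = build_bam_to_fastq("sample.bam", "sample")
print(shlex.join(collate_cmd), "|", shlex.join(fastq_cmd))
```

Calling `run_bam_to_fastq("sample.bam", "sample")` would execute the pipeline. Reads that the GDC discarded as "failed" cannot be recovered this way, so the output may not match the originally submitted FASTQ files exactly.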

Where can I find the target and bait/probe files (BED files) that describe the capture kit used in an exome sequencing experiment?

Capture kit information is provided by the GDC API at the read group level, where available. In some cases, additional information may be available in SRA XML files.
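As a rough illustration of a read-group-level lookup, the sketch below builds a request URL for the GDC API `files` endpoint and pulls the read group records out of a response. It assumes that read groups can be expanded through `analysis.metadata.read_groups` and that results arrive under `data.hits`. The helper names and the placeholder UUID are hypothetical, and no network request is made.

```python
import json
from urllib.parse import urlencode

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def read_group_query_url(file_uuid):
    """Build a files-endpoint URL that asks for read group metadata."""
    filters = {"op": "=", "content": {"field": "file_id", "value": file_uuid}}
    params = {
        "filters": json.dumps(filters),
        # Assumption: read groups are nested under analysis.metadata.
        "expand": "analysis.metadata.read_groups",
        "format": "JSON",
    }
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"


def extract_read_groups(response_json):
    """Collect read group records from a parsed API response."""
    groups = []
    for hit in response_json.get("data", {}).get("hits", []):
        metadata = hit.get("analysis", {}).get("metadata", {})
        groups.extend(metadata.get("read_groups", []))
    return groups


# Placeholder UUID for illustration only; substitute a real file UUID.
print(read_group_query_url("00000000-0000-0000-0000-000000000000"))
```

Fetching the printed URL with any HTTP client and passing the parsed JSON to `extract_read_groups` would give one dictionary per read group, which can then be inspected for capture kit properties where they are populated.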

The relevant read_group properties returned by the GDC API are:

How do I avoid timeouts and transfer interruptions when downloading large datasets from the GDC Data Portal?

The GDC Data Portal is a web-based application, so downloads are limited by browser and network constraints. If a timeout occurs while downloading files, please use the GDC Data Transfer Tool or contact the GDC Help Desk.
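For a scripted download, a minimal sketch follows. It assumes the Data Transfer Tool is installed as `gdc-client` and that a manifest file has been exported from the portal. The paths, retry count, wait time, and helper names are all hypothetical choices. The command is simply rerun if it exits with an error.

```python
import subprocess
import time


def build_download_command(manifest_path, out_dir, token_path=None):
    """Build a gdc-client manifest download command."""
    cmd = ["gdc-client", "download", "-m", manifest_path, "-d", out_dir]
    if token_path:
        # A token is only needed for controlled-access files.
        cmd += ["-t", token_path]
    return cmd


def download_with_retries(cmd, attempts=3, wait_seconds=30,
                          runner=subprocess.call, sleep=time.sleep):
    """Run cmd until it exits with 0; return the attempt that succeeded."""
    for attempt in range(1, attempts + 1):
        if runner(cmd) == 0:
            return attempt
        if attempt < attempts:
            sleep(wait_seconds)  # brief pause before trying again
    raise RuntimeError(f"download failed after {attempts} attempts")


# Show the command that would be run; nothing is executed here.
print(" ".join(build_download_command("gdc_manifest.txt", "downloads")))
```

Calling `download_with_retries(build_download_command("gdc_manifest.txt", "downloads"))` would start the transfer. The `runner` and `sleep` parameters exist only so the retry logic can be exercised without the tool installed.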

I only see patients with ages of 90 years or less in the GDC. Why is this?

HIPAA guidelines require that patients older than 89 years be aggregated into a single age category, which limits the ability to positively identify these individuals. In practice, this affects the values reported in several fields. We have chosen to display the age at diagnosis accurately, but fields that give dates or time periods after this benchmark may be compressed. These fields may include "Days to last follow up", "Days to last known disease status", "Days to recurrence", "Days to death", and "Year of death".
