Are unmapped reads available in the GDC Data Portal?
Harmonized BAM files from RNA-seq and DNA-seq experiments will contain both mapped and unmapped reads, if available. Unmapped reads are not distributed separately.
Capture kit information is provided by the GDC API at the read group level, where available. In some cases, additional information may be available in SRA XML files.
The relevant read_group properties returned by the GDC API are:
The GDC Data Portal is a web-based application that is limited by browser and network constraints. If a system timeout occurs when downloading files, please use the GDC Data Transfer Tool or contact the GDC Help Desk.
The GDC provides access to both open and controlled datasets. To access controlled datasets, users must obtain appropriate authorization through dbGaP. See Obtaining Access to Controlled Data for instructions on applying for access through dbGaP.
The following web browsers are supported for use with the GDC Data Portal, Submission Portal, Website, and Documentation site.
HIPAA guidelines require that patients with ages greater than 89 years be aggregated into a single age category. This is to limit the ability to positively identify these individuals. In practice this will impact the values reported in several fields. We have chosen to accurately display the age at diagnosis, but fields that give dates or time periods after this benchmark may be compressed. This may include such fields as "Days to last follow up", "Days to last known disease status", "Days to recurrence", "Days to death", and "Year of death".
The GDC processes data through several harmonization pipelines. If the process of harmonization reveals issues in the underlying data or if an error occurred during harmonization, the harmonized data files (e.g. BAMs or VCFs) will not appear in GDC data access tools.
There has been long-standing debate about prefixes for multiples of bytes. We have chosen to utilize the standard supported by the International System of Units (SI), where 1 gigabyte (GB) = 10^9 bytes and 1 megabyte (MB) = 10^6 bytes. This convention is also supported by the IEEE, EU, NIST, and the International System of Quantities. Where appropriate, we utilize the IEEE 1541 recommendations for binary representation, where 1024^3 bytes = 1 gibibyte (GiB) and 1024^2 bytes = 1 mebibyte (MiB).
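The difference between the two conventions can be illustrated with a short Python sketch. The function name convert_bytes, the unit tables, and the 5 GB example size are illustrative assumptions only and are not part of any GDC tool or API.

```python
# SI (decimal) and IEEE 1541 (binary) unit sizes, in bytes.
SI_UNITS = {"kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
BINARY_UNITS = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def convert_bytes(num_bytes, unit):
    """Express a byte count in the given SI or binary unit (hypothetical helper)."""
    factors = {**SI_UNITS, **BINARY_UNITS}
    if unit not in factors:
        raise ValueError(f"unknown unit: {unit}")
    return num_bytes / factors[unit]


# A hypothetical file of exactly 5 GB (SI convention).
size = 5 * 10**9
print(convert_bytes(size, "GB"))             # 5.0
print(round(convert_bytes(size, "GiB"), 3))  # 4.657
```

The same file therefore appears as 5 GB under the SI convention but as roughly 4.66 GiB in tools that report binary units, which may explain apparent size mismatches between systems.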
Recurrent data releases allow the GDC to version the data and allow users to reference the GDC version number in publications. The GDC currently generates releases as needed, with a goal of one release every 2-3 months.
The GDC validates genomic data (BAM files) using FastQC and Picard. For additional details, please visit: GDC Data Harmonization.