How can I access GDC sequencing data in FASTQ format?

Submitted by Anonymous on

Raw sequencing files submitted to the GDC are processed using GDC Genomic Data Alignment pipelines. The processed data are made available in the GDC Data Portal as BAM files containing aligned reads and unmapped reads (if available). No reads are hard-clipped, but reads that were flagged as "failed" during an Illumina sequencing run are discarded.

Third-party tools such as biobambam2 or samtools fastq can convert these BAM files to FASTQ. Note that DNA-Seq base quality scores are modified during the score recalibration step of co-cleaning, so the conversion tool must be told to restore the original scores (biobambam2: tryoq=1; samtools fastq: -O). Because GDC harmonized BAM files may contain multiple read groups, the conversion should also keep each read group separate so that read group IDs are retained in the generated FASTQ files (biobambam2: outputperreadgroup=1; samtools: split the BAM by read group with samtools split before running samtools fastq).
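The options above can be assembled into command lines roughly as follows. This is a minimal sketch, not a GDC-provided script: the helper function names, file names, and output layout are hypothetical, and the output-file flags for samtools fastq (-1, -2, -0, -s) and the outputdir= key for biobambam2's bamtofastq are choices made for illustration. Paired-end BAMs generally also need to be name-collated (for example with samtools collate) before samtools fastq will pair mates correctly; that step is left out here.

```python
import shlex


def split_by_read_group_cmd(bam, out_dir):
    """Build a `samtools split` command that writes one BAM per read group.

    In the -f format string, %! expands to the read group ID and %. to the
    output file extension. The output directory layout is an assumption.
    """
    return ["samtools", "split", "-f", f"{out_dir}/%!.%.", str(bam)]


def samtools_fastq_cmd(rg_bam, out_prefix):
    """Build a `samtools fastq` command for a single-read-group BAM.

    -O prefers the original quality scores (OQ tag) when present.
    The output file names are hypothetical.
    """
    return [
        "samtools", "fastq", "-O",
        "-1", f"{out_prefix}_1.fq.gz",          # first mates
        "-2", f"{out_prefix}_2.fq.gz",          # second mates
        "-0", f"{out_prefix}_other.fq.gz",      # reads flagged as both or neither mate
        "-s", f"{out_prefix}_singleton.fq.gz",  # reads whose mate is missing
        str(rg_bam),
    ]


def biobambam2_cmd(bam, out_dir):
    """Build a biobambam2 `bamtofastq` command.

    tryoq=1 restores the original quality scores and outputperreadgroup=1
    writes separate FASTQ files per read group.
    """
    return [
        "bamtofastq",
        f"filename={bam}",
        "tryoq=1",
        "outputperreadgroup=1",
        f"outputdir={out_dir}",
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in (
        split_by_read_group_cmd("sample.bam", "rg_bams"),
        samtools_fastq_cmd("rg_bams/RG1.bam", "fastq/RG1"),
        biobambam2_cmd("sample.bam", "fastq"),
    ):
        print(shlex.join(cmd))
```

Each function returns an argument list that can be passed to subprocess.run(cmd, check=True) once the paths have been adjusted. Only one of the two routes is needed: either samtools split followed by samtools fastq on each per-read-group BAM, or a single bamtofastq call.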
