GENCODE Update

How often does the GDC update the workflow/reference genome? If the GDC updates the workflow/reference genome, does the GDC re-process all data sets?

For the reference genome, the GDC has used an augmented version of GRCh38.p2 (with additional decoy and virus sequences) since inception. The GDC does not use alternative contigs and derives high-level data only from the major chromosomes, so the same reference genome is used with both gene models: GENCODE v22 (Data Releases 1 to 31) and GENCODE v36 (Data Release 32 onward). As future versions of the reference genome are released, e.g., GRCh39, the GDC will evaluate the benefits of updating data to use the new version.

How can I know if the RNA-Seq data is stranded or unstranded?

If you want to use the stranded counts in the STAR Count gene expression output, you can estimate strandedness by comparing the N_ambiguous values: if one stranded type has a much lower N_ambiguous count than both the other stranded type and the unstranded count, that is a good indicator that a stranded library was used. Please note that a library prepared with a strand-specific RNA-Seq kit is not guaranteed to be stranded. In addition, data of different strandedness cannot be compared to each other directly.
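
The N_ambiguous comparison could be sketched in Python roughly as follows. This is only an illustration, not a GDC tool: it assumes a tab-separated counts file with a `gene_id` column and count columns named `unstranded`, `stranded_first`, and `stranded_second`, and the `ratio` cutoff for "much lower" is an arbitrary placeholder that you would need to choose yourself. The numbers in the example are made up.

```python
import csv
import io

# Assumed column names; check them against the header of your own file.
COUNT_COLUMNS = ("unstranded", "stranded_first", "stranded_second")


def read_n_ambiguous(handle):
    """Return {column: N_ambiguous count} from a STAR counts TSV."""
    # Skip comment lines (assumed to start with '#') before the header row.
    lines = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["gene_id"] == "N_ambiguous":
            return {col: int(row[col]) for col in COUNT_COLUMNS}
    raise ValueError("no N_ambiguous row found")


def guess_strandedness(n_ambiguous, ratio=0.5):
    """Guess which stranded column, if any, matches the library.

    A stranded column is picked only if its N_ambiguous is below
    `ratio` times both the other stranded column and the unstranded
    column. `ratio` is a hypothetical cutoff, not a GDC recommendation.
    """
    unstranded = n_ambiguous["unstranded"]
    pairs = (("stranded_first", "stranded_second"),
             ("stranded_second", "stranded_first"))
    for name, other in pairs:
        value = n_ambiguous[name]
        if value < ratio * n_ambiguous[other] and value < ratio * unstranded:
            return name
    return "unstranded_or_unclear"


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    example = (
        "# comment line\n"
        "gene_id\tgene_name\tgene_type\tunstranded\tstranded_first\tstranded_second\n"
        "N_noFeature\t\t\t500\t4000\t600\n"
        "N_ambiguous\t\t\t1000\t900\t100\n"
    )
    counts = read_n_ambiguous(io.StringIO(example))
    print(counts, guess_strandedness(counts))
```

The result is only a guess, as noted above, and should be treated as a hint for further inspection.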

Why are certain aliquots that were previously available in the Data Portal unavailable as GENCODE v36 data?

Whenever new parameters are introduced to a bioinformatics pipeline, such as a new gene model, there is a chance that the analysis could fail. A list of aliquots that do not currently appear in the v36 data can be found in the Data Release Notes.

Why are there fewer open access TCGA mutations in DR 32 (GENCODE Update Release)?

There are fewer open-access mutations primarily because of two strategies that improve quality: 1) TCGA now uses a 2-caller ensemble instead of a single caller; 2) variants outside of the target capture region are removed, where previously the combined “target capture + GAF exonic region” was used. Additionally, TCGA was the original project for which GDC open-access variants were produced, and it used variant rescue steps that applied only to TCGA. To keep the variant-calling pipeline consistent across projects, the GDC no longer rescues MC3 and TCGA validation variants.
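
As a rough illustration of why both strategies reduce the variant count, the sketch below filters a list of variants in Python. It is not the GDC pipeline: it assumes that "2-caller ensemble" means a variant must be reported by at least two callers, and the `Variant` class, caller names, and coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int             # 1-based position
    callers: frozenset   # names of the callers that reported the variant


def in_target(variant, target_regions):
    """True if the variant lies in any (chrom, start, end) region.

    Regions are treated as 1-based and inclusive (an assumption).
    """
    return any(
        chrom == variant.chrom and start <= variant.pos <= end
        for chrom, start, end in target_regions
    )


def keep_open_access(variant, target_regions, min_callers=2):
    """Keep a variant only if enough callers agree and it is on target."""
    enough_callers = len(variant.callers) >= min_callers
    return enough_callers and in_target(variant, target_regions)


if __name__ == "__main__":
    # Hypothetical capture region and variants.
    targets = [("chr1", 100, 200)]
    variants = [
        Variant("chr1", 150, frozenset({"caller_a", "caller_b"})),  # kept
        Variant("chr1", 150, frozenset({"caller_a"})),              # one caller only
        Variant("chr1", 250, frozenset({"caller_a", "caller_b"})),  # off target
    ]
    kept = [v for v in variants if keep_open_access(v, targets)]
    print(len(kept))
```

A variant that a single caller reports, or that falls in an exonic region outside the capture target, would have passed the older criteria described above but is dropped here.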

What data types were updated in DR 32 (GENCODE Update Release)?

    RNA-Seq

  • Replaced all RNA-Seq data including: Alignments, Gene Expression (STAR) + New Normalization, Transcript Fusion
  • Removed HTSeq Files
  • Re-harmonized TCGA data to use the newer pipeline
    WXS/Targeted Sequencing

  • Generated and versioned new annotated somatic mutations and Ensemble MAFs
  • Re-harmonized TCGA data to use the newer pipeline (alignments + mutation calls)
    WGS

  • Generated and versioned structural variant and gene level copy number data
    Methylation

What are the benefits of updating from GENCODE v22 to v36?

GENCODE gene sets are continuously updated to improve coverage and accuracy. GENCODE 36, which was released in October 2020, includes many updates to the definitions of genes, transcripts, long non-coding RNAs, and other types of annotations. The previous version used by the GDC (GENCODE 22) was released in March 2015. Both versions were built on Ensembl genome assembly GRCh38.