TCGA Cancer Subtype Assignment Of Patient Samples Using Compact Feature Sets

Citation TBD

Molecular subtypes have previously been defined for cancer cohorts within The Cancer Genome Atlas (TCGA), using different classification approaches and genomic platform technologies. Here we address how a newly diagnosed tumor might be efficiently profiled using a limited set of molecular markers, and its subtype membership identified relative to previously characterized TCGA cancers. We describe results using five different machine-learning approaches applied to multi-omic data for 8,791 TCGA cancers comprising 26 different types and 106 subtypes, to derive classification models using parsimonious gene feature sets that support the prediction of a given sample’s molecular subtype. We compare the predictive accuracy of the diverse approaches and compare their associated features, revealing insights into how different single or multi-omic data platforms can be used to predict cancer molecular subtypes. We found that 70 samples are sufficient to derive an estimate of classification accuracy for a prospectively accruing cancer cohort.

Data in the GDC

Supplemental Data

Instructions for Data Download

Open Access Data

  1. Download the appropriate manifest file from the publication page
  2. Use the manifest file to download data using the GDC Data Transfer Tool (DTT) or the GDC API
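
As an illustration of the API route in step 2, the following is a minimal Python sketch, not an official GDC client. It assumes the manifest is the usual tab-separated file with `id` and `filename` columns and that files are fetched from the public `https://api.gdc.cancer.gov/data/<id>` endpoint. The function names (`read_manifest`, `data_url`, `download_open_access`) are hypothetical and chosen only for this example.

```python
import csv
import io
import os
import urllib.request

# Assumption: the public GDC API data endpoint.
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def read_manifest(text):
    """Parse tab-separated manifest text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def data_url(file_id):
    """Build the download URL for a single file identifier."""
    return f"{GDC_DATA_ENDPOINT}/{file_id}"


def download_open_access(manifest_path, out_dir="."):
    """Download every file listed in the manifest into out_dir (needs network)."""
    with open(manifest_path, newline="") as fh:
        rows = read_manifest(fh.read())
    for row in rows:
        target = os.path.join(out_dir, row["filename"])
        urllib.request.urlretrieve(data_url(row["id"]), target)
```

With the GDC Data Transfer Tool, the equivalent step is typically `gdc-client download -m <manifest file>`.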

Controlled Access Data

  1. Download the appropriate manifest file from the publication page
  2. Download a token from the GDC Data Portal
  3. Use the manifest file and token to download data using the GDC DTT or the GDC API
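
For controlled access data, step 3 differs from the open access case only in that each request must carry the token. The sketch below shows one way this could look in Python, assuming the token has been saved to a local text file and is sent in the `X-Auth-Token` request header to the `https://api.gdc.cancer.gov/data/<id>` endpoint. The helper names (`load_token`, `build_authenticated_request`, `download_controlled`) are hypothetical.

```python
import shutil
import urllib.request

# Assumption: the public GDC API data endpoint.
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def load_token(path):
    """Read the token downloaded from the GDC Data Portal, without trailing whitespace."""
    with open(path) as fh:
        return fh.read().strip()


def build_authenticated_request(file_id, token):
    """Create a request for one file with the token attached as a header."""
    return urllib.request.Request(
        f"{GDC_DATA_ENDPOINT}/{file_id}",
        headers={"X-Auth-Token": token},
    )


def download_controlled(file_id, token, target_path):
    """Stream one controlled access file to target_path (needs network and a valid token)."""
    request = build_authenticated_request(file_id, token)
    with urllib.request.urlopen(request) as response, open(target_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

With the GDC Data Transfer Tool, the token file is typically passed alongside the manifest: `gdc-client download -m <manifest file> -t <token file>`. Tokens grant access to protected data, so keep the token file out of shared directories and version control.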

For assistance, please contact the GDC Help Desk: support@nci-gdc.datacommons.io.