Classification of non-TCGA Cancer Samples to TCGA Molecular Subtypes Using Compact Feature Sets

Cancer Cell. Volume 43, Issue 2, p. 195-212.e11, 10 February 2025. DOI: 10.1016/j.ccell.2024.12.002

Molecular subtypes have previously been defined for cancer cohorts within The Cancer Genome Atlas (TCGA), using different classification approaches and genomic platform technologies. Here we address how a newly diagnosed tumor might be efficiently profiled using a limited set of molecular markers, and its subtype membership identified relative to previously characterized TCGA cancers. We describe results using five different machine-learning approaches applied to multi-omic data for 8,791 TCGA cancers comprising 26 different types and 106 subtypes, to derive classification models using parsimonious gene feature sets that support the prediction of a given sample’s molecular subtype. We compare the predictive accuracy of the diverse approaches and compare their associated features, revealing insights into how different single or multi-omic data platforms can be used to predict cancer molecular subtypes. We found that 70 samples are sufficient to derive an estimate of classification accuracy for a prospectively accruing cancer cohort.

Data in the GDC

GDC Manifests
- Open-Access Data - Download Manifest (42 Files)

Supplemental Data

Associated Data Files
- tmp_model_output_files_format.md - sample prediction and feature list reporting file format
- tmp_prediction_file_to_performance_metrics.py - model scoring script
- big_results_matrix.tsv.gz - model prediction performance metrics
- collected_features_matrix.tsv.gz - feature lists as model-by-feature matrix
- feature_importance.tsv.gz - feature importance scores
- feature_lists.tsv.gz - feature lists
- collected_features_matrix_top_models_lte_100.tsv.gz - model-by-feature matrix for top models per method/cohort combination
- collected_genes_matrix_top_models_lte_100_exclude_CNVR_cohort_level.tsv.gz - model-by-gene matrix for top model per cohort (excluding genes from CNVR features)
- collected_genes_matrix_top_models_lte_100_exclude_CNVR.tsv.gz - model-by-gene matrix for top models per method/cohort combination (excluding genes from CNVR features)
- top_performing_models_lte_100_features.tsv.gz - model prediction performance metrics for top models per method/cohort combination
- very_top_performing_models.tsv.gz - model prediction performance metrics for top model per cohort
- pathway_commons_v12_main_component_20211122.RData - pathway for generating pruned version (line below)
- pathway_commons_v12_pruned_for_landscape_analysis_20211122.RData - pathway for figure 6
- METABRIC_PAM50_silhouettes_no_Normal-Like_2022-03-15.tsv - Metabric subtype silhouette scores, for Figure 3D
- aklimate_predict_metabric_brca.tsv - AKLIMATE subtype predictions on rescaled Metabric samples, for Figures 3C and D
- metabric_adaboost.tsv - SKGrid subtype predictions on rescaled Metabric samples, for Figure 3C
- subscope.tar.gz - SubSCOPE Docker Image
- aklimate.tar.gz - AKLIMATE Docker Image (file)
- jadbio.tar.gz - JADBio Docker Image (file)
- models_jadbio.tar.gz - JADBio model data associated with Docker Image
- sk_grid.tar.gz - SK Grid Docker Image (file)
- cloudforest.tar.gz - CloudForest Docker Image (file)
- models_cf.tar.gz - CloudForest model data associated with Docker image
- aklimate_feature_importance_scores_20200807.tar.gz - AKLIMATE importance scores for Figure 5
- 20220425_TMP_DNA_methylation_features_analysis_COAD.tsv - Figure 5 input data; COADREAD methylation analysis
- 20220425_TMP_DNA_methylation_features_analysis_LGGGBM.tsv - Figure 5 input data; LGGGBM methylation analysis
- brca_pam50_hits.tsv - Figure 5 input data; BRCA PAM50 membership
- jadbio_ft_importances_f1.tar.gz - Figure 5 input data; JADBio feature importances
- modelID_performance2importance.json - Figure 5 input data; top models mapping of classifier name to feature set name
- modelID_performance2importance_ALLCOHORTS.json - Figure 5 input data; all models mapping of classifier name to feature set name
- TMP_sub-sampling_experiment.tgz - Sub-sampling experiment raw results, 100 predictions at each sampling size for each cohort
- skgrid_results_20210708.tar.gz - Result files from SciKit Grid
- aklimate_predictions_and_features_20200630.tar.gz - Result files from AKLIMATE
- subSCOPE_results.tar.gz - Result files from subSCOPE
- gnosis-results.tar.gz - Result files from the Gnosis platform
- Cloud_Forest_v12_sample_predictions.zip - Result files from CloudForest
- TMP_20230209.tar.gz - Source data matrix
- ft_name_convert.tar.gz - User renaming of features to TMP nomenclature
- very_top_performing_models.tsv.gz - Model prediction performance metrics for top model per cohort
- tmp_prediction_file_to_performance_metrics.py - Model scoring script
- model_info.json - Top model info (name, parameters, ft list) - Docker

Instructions for Data Download

Open Access Data

Download the appropriate manifest file from the publication page
Use the manifest file to download data using the GDC Data Transfer Tool (DTT) or the GDC API
- GDC DTT (Download, User’s Guide)
- GDC API (User’s Guide)
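For the API route, the open-access steps above can be sketched roughly as follows. This is a minimal illustration, not an official client: it assumes the manifest is a tab-separated file with a header row that includes `id` and `filename` columns, and that each file is fetched from the public `https://api.gdc.cancer.gov/data/<id>` endpoint. The helper names and the default output directory are hypothetical.

```python
import csv
import io
import shutil
import urllib.request
from pathlib import Path

# Public GDC API data endpoint; a single file is addressed by its UUID.
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def read_manifest(text):
    """Parse manifest text (tab-separated, with a header row) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def data_url(file_id):
    """Build the download URL for one file UUID."""
    return f"{GDC_DATA_ENDPOINT}/{file_id}"


def download_open_access(manifest_path, out_dir="gdc_downloads"):
    """Download every file listed in the manifest; return the saved paths.

    Assumes the manifest has `id` and `filename` columns (an assumption
    of this sketch, to be checked against the actual manifest).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for row in read_manifest(Path(manifest_path).read_text()):
        target = out / row["filename"]
        # Stream the response to disk rather than holding it in memory.
        with urllib.request.urlopen(data_url(row["id"])) as resp, open(target, "wb") as fh:
            shutil.copyfileobj(resp, fh)
        saved.append(target)
    return saved
```

Calling `download_open_access("gdc_manifest.txt")` (the file name is a placeholder) would fetch the listed files one at a time. For large archives such as the Docker image tarballs, the GDC DTT is likely the more robust choice, since this sketch has no retry or resume logic.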

Controlled Access Data

Download the appropriate manifest file from the publication page
Download a token from the GDC Data Portal
- GDC Data Portal (Launch, User’s Guide)
Use the manifest file and token to download data using the GDC DTT or the GDC API
- GDC DTT (Download, User’s Guide)
- GDC API (User’s Guide)
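For controlled-access files fetched through the API, the token is sent with each request. The sketch below assumes the token is passed in the `X-Auth-Token` request header and that the manifest supplies an MD5 checksum for each file. The function names and the token file path are hypothetical, and the checksum step is an optional integrity check added for illustration.

```python
import hashlib
import urllib.request
from pathlib import Path

# Public GDC API data endpoint; a single file is addressed by its UUID.
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def build_authorized_request(file_id, token):
    """Create a request for one file UUID that carries the access token."""
    return urllib.request.Request(
        f"{GDC_DATA_ENDPOINT}/{file_id}",
        headers={"X-Auth-Token": token.strip()},
    )


def md5_matches(data, expected_md5):
    """Return True if the MD5 digest of `data` equals the expected hex string."""
    return hashlib.md5(data).hexdigest() == expected_md5.strip().lower()


def fetch_controlled(file_id, token_path, expected_md5=None):
    """Download one controlled-access file into memory and optionally verify it.

    `token_path` points at the token file saved from the GDC Data Portal
    (the path itself is chosen by the user).
    """
    token = Path(token_path).read_text()
    with urllib.request.urlopen(build_authorized_request(file_id, token)) as resp:
        data = resp.read()
    if expected_md5 is not None and not md5_matches(data, expected_md5):
        raise ValueError(f"MD5 mismatch for {file_id}")
    return data
```

Reading the whole response into memory keeps the example short; for large files, streaming to disk and hashing in chunks would be the safer design.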

For assistance, please contact the GDC Help Desk: support@nci-gdc.datacommons.io.
