Citation TBD
Molecular subtypes have previously been defined for cancer cohorts within The Cancer Genome Atlas (TCGA), using different classification approaches and genomic platform technologies. Here we address how a newly diagnosed tumor might be efficiently profiled using a limited set of molecular markers, and its subtype membership identified relative to previously characterized TCGA cancers. We describe results using five different machine-learning approaches applied to multi-omic data for 8,791 TCGA cancers comprising 26 different types and 106 subtypes, to derive classification models using parsimonious gene feature sets that support the prediction of a given sample’s molecular subtype. We compare the predictive accuracy of the diverse approaches and compare their associated features, revealing insights into how different single or multi-omic data platforms can be used to predict cancer molecular subtypes. We found that 70 samples are sufficient to derive an estimate of classification accuracy for a prospectively accruing cancer cohort.
Data in the GDC
- GDC Manifests
- Open-Access Data - Download Manifest (42 Files)
Supplemental Data
- Associated Data Files
- tmp_model_output_files_format.md - sample prediction and feature list reporting file format
- tmp_prediction_file_to_performance_metrics.py - model scoring script
- big_results_matrix.tsv.gz - model prediction performance metrics
- collected_features_matrix.tsv.gz - feature lists as model-by-feature matrix
- feature_importance.tsv.gz - feature importance scores
- feature_lists.tsv.gz - feature lists
- collected_features_matrix_top_models_lte_100.tsv.gz - model-by-feature matrix for top models per method/cohort combination
- collected_genes_matrix_top_models_lte_100_exclude_CNVR_cohort_level.tsv.gz - model-by-gene matrix for top model per cohort (excluding genes from CNVR features)
- collected_genes_matrix_top_models_lte_100_exclude_CNVR.tsv.gz - model-by-gene matrix for top models per method/cohort combination (excluding genes from CNVR features)
- top_performing_models_lte_100_features.tsv.gz - model prediction performance metrics for top models per method/cohort combination
- very_top_performing_models.tsv.gz - model prediction performance metrics for top model per cohort
- pathway_commons_v12_main_component_20211122.RData - Pathway Commons v12 main component, used to generate the pruned version (next entry)
- pathway_commons_v12_pruned_for_landscape_analysis_20211122.RData - pruned Pathway Commons v12 network for Figure 6
- METABRIC_PAM50_silhouettes_no_Normal-Like_2022-03-15.tsv - METABRIC subtype silhouette scores, for Figure 3D
- aklimate_predict_metabric_brca.tsv - AKLIMATE subtype predictions on rescaled METABRIC samples, for Figures 3C and 3D
- metabric_adaboost.tsv - SKGrid subtype predictions on rescaled METABRIC samples, for Figure 3C
- subscope.tar.gz - SubSCOPE Docker Image
- aklimate.tar.gz - AKLIMATE Docker Image (file)
- jadbio.tar.gz - JADBio Docker Image (file)
- models_jadbio.tar.gz - JADBio model data associated with Docker Image
- sk_grid.tar.gz - SK Grid Docker Image (file)
- cloudforest.tar.gz - CloudForest Docker Image (file)
- models_cf.tar.gz - CloudForest model data associated with Docker image
- aklimate_feature_importance_scores_20200807.tar.gz - AKLIMATE importance scores for Figure 5
- 20220425_TMP_DNA_methylation_features_analysis_COAD.tsv - Figure 5 input data; COADREAD methylation analysis
- 20220425_TMP_DNA_methylation_features_analysis_LGGGBM.tsv - Figure 5 input data; LGGGBM methylation analysis
- brca_pam50_hits.tsv - Figure 5 input data; BRCA PAM50 membership
- jadbio_ft_importances_f1.tar.gz - Figure 5 input data; JADBio feature importances
- modelID_performance2importance.json - Figure 5 input data; top models mapping of classifier name to feature set name
- modelID_performance2importance_ALLCOHORTS.json - Figure 5 input data; all models mapping of classifier name to feature set name
- TMP_sub-sampling_experiment.tgz - Sub-sampling experiment raw results, 100 predictions at each sampling size for each cohort
- skgrid_results_20210708.tar.gz - Result files from SciKit Grid
- aklimate_predictions_and_features_20200630.tar.gz - Result files from AKLIMATE
- subSCOPE_results.tar.gz - Result files from subSCOPE
- gnosis-results.tar.gz - Result files from the Gnosis platform
- Cloud_Forest_v12_sample_predictions.zip - Result files from CloudForest
- TMP_20230209.tar.gz
- ft_name_convert.tar.gz - User renaming of features to TMP nomenclature
- TMP_v12_20210228.tar.gz - Source data matrix
- very_top_performing_models.tsv.gz - Model prediction performance metrics for top model per cohort
- tmp_prediction_file_to_performance_metrics.py - Model scoring script
- model_info.json - Top model info (name, parameters, ft list) - Docker
Instructions for Data Download
Open Access Data
- Download the appropriate manifest file from the publication page
- Use the manifest file to download data using the GDC Data Transfer Tool (DTT) or the GDC API
- GDC DTT (Download, User’s Guide)
- GDC API (User’s Guide)
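As a rough illustration of the API route, the sketch below reads a manifest and maps each entry to a download URL. It assumes the manifest is tab-separated with `id`, `filename`, and `md5` columns, and that files are served from `https://api.gdc.cancer.gov/data/<id>`. The one-row manifest and the helper names are hypothetical and do not correspond to any real GDC file.

```python
import csv
import hashlib
import io

# Assumed public GDC API data endpoint; check the GDC API User's Guide.
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def read_manifest(text):
    """Parse tab-separated manifest text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def download_url(row):
    """Build the per-file download URL from the manifest's `id` column."""
    return f"{GDC_DATA_ENDPOINT}/{row['id']}"


def md5_matches(payload, row):
    """Check downloaded bytes against the manifest's `md5` column."""
    return hashlib.md5(payload).hexdigest() == row["md5"]


# Hypothetical one-row manifest, for illustration only.
EXAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000000\texample.tsv\t"
    "5d41402abc4b2a76b9719d911017c592\t5\treleased\n"
)

rows = read_manifest(EXAMPLE_MANIFEST)
urls = [download_url(row) for row in rows]
```

Each URL can then be fetched with any HTTP client and the result checked with `md5_matches`. With the Data Transfer Tool, the equivalent step is typically `gdc-client download -m <manifest file>`.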
Controlled Access Data
- Download the appropriate manifest file from the publication page
- Download a token from the GDC Data Portal
- GDC Data Portal (Launch, User’s Guide)
- Use the manifest file and token to download data using the GDC DTT or the GDC API
- GDC DTT (Download, User’s Guide)
- GDC API (User’s Guide)
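For controlled-access files, the token has to accompany each API request. The sketch below builds such a request without sending it. It assumes the GDC API accepts the token in an `X-Auth-Token` header and serves files from `https://api.gdc.cancer.gov/data/<id>`. The file ID, token string, and helper names are placeholders.

```python
import urllib.request

# Assumed GDC API data endpoint; check the GDC API User's Guide.
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def load_token(token_text):
    """Strip surrounding whitespace and newlines from token file contents."""
    return token_text.strip()


def authenticated_request(file_id, token):
    """Build (but do not send) a GET request carrying the token header."""
    return urllib.request.Request(
        f"{GDC_DATA_ENDPOINT}/{file_id}",
        headers={"X-Auth-Token": token},
    )


# Placeholder values, for illustration only.
token = load_token("  PLACEHOLDER-TOKEN\n")
request = authenticated_request("00000000-0000-0000-0000-000000000000", token)

# Passing `request` to urllib.request.urlopen would perform the download.
```

With the Data Transfer Tool, the token file is usually supplied alongside the manifest, for example `gdc-client download -m <manifest file> -t <token file>`. Tokens are credentials and should not be committed to scripts or shared.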
For assistance, please contact the GDC Help Desk: support@nci-gdc.datacommons.io.