GDC and Artificial Intelligence (AI)

AI in Cancer Genomics

Artificial Intelligence (AI), along with its subfields of Machine Learning and Deep Learning, is transforming the field of cancer genomics. These advanced technologies are pivotal in identifying mutation patterns, classifying disease and predicting progression, developing new treatments, and enhancing our understanding of cancer biology. AI is driving breakthroughs and accelerating discoveries through the efficient processing of vast genomic datasets, revolutionizing how we approach cancer research and treatment. 

For information on the use of AI across the NCI, please refer to the "AI and Cancer" and "AI in Cancer Research" sites.

AI in the GDC

Large foundational datasets are critical for training and developing the next generation of AI applications. The GDC offers the cancer research community access to high-quality, harmonized genomic, clinical, and biospecimen data, as well as whole slide images, for use in developing AI models and algorithms. Below are example AI applications showing how the research community uses GDC data.

Applications and Objectives
Biomarker Discovery
  • Identify biomarkers targeted by immunotherapy using genomic data
  • Train machine learning classifiers to detect driver genes
Cancer Diagnosis & Risk Prediction
  • Integrate whole slide imaging and omics data to enhance diagnostic predictions
  • Train large language models (LLMs) with somatic mutation data from whole genomes to predict cancer risk
Content Generation
  • Accelerate bioinformatics research and genomic discoveries with a natural language processing (NLP)-enabled biomedical research platform
Cancer Type Classification
  • Diagnose and classify cancer types using whole slide images (WSI) 
  • Classify cancer types based on SNPs or gene expression data
  • Determine the risk of metastasis by classifying cancer cells
Drug Development & Target Discovery
  • Identify drug discovery targets by integrating omics and clinical data
  • Explore drug repurposing opportunities using driver gene analysis
  • Classify patient responses to treatments to guide drug design
Feature Detection
  • Employ GDC data as training and test datasets for AI feature detection algorithms
  • Computationally annotate features on tumor histology slides
  • Detect biopsy features to assess sample quality
Image Segmentation & Quality Control
  • Apply AI for WSI segmentation to support image analysis
  • Enhance WSI quality by correcting potential artifacts
Model Development
  • Use cancer genomics and deep learning to advance understanding of cancer biology
  • Train and test machine learning models with whole genome sequences and mutations
  • Develop models to annotate small RNA sequences
  • Build AI models integrating WSI and clinical data
Personalized Medicine
  • Recommend personalized therapies to improve patient outcomes
Predictive Analysis
  • Predict molecular subtypes using gene features
  • Use WSI to predict mutations
  • Apply deep learning to predict cancer stages and survival outcomes
Survival Analysis
  • Perform survival analysis with AI using whole slide imaging and genomic data
  • Utilize deep learning models to forecast one-year survival rates
Genomic Data Analysis
  • Classify somatic variants for Gene Set Enrichment Analysis

GDC Resources Supporting AI

The GDC offers a variety of resources to support the use of GDC data in AI applications:

  • GDC Data Portal: Explore, analyze, and download data from the GDC for specific cancer cohorts
  • GDC Application Programming Interface (API): Programmatically query, download, and analyze GDC data
  • GDC Data Dictionary: Access detailed information on genomic, clinical, and biospecimen properties within GDC data
  • GDC Data Transfer Tool (DTT): Efficiently download large datasets with this high-performance command-line tool
  • Harmonized Data: Download genomic data that has been standardized using GDC workflows, including DNA and RNA sequence data that has been aligned against a common reference genome build and derived data such as mutation calls and structural variants. Obtain access to clinical and biospecimen data, and genomic metadata harmonized using common data elements from the GDC Data Dictionary.
  • High-Quality Data: Access data that has undergone rigorous quality control and validation checks, including genomic quality metrics and clinically validated datasets
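
As an illustration of the programmatic access listed above, the following is a minimal sketch of a file query against the GDC API's public `files` endpoint, using only the Python standard library. The helper names (`build_files_query`, `build_files_url`, `fetch_file_hits`) are hypothetical, and the project ID `TCGA-BRCA` and data type `Slide Image` are illustrative values that should be checked against the GDC Data Dictionary and API documentation before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public GDC API endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, data_type, size=10):
    """Build query parameters for a GDC file search.

    The project ID and data type are caller-supplied; the values used in
    the usage note are illustrative assumptions, not recommendations.
    """
    # GDC filters are nested JSON: an "and" of two "in" conditions.
    filters = {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            },
            {
                "op": "in",
                "content": {"field": "data_type", "value": [data_type]},
            },
        ],
    }
    return {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": str(size),
    }


def build_files_url(project_id, data_type, size=10):
    """Return the full GET URL for the query, with parameters URL-encoded."""
    params = build_files_query(project_id, data_type, size)
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)


def fetch_file_hits(project_id, data_type, size=10):
    """Run the query over the network and return the list of matching files."""
    url = build_files_url(project_id, data_type, size)
    with urlopen(url, timeout=30) as response:
        payload = json.load(response)
    # File records are returned under data.hits in the JSON response.
    return payload["data"]["hits"]
```

Calling `fetch_file_hits("TCGA-BRCA", "Slide Image", size=5)` would request metadata for up to five matching files. The returned file IDs could then be downloaded through the API or collected into a manifest for the Data Transfer Tool. Controlled-access data additionally requires an authentication token, which this sketch does not cover.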

The GDC is committed to supporting AI and cancer applications by providing well-documented, accessible, usable, and high-quality data. We encourage community feedback through GDC Support to better understand the needs for cancer genomic resources that facilitate AI development.