
Comprehensive Molecular Characterization of Human Colon and Rectal Cancer

Nature. 487: p330-337, 18 July 2012 10.1038/nature11252

To characterize somatic alterations in colorectal carcinoma (CRC), we conducted a genome-scale analysis of 276 samples, analyzing exome sequence, DNA copy number, promoter methylation, and mRNA and microRNA expression. A subset (97) underwent low-depth-of-coverage whole-genome sequencing. Sixteen percent of CRCs are hypermutated: three quarters of these have the expected high microsatellite instability (MSI), usually with hypermethylation and MLH1 silencing, but one quarter have somatic mismatch repair gene mutations. Excluding hypermutated cancers, colon and rectum cancers have remarkably similar patterns of genomic alteration. Twenty-four genes are significantly mutated. In addition to the expected APC, TP53, SMAD4, PIK3CA and KRAS mutations, we found frequent mutations in ARID1A, SOX9, and FAM123B/WTX. Recurrent copy number alterations include potentially drug-targetable amplifications of ERBB2 and a newly discovered amplification of IGF2. Recurrent chromosomal translocations include fusion of NAV2 and the WNT pathway member TCF7L1. Integrative analyses suggest new markers for aggressive CRC and an important role for MYC-directed transcriptional activation and repression.

Data in the GDC

Supplemental Data

These data represent a data freeze from Feb 02, 2012. 

Some archives listed for download below contain data for more samples than were included in the publication. Use Supplementary Table 1 as the key for identifying which samples in those archives were part of the published analysis.
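
As an illustration of using a participant list as the key, the following minimal Python sketch keeps only the aliquots whose participant appears in the published set. It assumes the standard TCGA barcode layout, in which the first three dash-separated fields identify the participant. The barcodes and function names are hypothetical and are not taken from Supplementary Table 1.

```python
def participant_id(barcode: str) -> str:
    """Return the participant portion of a TCGA barcode.

    Assumes the first three dash-separated fields
    (project, tissue source site, participant) identify the participant.
    """
    return "-".join(barcode.split("-")[:3])


def filter_to_publication(barcodes, published_participants):
    """Keep only barcodes whose participant is in the published list."""
    published = set(published_participants)
    return [b for b in barcodes if participant_id(b) in published]


# Hypothetical identifiers, for illustration only.
archive_barcodes = [
    "TCGA-AA-0001-01A-01D-0001-01",
    "TCGA-AA-0002-01A-01D-0001-01",
    "TCGA-AG-0003-01A-01D-0001-01",
]
published = ["TCGA-AA-0001", "TCGA-AG-0003"]

kept = filter_to_publication(archive_barcodes, published)
print(kept)
```

In practice the `published` list would be read from the participant list files or from the Summary worksheet rather than typed in by hand.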

  • Participant Lists
    • Cumulative (COAD and READ) Participant List [txt]
    • Colon (COAD) Participant List [txt]
    • Rectal (READ) Participant List [txt]
  • Mutations
    • Somatic mutations [xlsx]
    • MAGE-TAB archives
      • COAD MAGE-TAB archive [tar]
      • READ MAGE-TAB archive [tar]
    • Exome Sequence BAM File References [tsv/txt]
  • Expression
    • Agilent microarray
    • RNASeq
  • Copy Number
    • Copy number and structural aberrations derived from low-pass whole-genome sequencing using the Illumina HiSeq platform
    • GISTIC marker file [txt.zip]
  • miRNA
  • DNA Methylation
  • Clinical Data
    • Clinical data summary [txt]
    • Biotab files [tar]
    • Clinical data Level 1 archive [tar]
  • Annotations
    • Annotations for all biospecimens [txt]

Supplementary Files

  • Supplementary Table 1 [xls|html]: A brief description of the contents of each worksheet follows:
    • Summary: for each participant, this table provides a summary of data types analyzed and the clinical data values used as input for analysis.
    • Mutations: a list of BAM files and their metadata that were used for identifying mutations.
    • microRNA: a list of participants, aliquot IDs, and the archives that contain the data used as input for analysis of microRNA.
    • SNP6: a list of participants, aliquot IDs, and the archives that contain the data used as input for analysis of copy number.
    • WGS: a list of low-pass whole-genome sequencing (WGS) BAM files and their metadata that were used for identifying structural variation.
    • RNASeq: a list of participants, aliquot IDs, and the archives that contain the data used as input for analysis of RNA Sequence gene expression.
    • Agilent: a list of participants, aliquot IDs, and the archives that contain the data used as input for analysis of Agilent microarray gene expression.
    • Methylation: a list of participants, aliquot IDs, and the archives that contain the data used as input for analysis of methylation.
  • Affymetrix SNP 6 data: The Level 3 data were derived from Level 1 data, which are available at the DCC, using a method developed for this work. The standard Level 3 data now at the DCC were not used in this analysis.
    • Data matrix [tsv.zip]
    • Level 3 archives [tar]
    • Level 1 archives [tar]
    • Level 2 archives [tar]
    • MAGE-TAB archives [tar]
  • PARADIGM pathway analysis (Supplementary Data File 1) [zip]: A ZIP archive file containing the relevant data to reproduce the PARADIGM pathway analysis. The archive contains the following five files:
    • SuperPathway.txt: The Superimposed Pathway used by the PARADIGM analysis, containing all of the merged concepts and interactions pooled from the NCI-PID, Reactome, and BioCarta databases. The top of the file declares all of the concepts (genes, complexes, families, processes). Beneath these declarations are all of the regulatory interactions: transcriptionally activating (-t>), transcriptionally inactivating (-t|), subunit-to-complex relations (-component>), post-transcriptionally activating (-a>), post-transcriptionally inactivating (-a|), activation of an abstract process (-ap>), inhibition of an abstract process (-ap|), and membership in a family (-member>).
    • tcgaCOADREAD_Expression.vNormal.MANUSCRIPT.tab: A PARADIGM-ready version of the expression data formatted as a tab-delimited file with the expression rank-ratios given as input to the PARADIGM algorithm.
    • tcgaCOADREAD_CNV.vNormal.MANUSCRIPT.tab: A PARADIGM-ready version of the copy number data. A tab-delimited file containing the copy number rank-ratios given as input to the PARADIGM algorithm.
    • params.txt: The set of parameters needed to run PARADIGM. They determine the initial settings of the concept- and interaction-related constraints (probabilistic factors). These parameters were learned in previous rounds of learning on other cancer cohorts and reused for this analysis.
    • config.txt: Contains settings for how PARADIGM's inference engine was run for the CRC analysis. The file specifies that the belief propagation method for maximum likelihood inference should be used with a maximum of 10,000 iterations for convergence, and that the gene expression and copy number datasets to be used are the files listed above.
  • Cytoscape data (Supplementary Data File 2) [cys]: A network of the pathway concepts found by PARADIGM to be significantly modulated across the colonic and rectal tumor samples. The file contains the network as a Cytoscape session that has been tested on versions 2.6 and later. Nodes in the network correspond to concepts in the Superimposed Pathway and include genes (circles), complexes (hexagons), families (triangles), and cellular processes (boxes). Concepts are connected by regulatory interactions depicted as either activating (arrows) or inhibiting (T-bars) at the transcriptional level (solid lines) or post-transcriptional level (dashed lines). Subunit membership in complexes is depicted using undirected dashed lines. The network includes concepts with higher activation (red nodes) or inactivation (blue nodes) in tumors compared to normal. The size and opacity of the nodes are drawn as a function of the modulation score.
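
The SuperPathway.txt layout described above (concept declarations followed by typed interactions) could be read with a short Python sketch like the one below. It assumes the file is tab-delimited, with two fields per declaration line (concept type, concept name) and three fields per interaction line (source, target, link type). That column layout is an assumption and is not stated on this page. The concept names in the sample are hypothetical.

```python
from collections import Counter

# Link types as listed in the SuperPathway.txt description.
LINK_TYPES = {
    "-t>": "transcriptional activation",
    "-t|": "transcriptional inactivation",
    "-component>": "subunit-to-complex relation",
    "-a>": "post-transcriptional activation",
    "-a|": "post-transcriptional inactivation",
    "-ap>": "activation of an abstract process",
    "-ap|": "inhibition of an abstract process",
    "-member>": "family membership",
}


def parse_pathway(lines):
    """Split pathway lines into concept declarations and interactions.

    Assumed layout (not confirmed by the source):
      2 tab-separated fields -> concept type, concept name
      3 tab-separated fields -> source, target, link type
    Lines with any other field count (e.g. blank lines) are skipped.
    """
    concepts = {}
    interactions = []
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if len(fields) == 2:
            kind, name = fields
            concepts[name] = kind
        elif len(fields) == 3:
            source, target, link = fields
            if link not in LINK_TYPES:
                raise ValueError(f"unknown link type: {link!r}")
            interactions.append((source, target, link))
    return concepts, interactions


# Hypothetical concepts, for illustration only.
sample = [
    "protein\tGENE_A",
    "protein\tGENE_B",
    "complex\tGENE_A/GENE_B",
    "",
    "GENE_A\tGENE_B\t-t>",
    "GENE_A\tGENE_A/GENE_B\t-component>",
    "GENE_B\tGENE_A/GENE_B\t-component>",
]

concepts, interactions = parse_pathway(sample)
link_counts = Counter(link for _, _, link in interactions)
for link, count in sorted(link_counts.items()):
    print(f"{link}\t{LINK_TYPES[link]}\t{count}")
```

To run it on the real file, pass an open file handle instead of `sample`; if the actual column layout differs, only the field-count checks in `parse_pathway` should need to change.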

Analytical Tools

  • CRC Aggressiveness - Explore heterogeneous molecular signatures (Mutation, Expression, Copy Number, miRNA, and Methylation) for tumor aggressiveness in colorectal carcinoma (CRC) in the context of the genome using Regulome Explorer.
  • cBio Cancer Genomics Portal

Additional Resources

Instructions for Data Download

Open Access Data

  1. Download the appropriate manifest file from the publication page
  2. Use the manifest file to download data using the GDC Data Transfer Tool (DTT) or the GDC API

Controlled Access Data

  1. Download the appropriate manifest file from the publication page
  2. Download a token from the GDC Data Portal
  3. Use the manifest file and token to download data using the GDC DTT or the GDC API
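
For the GDC API route, the steps above might be sketched in Python as follows. The sketch reads file UUIDs from a manifest and builds a POST request to the GDC API `data` endpoint, adding the `X-Auth-Token` header only when a token is supplied (controlled access). It assumes the usual tab-delimited manifest with an `id` column. The UUIDs, file names, and token shown are placeholders, and the request is only constructed here, not sent.

```python
import csv
import io
import json
import urllib.request

GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def read_manifest_ids(manifest_text):
    """Return the file UUIDs from a tab-delimited GDC manifest.

    Assumes a header row that includes an 'id' column.
    """
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    return [row["id"] for row in reader]


def build_download_request(file_ids, token=None):
    """Build (but do not send) a POST request for the given file UUIDs.

    Pass a token string for controlled-access data; omit it for open access.
    """
    body = json.dumps({"ids": file_ids}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        headers["X-Auth-Token"] = token
    return urllib.request.Request(
        GDC_DATA_ENDPOINT, data=body, headers=headers, method="POST"
    )


# Placeholder manifest content, for illustration only.
manifest_text = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\texample_a.txt\tplaceholder\t10\treleased\n"
    "00000000-0000-0000-0000-000000000002\texample_b.txt\tplaceholder\t20\treleased\n"
)

ids = read_manifest_ids(manifest_text)
open_req = build_download_request(ids)
controlled_req = build_download_request(ids, token="HYPOTHETICAL-TOKEN")
print(open_req.get_method(), open_req.full_url, len(ids), "files")
```

Sending either request with `urllib.request.urlopen` would return the files, bundled into a single archive when more than one UUID is requested. For large downloads the GDC Data Transfer Tool is usually the more robust choice, for example `gdc-client download -m manifest.txt -t token.txt`, where both file names are placeholders.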

For assistance, please contact the GDC Help Desk: support@nci-gdc.datacommons.io.