GDC Community Tools

Disclaimer: The software packages below are not officially supported by the Genomic Data Commons. Please contact the authors of each respective tool for support.

Explore External Tools

Improved DNA Methylation Array Probe Annotation

Authors: Wanding Zhou, Ben Berman, Peter Laird, Hui Shen

This collection of annotations includes hg38-based genomic coordinates for the Illumina Infinium HumanMethylation27, HumanMethylation450, and MethylationEPIC arrays, along with masks for low-quality probes (the MASK_general column). It also contains detailed information on overlapping genes (including distance to the TSS), promoters, and CpG islands. In addition, functional annotations are available, including overlap with ENCODE and ROADMAP ChromHMM chromatin states and Transcription Factor Binding Sites (TFBS).
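As an illustration of how the MASK_general column might be used, the sketch below drops masked probes from a tab-delimited annotation table. It assumes the mask is stored as TRUE/FALSE text and that the probe identifier column is named `probeID`; both the column name and the example rows are assumptions made for this sketch, so check them against the actual annotation file.

```python
import csv
import io


def unmasked_probes(tsv_text, id_column="probeID", mask_column="MASK_general"):
    """Return the IDs of probes whose mask flag is not TRUE.

    The column names are assumptions; adjust them to match the real file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row[id_column]
        for row in reader
        if row[mask_column].strip().upper() != "TRUE"
    ]


# Made-up example table; real annotation files contain many more columns.
EXAMPLE_ANNOTATION = (
    "probeID\tMASK_general\n"
    "cg_example_1\tFALSE\n"
    "cg_example_2\tTRUE\n"
    "cg_example_3\tFALSE\n"
)

if __name__ == "__main__":
    print(unmasked_probes(EXAMPLE_ANNOTATION))
```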


SeSAMe (SEnsible Step-wise Analysis of Methylation data)

Authors: Wanding Zhou, Timothy Triche, Peter W. Laird, Hui Shen

Tool for analyzing Infinium DNA methylation array data.

SeSAMe provides:

  • Reduced artifactual detection of DNA methylation by Infinium microarrays
  • Low-level processing of Illumina Infinium DNA methylation arrays
  • Quality control of DNA methylation arrays
  • Biological inference (sex, age, karyotypes, copy number, etc.)

BISCUIT (BISulfite-seq CUI Toolkit)

Authors: Wanding Zhou, Timothy Triche Jr., Peter W. Laird, Hui Shen

Tool suite for analyzing high throughput bisulfite sequencing data.

BISCUIT performs:

  • Alignment and quality control of bisulfite sequencing reads
  • Extraction of DNA methylation levels
  • Extraction of genetic information from bisulfite-sequencing data
  • Analysis of allele-specific methylation and methylation haplotype

GDC RNASeq Tool

Author: Colin Reid; GDC User Services

The GDC RNASeq Tool downloads individual RNA-Seq files from the GDC Data Portal and merges them into matrices identified by TCGA barcode.

The GDC RNASeq Tool:

  • Downloads RNA-Seq / miRNA-Seq data files using a GDC manifest file
  • Unzips the files into separate folders identified by experimental strategy and bioinformatics workflow
  • Merges the files into separate matrix files
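The merge step above can be pictured with a small sketch. This is not the GDC RNASeq Tool's own code: the function names, the sample identifiers, and the choice to fill missing genes with 0 are all assumptions made for illustration. It takes per-sample gene counts and produces one matrix with a column per sample.

```python
import csv
import io


def merge_counts(per_sample):
    """Merge {sample_id: {gene_id: count}} into a header and matrix rows.

    Genes absent from a sample are filled with 0 (an illustrative choice).
    """
    samples = sorted(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    header = ["gene_id"] + samples
    rows = [[gene] + [per_sample[s].get(gene, 0) for s in samples] for gene in genes]
    return header, rows


def to_tsv(header, rows):
    """Serialize the merged matrix as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical inputs standing in for parsed per-file counts.
EXAMPLE_COUNTS = {
    "SAMPLE-A": {"GENE1": 10, "GENE2": 5},
    "SAMPLE-B": {"GENE2": 7},
}

if __name__ == "__main__":
    header, rows = merge_counts(EXAMPLE_COUNTS)
    print(to_tsv(header, rows), end="")
```

In a real workflow the column labels would be the TCGA barcodes associated with each downloaded file rather than the placeholder sample names used here.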

GDC TSV Downloader

Author: Bill Wysocki; GDC User Services

The GDC TSV Downloader takes a manifest from the GDC Data Portal and downloads clinical and biospecimen metadata for the listed files in a tab-delimited format.
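To make the manifest-to-metadata idea concrete, here is a minimal sketch that reads file UUIDs from a manifest and builds a request body for the GDC API `files` endpoint. It is not the TSV Downloader's implementation. The helper names and the particular fields requested are assumptions chosen for illustration, and the manifest is assumed to be tab-delimited with an `id` column.

```python
import csv
import io
import json

# Public GDC API endpoint for file-level metadata queries.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def manifest_file_ids(manifest_text):
    """Extract file UUIDs from the 'id' column of a tab-delimited manifest."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    return [row["id"] for row in reader]


def build_metadata_request(file_ids, fields):
    """Build a JSON-serializable body asking for TSV metadata on the given files."""
    return {
        "filters": {
            "op": "in",
            "content": {"field": "file_id", "value": list(file_ids)},
        },
        "fields": ",".join(fields),
        "format": "TSV",
        "size": str(len(file_ids)),
    }


# Made-up manifest rows; the UUIDs and file names are placeholders.
EXAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "uuid-1\tfile_one.tsv\tabc\t100\treleased\n"
    "uuid-2\tfile_two.tsv\tdef\t200\treleased\n"
)

# Illustrative field selection, not the tool's actual list.
EXAMPLE_FIELDS = ["file_id", "file_name", "cases.submitter_id"]

if __name__ == "__main__":
    ids = manifest_file_ids(EXAMPLE_MANIFEST)
    print(json.dumps(build_metadata_request(ids, EXAMPLE_FIELDS), indent=2))
```

Sending the body to the endpoint (for example with an HTTP POST) is deliberately left out so that the sketch runs offline.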


GenomicDataCommons R-Package

The National Cancer Institute (NCI) Genomic Data Commons provides the cancer research community with an open and unified repository for sharing and accessing data across numerous cancer studies and projects via a high-performance data transfer and query infrastructure. The Bioconductor project is an open-source, open-development software project built on the R statistical programming environment. A major goal of the Bioconductor project is to facilitate the use, analysis, and comprehension of genomic data. The GenomicDataCommons Bioconductor package provides basic infrastructure for querying, accessing, and mining genomic datasets available from the GDC. We expect that the Bioconductor developer and bioinformatics communities will build on the GenomicDataCommons package to add higher-level functionality and expose cancer genomics data to the many state-of-the-art bioinformatics methods available in Bioconductor.


TCGAbiolinks

Authors: Tiago Chedraoui Silva, Antonio Colaprico, Catharina Olsen, Michele Ceccarelli, Gianluca Bontempi, and Houtan Noushmehr

TCGAbiolinks was developed as an R/Bioconductor package to address challenges with data mining and analysis of cancer genomics data stored at the GDC. We offer bioinformatics solutions by using a guided workflow to allow users to query, download, and perform integrative analyses of GDC data. We combined methods from computer science and statistics into the pipeline and incorporated methodologies developed in previous TCGA marker studies. We also provide a graphical user interface (GUI) version of TCGAbiolinks that can run on a user's local machine. TCGAbiolinksGUI contains all the features of the R version yet gives users an easier way to navigate the analysis steps. We provide online documentation, tutorials, and video guides to assist users with the analysis.


GDCtools

Author: Broad Institute Genome Data Analysis Center

GDCtools is a set of open-source, config-file-driven Python and UNIX CLI utilities for interacting with the NCI Genomic Data Commons and automating the data cleansing, aggregation, and reporting steps that are common to most data-driven science projects. It grew from efforts at the Broad Institute to adapt the GDAC Firehose pipeline, developed in TCGA, to use the GDC as its primary source of data, but aims to go well beyond that. By wrapping the GDC API in a set of rigorously defined and domain-aware tools, GDCtools lets biomedical researchers and informaticians interact with the GDC in terms familiar to them, rather than as web or database programmers. This can make it simpler to search and retrieve either legacy or harmonized data and metadata from the GDC, and can reduce the learning curve and staffing needs, while providing indispensable features such as:

  • Turnkey creation of date-stamped snapshots of data
  • Aggregating multiple samples into a single bolus for ready consumption by scientific algorithms
  • Ensuring that samples are identifiable by project (e.g. restoring TCGA ids to SNP6 segments)
  • Sample report and sample freeze list (load file) creation, for either on-premise or cloud storage (e.g. Google)
  • Aggregate cohort construction (e.g. combining TCGA STAD + ESCA cohorts into STES, with just 1 line in a config file)
  • Retrieving an entire project or just 1 case, with equal ease
  • Easily combining data across multiple projects (e.g. TCGA and CPTAC)

This is all available within a well-tested object-oriented framework that is easy to comprehend and extend by users. GDCtools is online at https://github.com/broadinstitute/gdctools, and includes documentation, examples and a pictorial overview.
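The aggregate cohort feature listed above (for example, STAD + ESCA combined into STES) can be illustrated with a short sketch. This does not use GDCtools or its config-file syntax. The function name, the mapping layout, and the case identifiers are all hypothetical and serve only to show the idea of defining a combined cohort from its member cohorts.

```python
def build_aggregates(cohorts, aggregates):
    """Combine member cohorts into aggregate cohorts.

    cohorts:    {cohort_name: [case_id, ...]}
    aggregates: {aggregate_name: [member_cohort_name, ...]}
    Case order is preserved and duplicate case IDs are kept only once.
    """
    combined = {}
    for aggregate_name, members in aggregates.items():
        seen = set()
        cases = []
        for member in members:
            for case_id in cohorts[member]:
                if case_id not in seen:
                    seen.add(case_id)
                    cases.append(case_id)
        combined[aggregate_name] = cases
    return combined


# Placeholder case IDs; real cohorts would hold TCGA case identifiers.
EXAMPLE_COHORTS = {
    "STAD": ["case-s1", "case-s2"],
    "ESCA": ["case-e1"],
}

# One entry per aggregate, mirroring the "one line per aggregate" idea.
EXAMPLE_AGGREGATES = {"STES": ["STAD", "ESCA"]}

if __name__ == "__main__":
    print(build_aggregates(EXAMPLE_COHORTS, EXAMPLE_AGGREGATES))
```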
