GDC Community Tools

The GDC API was designed specifically to accommodate third-party development tools. This page highlights some of the third-party contributions that have been built to make the GDC more accessible to a wide range of users. If you are aware of contributions that are not listed here, please contact GDC User Support at support@nci-gdc.datacommons.io.

Disclaimer: The software packages below are not officially supported by the Genomic Data Commons. Please contact the authors of each respective tool for support.

Explore the External Analysis Tools

GDC RNASeq Tool

Author: Colin Reid; GDC User Services

The GDC RNASeq Tool downloads individual RNA-Seq files from the GDC Data Portal and merges them into matrices identified by TCGA barcode.

The GDC RNASeq Tool:

  • Downloads RNA-Seq / miRNA-Seq data files using a GDC manifest file
  • Unzips the files into separate folders identified by experimental strategy and bioinformatics workflow
  • Merges the files into separate matrix files
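
The merge step can be pictured with a minimal Python sketch. This is not the tool's own code: the function name `merge_counts`, the two-column (gene ID, value) file layout, and the sample barcodes below are assumptions made purely for illustration.

```python
import csv
import io


def merge_counts(samples):
    """Merge per-sample (gene_id, value) tables into one matrix.

    `samples` maps a barcode to the text of a two-column, tab-delimited
    table. Returns (header, rows) with one column per barcode; genes that
    are missing from a sample are filled with "NA".
    """
    barcodes = sorted(samples)
    per_sample = {}
    genes = set()
    for barcode in barcodes:
        reader = csv.reader(io.StringIO(samples[barcode]), delimiter="\t")
        # Skip blank or malformed lines that lack both columns.
        values = {row[0]: row[1] for row in reader if len(row) >= 2}
        per_sample[barcode] = values
        genes.update(values)
    header = ["gene_id"] + barcodes
    rows = [
        [gene] + [per_sample[b].get(gene, "NA") for b in barcodes]
        for gene in sorted(genes)
    ]
    return header, rows


# Hypothetical input: two samples keyed by made-up TCGA-style barcodes.
example = {
    "TCGA-AA-0001-01A": "ENSG01\t10\nENSG02\t0\n",
    "TCGA-AA-0002-01A": "ENSG01\t7\nENSG03\t3\n",
}
header, rows = merge_counts(example)
for line in [header] + rows:
    print("\t".join(line))
```

Each column of the resulting matrix corresponds to one barcode, which is the property the tool description emphasizes; how the real tool handles missing genes or file headers is not stated here.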

GDC TSV Downloader

Author: Bill Wysocki; GDC User Services

The GDC TSV downloader allows the user to use a Manifest from the GDC Data Portal to download clinical and biospecimen metadata for a set of files in a tab-delimited format.
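
As a rough illustration of the idea (not the downloader's actual implementation), the sketch below reads file UUIDs from a manifest and builds a request body that asks for metadata in TSV format. The helper names `manifest_ids` and `build_payload`, the manifest contents, and the chosen field names are assumptions; the filter structure follows the general shape of GDC API queries.

```python
import csv
import io
import json


def manifest_ids(manifest_text):
    """Return the file UUIDs from the `id` column of a manifest."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    return [row["id"] for row in reader]


def build_payload(file_ids, fields):
    """Build a request body asking for `fields` of the given files as TSV."""
    return {
        "filters": {
            "op": "in",
            "content": {"field": "file_id", "value": file_ids},
        },
        "fields": ",".join(fields),
        "format": "TSV",
        "size": len(file_ids),
    }


# Made-up manifest content; real manifests come from the GDC Data Portal.
manifest = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "11111111-aaaa-bbbb-cccc-000000000001\ta.tsv\tabc\t10\treleased\n"
    "11111111-aaaa-bbbb-cccc-000000000002\tb.tsv\tdef\t20\treleased\n"
)
ids = manifest_ids(manifest)
payload = build_payload(ids, ["file_id", "cases.submitter_id"])
print(json.dumps(payload, indent=2))
```

A body like this could then be sent to the GDC API with any HTTP client; the sketch deliberately stops before the network call so that it runs offline.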


GenomicDataCommons R-Package

The National Cancer Institute (NCI) Genomic Data Commons provides the cancer research community with an open and unified repository for sharing and accessing data across numerous cancer studies and projects via a high-performance data transfer and query infrastructure. The Bioconductor project is an open-source, open-development software project built on the R statistical programming environment. A major goal of the Bioconductor project is to facilitate the use, analysis, and comprehension of genomic data. The GenomicDataCommons Bioconductor package provides basic infrastructure for querying, accessing, and mining genomic datasets available from the GDC. We expect that the Bioconductor developer and bioinformatics communities will build on the GenomicDataCommons package to add higher-level functionality and expose cancer genomics data to the many state-of-the-art bioinformatics methods available in Bioconductor.


TCGAbiolinks

Authors: Tiago Chedraoui Silva, Antonio Colaprico, Catharina Olsen, Michele Ceccarelli, Gianluca Bontempi, and Houtan Noushmehr

TCGAbiolinks was developed as an R/Bioconductor package to address challenges with data mining and analysis of cancer genomics data stored at the GDC. We offer bioinformatics solutions by using a guided workflow that allows users to query, download, and perform integrative analyses of GDC data. We combined methods from computer science and statistics into the pipeline and incorporated methodologies developed in previous TCGA marker studies. We also provide a graphical user interface (GUI) version of TCGAbiolinks that can run on a user's local machine. TCGAbiolinksGUI contains all the features of the R version while giving users an easier way to navigate the analysis steps. We provide online documentation, tutorials, and video guides to assist users with the analysis.


GDCtools

Author: Broad Institute Genome Data Analysis Center

GDCtools is a set of open-source, config-file-driven Python and UNIX CLI utilities for interacting with the NCI Genomic Data Commons and automating the data cleansing, aggregation, and reporting steps that are common to most data-driven science projects. It grew from efforts at the Broad Institute to adapt the GDAC Firehose pipeline, developed in TCGA, to use the GDC as its primary source of data, but aims to go well beyond that. By wrapping the GDC API in a set of rigorously defined and domain-aware tools, GDCtools lets users interact with the GDC in terms familiar to them as biomedical researchers and informaticians, rather than as web or database programmers. This can make it simpler to search and retrieve either legacy or harmonized data and metadata from the GDC, and shrink the learning and staffing curves, while providing indispensable features such as:

  • Turnkey creation of date-stamped snapshots of data
  • Aggregating multiple samples into a single bolus for ready consumption by scientific algorithms
  • Ensuring that samples are identifiable by project (e.g. restoring TCGA IDs to SNP6 segments)
  • Sample report and sample freeze list (load file) creation, for either on-premise or cloud storage (e.g. Google)
  • Aggregate cohort construction (e.g. combining TCGA STAD + ESCA cohorts into STES, with just 1 line in a config file)
  • Retrieving an entire project or just 1 case, with equal ease
  • Easily combining data across multiple projects (e.g. TCGA and CPTAC)

This is all available within a well-tested object-oriented framework that is easy for users to comprehend and extend. GDCtools is online at https://github.com/broadinstitute/gdctools, and includes documentation, examples and a pictorial overview.
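
The aggregate cohort feature in the list above (e.g. STES built from STAD and ESCA) can be illustrated conceptually with a short Python sketch. This does not use GDCtools or reproduce its config syntax: the function `build_aggregates`, the dictionary layout, and the case identifiers are hypothetical and serve only to show the idea of defining a new cohort as the union of existing ones.

```python
def build_aggregates(cohort_cases, aggregates):
    """Return a cohort -> cases mapping with aggregate cohorts added.

    `cohort_cases` maps a cohort name to its list of case IDs.
    `aggregates` maps a new cohort name to the member cohorts it combines.
    Case order is preserved and duplicates across members are dropped.
    """
    result = {name: list(cases) for name, cases in cohort_cases.items()}
    for agg_name, members in aggregates.items():
        merged = []
        seen = set()
        for member in members:
            for case in cohort_cases[member]:
                if case not in seen:
                    seen.add(case)
                    merged.append(case)
        result[agg_name] = merged
    return result


# Hypothetical case IDs; the one-entry mapping plays the role of the
# single config line mentioned in the feature list.
cohorts = {"STAD": ["case-1", "case-2"], "ESCA": ["case-3"]}
combined = build_aggregates(cohorts, {"STES": ["STAD", "ESCA"]})
print(combined["STES"])
```

Keeping the aggregate definition as plain data, separate from the merging logic, mirrors the config-driven approach the description attributes to GDCtools.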