GDC Community Tools

Explore the External Tools

10 External Tools

GDC MAF Aggregation Tool

Authors: GDC Development Team| Posted Date:Apr, 2020

The GDC MAF tool aggregates aliquot-level MAFs, each of which originates from one tumor-normal pair. MAFs can be aggregated at the project level or by providing a set of files or cases. Note that the GDC MAF tool currently supports only Ensemble aliquot-level MAFs generated from whole exome sequencing. Ensemble aliquot-level MAFs include variants from all five variant callers (including MuTect2, MuSE, Varscan2, and Pindel) and record which caller each variant originated from. The GDC MAF tool will only aggregate MAFs from within one GDC project.
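The aggregation idea can be illustrated with a minimal Python sketch. This is not the GDC MAF tool's implementation: it only assumes that each MAF is tab-delimited text, may begin with `#` comment lines, and shares one header row with the others. The function name `aggregate_mafs` is hypothetical.

```python
import csv


def aggregate_mafs(maf_texts):
    """Concatenate MAF tables that share one header, keeping a single header row."""
    header = None
    rows = []
    for text in maf_texts:
        # Skip blank lines and '#' comment lines (for example a version line).
        lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
        reader = csv.reader(lines, delimiter="\t")
        file_header = next(reader)
        if header is None:
            header = file_header
        elif file_header != header:
            # Assumption of this sketch: inputs with different columns are rejected.
            raise ValueError("MAF headers differ; refusing to merge")
        rows.extend(reader)
    return header, rows
```

Calling `aggregate_mafs` with the text of several aliquot-level MAFs returns the shared header and every variant row in input order. A real aggregation would also need to restrict the inputs to one GDC project, as described above.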

gdc-readgroups

Data Submission

Authors: Jeremiah Savage - GDC Bioinformatics Team| Posted Date:Apr, 2019

The gdc-readgroups tool is a good starting point for any group that plans to submit molecular data in BAM format to the Genomic Data Commons. The tool automatically retrieves required and useful read group metadata from a BAM file header and writes it in a format that can be submitted to the GDC.
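To show what "read group metadata from a BAM file header" looks like, here is a minimal Python sketch that parses `@RG` lines from header text (for example, the output of `samtools view -H`). It is not the gdc-readgroups implementation; the function name `parse_read_groups` is hypothetical, and mapping the SAM tags to GDC submission fields is left out.

```python
def parse_read_groups(sam_header):
    """Return one dict of tag -> value for every @RG line in a SAM/BAM header."""
    read_groups = []
    for line in sam_header.splitlines():
        if not line.startswith("@RG"):
            continue
        fields = line.split("\t")[1:]
        # Each field looks like "ID:rg1". Split on the first colon only,
        # because values such as dates in the DT tag contain colons themselves.
        read_groups.append(dict(f.split(":", 1) for f in fields if ":" in f))
    return read_groups
```

Each returned dictionary holds the standard SAM read group tags that are present in the header (such as `ID`, `SM`, `PL`, and `LB`).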

Improved DNA Methylation Array Probe Annotation

Data Annotation

Authors: Wanding Zhou, Ben Berman, Peter W. Laird, Hui Shen| Posted Date:Nov, 2018

This collection of annotations includes hg38-based genomic coordinates for the Illumina Infinium HumanMethylation27, HumanMethylation450, and MethylationEPIC arrays, and masks for low-quality probes (MASK_general column). It also contains detailed information on overlapping genes (including distance to the TSS), promoters, and CpG islands. In addition, functional annotations are available, including overlap with ENCODE and ROADMAP ChromHMM chromatin states and Transcription Factor Binding Sites (TFBS).

Improved DNA Methylation Array Probe Annotation

SeSAMe (SEnsible Step-wise Analysis of Methylation data)

Data Processing

Authors: Wanding Zhou, Timothy Triche Jr., Peter W. Laird, Hui Shen| Posted Date:Nov, 2018

Tool for analyzing Infinium DNA methylation array data.

The SeSAMe Tool:

Reduction of artifactual detection in Infinium DNA methylation microarrays
Low-level processing of Illumina Infinium DNA methylation array
Quality control of DNA methylation arrays
Biological inference (sex at birth, age, karyotypes, copy number, etc.)

SeSAMe Bioconductor Package

BISCUIT (BISulfite-seq CUI Toolkit)

Data Processing

Authors: Wanding Zhou, Jacob Morrison, Timothy Triche Jr., Peter W. Laird, Hui Shen| Posted Date:Nov, 2018

Tool suite for analyzing high throughput bisulfite sequencing data.

BISCUIT performs:

Alignment and quality control of bisulfite sequencing reads
Extraction of DNA methylation level
Extraction of genetic information from bisulfite-sequencing data
Analysis of allele-specific methylation and methylation haplotype

GDC RNASeq Tool

Data Access, Data Analysis

Authors: Colin Reid, GDC User Services| Posted Date:Feb, 2018

The GDC RNASeq Tool downloads individual RNASeq files from the GDC Data Portal and merges them into matrices identified by TCGA barcode.

The GDC RNASeq Tool:

Downloads RNA-Seq / miRNA-Seq data files using a GDC manifest file
Unzips the files into separate folders identified by experimental strategy and bioinformatics workflow
Merges the files into separate matrix files
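The merge step above can be sketched in Python as follows. This is an illustration, not the GDC RNASeq Tool's code: it assumes each per-sample file is two-column, tab-delimited text (gene identifier, then value) and that files are keyed by barcode. The function name `merge_counts` and the `"NA"` fill value are hypothetical choices.

```python
def merge_counts(files_by_barcode):
    """Merge {barcode: 'gene<TAB>value' text} into (genes, barcodes, matrix)."""
    barcodes = sorted(files_by_barcode)
    table = {}
    for barcode in barcodes:
        for line in files_by_barcode[barcode].splitlines():
            if not line.strip():
                continue
            gene, value = line.split("\t")[:2]
            table.setdefault(gene, {})[barcode] = value
    genes = sorted(table)
    # Gene/sample combinations missing from the input are filled with "NA".
    matrix = [[table[g].get(b, "NA") for b in barcodes] for g in genes]
    return genes, barcodes, matrix
```

The result has one row per gene and one column per barcode, which is the matrix layout described above. Values are kept as strings so that the sketch does not alter how the source files report them.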

GDC RNASeq Tool - Github Repository

GenomicDataCommons R-Package

Data Access, Data Analysis

Authors: | Posted Date:May, 2017

The National Cancer Institute (NCI) Genomic Data Commons provides the cancer research community with an open and unified repository for sharing and accessing data across numerous cancer studies and projects via a high-performance data transfer and query infrastructure. The Bioconductor project is an open source and open development software project built on the R statistical programming environment. A major goal of the Bioconductor project is to facilitate the use, analysis, and comprehension of genomic data. The GenomicDataCommons Bioconductor package provides basic infrastructure for querying, accessing, and mining genomic datasets available from the GDC. We expect that the Bioconductor developer and bioinformatics communities will build on the GenomicDataCommons package to add higher-level functionality and expose cancer genomics data to the many state-of-the-art bioinformatics methods available in Bioconductor.

TCGAbiolinks

Data Access, Data Analysis

Authors: Tiago Chedraoui Silva, Antonio Colaprico, Catharina Olsen, Michele Ceccarelli, Gianluca Bontempi, Houtan Noushmehr| Posted Date:May, 2017

TCGAbiolinks was developed as an R/Bioconductor package to address challenges with data mining and analysis of cancer genomics data stored at the GDC. We offer bioinformatics solutions by using a guided workflow that allows users to query, download, and perform integrative analyses of GDC data. We combined methods from computer science and statistics into the pipeline and incorporated methodologies developed in previous TCGA marker studies. We also provide a graphical user interface (GUI) version of TCGAbiolinks that can run on a user's local machine. TCGAbiolinksGUI contains all the features of the R version yet gives users an easier way to navigate the analysis steps. We provide online documentation, tutorials, and video guides to assist users with the analysis.

GDC TSV Downloader

Data Access

Authors: Bill Wysocki, GDC User Services| Posted Date:May, 2017

The GDC TSV downloader allows the user to use a Manifest from the GDC Data Portal to download clinical and biospecimen metadata for a set of files in a tab-delimited format.
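As a rough illustration of the manifest-driven approach, the Python sketch below reads file IDs from manifest text and builds a query body in the filter style used by the public GDC API. It is not the GDC TSV downloader's code. It assumes the manifest is tab-delimited with an `id` column, the function names are hypothetical, and the payload shape and field names should be checked against the GDC API documentation before use. No network request is made.

```python
import csv
import io


def manifest_file_ids(manifest_text):
    """Read the 'id' column from tab-delimited manifest text (assumed layout)."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    return [row["id"] for row in reader]


def build_files_query(file_ids, fields):
    """Build a request body asking for the given fields as TSV (assumed shape)."""
    return {
        "filters": {
            "op": "in",
            "content": {"field": "file_id", "value": file_ids},
        },
        "fields": ",".join(fields),
        "format": "TSV",
        "size": len(file_ids),
    }
```

The returned dictionary could be serialized with `json.dumps` and sent to a files endpoint by an HTTP client of the reader's choice. The clinical or biospecimen fields to request are passed in by the caller and are not fixed by this sketch.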

GDCtools

Data Access

Authors: Broad Institute Genome Data Analysis Center| Posted Date:May, 2017

GDCtools is a set of open-source, config-file-driven Python and UNIX CLI utilities for interacting with the NCI Genomic Data Commons and automating the data cleansing, aggregation, and reporting steps that are common to most data-driven science projects. It grew from efforts at the Broad Institute to adapt the GDAC Firehose pipeline, developed in TCGA, to use the GDC as its primary source of data, but aims to go well beyond that. By wrapping the GDC API in a set of rigorously defined and domain-aware tools, GDCtools lets users interact with the GDC in terms familiar to them as biomedical researchers and informaticians, rather than as web or database programmers. This can make it simpler to search and retrieve harmonized data and metadata from the GDC, and shrink the learning and staffing curves, while providing indispensable features such as:

Turnkey creation of date-stamped snapshots of data
Aggregating multiple samples into a single bolus for ready consumption by scientific algorithms
Ensuring that samples are identifiable by project (e.g. restoring TCGA ids to SNP6 segments)
Sample report and sample freeze list (load file) creation, for either on-premise or cloud storage (e.g. Google)
Aggregate cohort construction (e.g. combining TCGA STAD + ESCA cohorts into STES, with just 1 line in a config file)
Retrieving an entire project or just 1 case, with equal ease
Easily combining data across multiple projects (e.g. TCGA and CPTAC)

This is all available within a well-tested object-oriented framework that is easy for users to comprehend and extend. GDCtools is online at https://github.com/broadinstitute/gdctools, and includes documentation, examples, and a pictorial overview.

Need Assistance?

Need help with data retrieval, download, or submission?

Visit the GDC Support Page