GDCtools is a set of open-source, config-file-driven Python and UNIX CLI utilities for interacting with the NCI Genomics Data Commons (GDC) and automating the data cleansing, aggregation and reporting steps that are common to most data-driven science projects. It grew from efforts at the Broad Institute to adapt the GDAC Firehose pipeline, developed in TCGA, to use the GDC as its primary source of data, but aims to go well beyond that. By wrapping the GDC API in a set of rigorously defined and domain-aware tools, GDCtools lets users interact with the GDC in terms familiar to them as biomedical researchers and informaticians, rather than as web or database programmers. This can make it simpler to search and retrieve harmonized data and metadata from the GDC, and can reduce both the learning curve and staffing needs, while providing indispensable features such as:
- Turnkey creation of date-stamped snapshots of data
- Aggregating multiple samples into a single bolus for ready consumption by scientific algorithms
- Ensuring that samples are identifiable by project (e.g. restoring TCGA IDs to SNP6 segments)
- Sample report and sample freeze list (load file) creation, for either on-premises or cloud storage (e.g. Google)
- Aggregate cohort construction (e.g. combining TCGA STAD + ESCA cohorts into STES, with just one line in a config file)
- Retrieving an entire project or just a single case, with equal ease
- Easily combining data across multiple projects (e.g. TCGA and CPTAC)
This is all available within a well-tested object-oriented framework that is easy for users to comprehend and extend. GDCtools is online at https://github.com/broadinstitute/gdctools, and includes documentation, examples and a pictorial overview.
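
To give a sense of the web-level detail that such a wrapper hides, here is a minimal sketch of the query parameters one would assemble by hand to list the cases of an aggregate cohort through the public GDC REST API. It is not GDCtools code and does not reflect its config syntax: the `AGGREGATES` table, the cohort name `TCGA-STES`, and the helper names `expand_cohort` and `build_cases_query` are all hypothetical, chosen only for illustration. The sketch builds the request but does not send it.

```python
import json

# Hypothetical aggregate-cohort table (not GDCtools' config format):
# one entry maps a combined cohort name to its member GDC projects.
AGGREGATES = {"TCGA-STES": ["TCGA-STAD", "TCGA-ESCA"]}


def expand_cohort(name, aggregates=AGGREGATES):
    """Return the GDC project ids behind a cohort name.

    A name that is not an aggregate is treated as a single project.
    """
    return list(aggregates.get(name, [name]))


def build_cases_query(cohort, fields=("submitter_id", "project.project_id"), size=100):
    """Build query parameters for the GDC /cases endpoint for one cohort."""
    # GDC filters are JSON objects of the form {"op": ..., "content": {...}}.
    filters = {
        "op": "in",
        "content": {
            "field": "project.project_id",
            "value": expand_cohort(cohort),
        },
    }
    return {
        "filters": json.dumps(filters),  # sent as a JSON-encoded string
        "fields": ",".join(fields),      # comma-separated list of fields
        "format": "JSON",
        "size": str(size),               # maximum number of records returned
    }


params = build_cases_query("TCGA-STES")
print(json.dumps(params, indent=2))
```

These parameters could then be passed to an HTTP client, for example `requests.get("https://api.gdc.cancer.gov/cases", params=params)`. Paging through results, downloading files and mapping identifiers back to samples would still remain, which is the kind of work the tools described above are meant to take on.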