The GDC for TCGA Data Access Matrix Users

The purpose of this document is to help users familiar with the TCGA Data Access Matrix locate data in the GDC.

The TCGA Data Portal provided the Data Access Matrix, which enabled users to build archives of desired data files. Within a given cancer project, a user could filter files based on data type, genomic characterization center, sample type, data level, sample ID, data access tier, and date submitted. Data archives matching the filter were displayed in a matrix in which users could view the availability of data types for samples and further refine the selection. To retrieve the data, the Data Access Matrix application built an archive containing the files requested and made this available to users at a link provided in the browser and by email. Downloaded archives were organized in a directory structure that separated files according to data type. The download included metadata and annotations that described the relationships between files and samples and additional information concerning the cases.

The Genomic Data Commons makes TCGA data available via two data portals: the GDC Data Portal and the GDC Legacy Archive.

The GDC Legacy Archive has much of the functionality of the TCGA Data Access Matrix and provides access to all TCGA data previously stored in the TCGA Data Portal, including array-based analysis data, MAF and VCF files, and clinical and biospecimen data.

The GDC Data Portal provides access to the subset of TCGA data that has been harmonized by the GDC using its data generation and harmonization pipelines. TCGA data in the GDC Data Portal includes BAM files aligned to the latest human genome build (GRCh38), VCF files containing variants called by the GDC, and RNA-Seq expression data harmonized by the GDC.

Overview

The TCGA Data Portal organized data according to sets of center-uploaded files called archives. The Data Access Matrix allowed users to select archives based on metadata describing the cases, samples, and data types desired. The matrix view allowed users to select the archives for download. The downloaded archives were provided in a single file in tar or tar.gz format via a link to the TCGA Data Portal.

The GDC Legacy Archive, in contrast, provides access at the level of the individual data file. The user accumulates files for download via a shopping cart mechanism, adding files returned by a query to the cart. When the cart contains all desired files, the user may download them via the web browser directly from the portal. For large (> 5 Gb) downloads, the user retrieves a manifest (a list of files) and uses it with the standalone GDC Data Transfer Tool to retrieve the actual files. The Data Transfer Tool outputs tar archives containing the files requested.

Filtering Data Files for Project and Other Metadata

The Data Access Matrix query start page provided dropdown menus for the initial archive filters. The GDC Legacy Archive has two tabs for faceted search, Cases and Files, that allow the user to filter files along many of the Data Access Matrix query fields.

In the GDC Cases tab, check "TCGA" in the Cancer Program facet. Facets available in the Cases and Files tabs map roughly to the Data Access Matrix menus as follows:

Data Access Matrix Menu | Facet(s) in GDC Cases Tab | Facet(s) in GDC Files Tab | Note
--- | --- | --- | ---
Disease | Primary Site, Project, Disease Type | |
Data Type | | Data Category, Data Type, Experimental Strategy, Data Format |
Center/Platform | | Platform |
Access Tier | | Access Level |
Batch Number | NA | NA | This field is not indexed in the GDC Legacy Archive
Sample ID (+ add row/sample list) | Case Submitter ID Prefix | | Partial functionality; see below
Tumor/Normal | sample_type (custom facet) | | See below
Data Level | NA | NA |
Submitted Since/Up To | NA | NA | Not implemented in the GDC Legacy Archive
Availability | NA | NA | All data returned by searches in the GDC Legacy Archive are available for download

Selecting Specific Data Types and Levels

In the TCGA Data Access Matrix, data types could be selected in the initial query, and further refined in the matrix view. Data level 1, 2, and 3 could also be selected from a menu in the initial query page. The basis of these groupings was the organization of TCGA archives; each archive was specified with a platform, data type, and data level.

The GDC Legacy Archive organizes individual files with more flexibility. The platform field is available as a Files tab facet. Data types are more granular in the GDC Legacy Archive, and files can also be selected along several axes: Data Category, Data Type, Experimental Strategy, and Data Format (file extension). Users having trouble selecting a desired set of files can contact GDC Support.

At the TCGA DCC, a data level was defined as a specific combination of processing level (raw, normalized, integrated) and access level (controlled or open access). Thus, for example, a MAF file containing germline variants was classified as Level 2, while a MAF file containing only validated somatic variants was classified as Level 3. Rather than displaying original data levels, the GDC Legacy Archive facets organize file content and access level separately. For example, selecting "MAF" under "Data Format" in the Files tab displays all MAF files, some of which are controlled and some open, as indicated in the Access column of the table view. The Access Level facet can further filter this set to provide only controlled or only open access MAFs.
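The same MAF selection can be expressed as a query filter for the GDC API. The following is a minimal sketch that only builds the filter as JSON and does not contact any server. The field names `cases.project.program.name`, `data_format`, and `access` and the nested `op`/`content` layout are assumptions based on the general GDC API filter syntax; the helper names `in_filter` and `maf_filter` are hypothetical. Check the field names against the GDC API User's Guide before use.

```python
import json


def in_filter(field, values):
    """Return one 'in' clause: the field must match any of the given values."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def maf_filter(access=None):
    """Combine clauses with 'and': TCGA program, MAF format, optional access level.

    Field names are assumptions (see the lead-in), not confirmed by this document.
    """
    clauses = [
        in_filter("cases.project.program.name", ["TCGA"]),
        in_filter("data_format", ["MAF"]),
    ]
    if access is not None:
        # Mirrors the Access Level facet: "open" or "controlled".
        clauses.append(in_filter("access", [access]))
    return {"op": "and", "content": clauses}


if __name__ == "__main__":
    # Print the filter for open-access MAF files only.
    print(json.dumps(maf_filter("open"), indent=2))
```

Calling `maf_filter()` without an argument corresponds to selecting only "MAF" under "Data Format"; passing "open" or "controlled" corresponds to additionally applying the Access Level facet.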

Selecting Filtered Files for Specific Samples

The TCGA Data Access Matrix allowed users to use the matrix view to select files for custom sets of samples. The GDC Legacy Archive does not implement a method to filter by a custom set of samples; downloads will consist of files for all samples meeting the criteria selected in the Cases tab. The Cases tab includes facets for certain clinical features (Gender, Age at Diagnosis, Days to Death, Race, and Ethnicity) that may assist some users.

Based on the limited use of this feature in the TCGA Data Access Matrix, the GDC team expects that this will not present a problem for most GDC Legacy Archive users. We welcome your feedback and questions: please contact GDC Support.
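Users who do need a custom sample set could narrow the file list locally after retrieving file metadata. The sketch below is one possible approach, not a GDC feature. It assumes each metadata record has been reduced to a dictionary with the hypothetical keys `file_name` and `barcodes` (a list of TCGA aliquot barcodes); the real File Metadata layout may differ and would need to be mapped to this shape first. It relies on the TCGA barcode convention in which the first 15 characters identify the sample (for example `TCGA-02-0001-01`).

```python
def sample_id(barcode):
    """Truncate a TCGA barcode to its sample portion (first 15 characters)."""
    return barcode[:15]


def select_files(records, wanted_samples):
    """Return file names whose barcodes include at least one wanted sample.

    `records` uses hypothetical keys 'file_name' and 'barcodes'.
    """
    wanted = set(wanted_samples)
    return [
        rec["file_name"]
        for rec in records
        if any(sample_id(b) in wanted for b in rec["barcodes"])
    ]


if __name__ == "__main__":
    # Illustrative records only; these file names are made up.
    records = [
        {"file_name": "a.maf", "barcodes": ["TCGA-02-0001-01C-01D-0182-01"]},
        {"file_name": "b.maf", "barcodes": ["TCGA-02-0003-10A-01D-0182-01"]},
    ]
    print(select_files(records, ["TCGA-02-0001-01"]))
```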

Selecting Files for Tumor or Normal Samples

The GDC Legacy Archive allows filtering for tumor or normal sample data files using a custom facet. To do this, add a custom Cases filter using the field samples.sample_type.
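As a rough illustration, a filter on `samples.sample_type` could be built as follows. The field name comes from the text above; the sample-type labels are assumptions based on TCGA sample-type names and are not an exhaustive list, and the `op`/`content` layout is assumed from the general GDC API filter syntax. The helper name `sample_type_filter` is hypothetical.

```python
# Assumption: labels follow TCGA sample-type names; the lists are not exhaustive.
SAMPLE_TYPES = {
    "tumor": ["Primary Tumor", "Recurrent Tumor", "Metastatic"],
    "normal": ["Solid Tissue Normal", "Blood Derived Normal"],
}


def sample_type_filter(kind):
    """Build an 'in' clause on samples.sample_type for 'tumor' or 'normal'."""
    if kind not in SAMPLE_TYPES:
        raise ValueError(f"kind must be one of {sorted(SAMPLE_TYPES)}")
    return {
        "op": "in",
        "content": {
            "field": "samples.sample_type",
            "value": list(SAMPLE_TYPES[kind]),
        },
    }


if __name__ == "__main__":
    print(sample_type_filter("normal"))
```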

Downloading Files

The TCGA Data Access Matrix created downloadable archives based on user-selected archives. The application provided a link, shown in the browser or sent by email, that the user opened in a browser to retrieve the files.

The GDC Legacy Archive assembles selected files in a download cart and offers two ways to retrieve them: a direct download from the Cart page, or a download via the standalone GDC Data Transfer Tool. Once the desired files are displayed in the Files Table, click the "Add all files to the Cart" button. Move to the download cart by clicking "Cart" at the upper right corner of the page. Manual adjustments to the selection may be made on the Cart page.

To download, click the "Download" button and select "Cart" from the options. This will start the download of the tar-archived files via the web browser. Metadata that describe the associations of each file with cases, samples, aliquots, source center, and other entities can (and should) also be retrieved by clicking the "File Metadata" option.

Cart Limits

For technical reasons, direct download from the cart via the browser is limited to 10,000 files and/or 5 Gb of data. This limit can be avoided by using the GDC Data Transfer Tool with a manifest file. A manifest file can be retrieved directly from the Files Table view (which allows more than 10,000 files to be retrieved), or from the Cart page. Please see the Data Transfer Tool User Guide for detailed instructions to retrieve files and metadata.
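The limits above can be checked before choosing a download method. The following sketch reads file sizes from a manifest and reports whether the Data Transfer Tool is needed. It assumes the manifest is a tab-separated file with a header row that includes a `size` column in bytes, and it reads "5 Gb" as 5 x 10^9 bytes; both are assumptions, and the function names are hypothetical.

```python
import csv
import io

MAX_FILES = 10_000        # browser cart limit on number of files
MAX_BYTES = 5 * 10**9     # assumption: "5 Gb" read as 5 * 10^9 bytes


def manifest_sizes(manifest_text):
    """Return file sizes from tab-separated manifest text.

    Assumes a header row with a 'size' column holding bytes.
    """
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    return [int(row["size"]) for row in reader]


def needs_transfer_tool(sizes):
    """True if the selection exceeds either browser download limit."""
    sizes = list(sizes)
    return len(sizes) > MAX_FILES or sum(sizes) > MAX_BYTES


if __name__ == "__main__":
    # Illustrative manifest text; identifiers and checksums are placeholders.
    text = (
        "id\tfilename\tmd5\tsize\tstate\n"
        "id-1\tx.maf\tmd5-1\t1200\tvalidated\n"
        "id-2\ty.bam\tmd5-2\t6000000000\tvalidated\n"
    )
    print(needs_transfer_tool(manifest_sizes(text)))
```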

Other Features

The GDC Legacy Archive supports advanced queries via the GDC API. Please review the GDC API User's Guide and contact GDC Support if you have any questions or feedback.