Rationale
A major emphasis in the GDC design is to facilitate cross-study data search and aggregation. To do this effectively, sets of baseline metadata elements, whether associated with clinical, biospecimen, or molecular data, must be chosen and assigned standard meanings. These baseline data elements must then be described precisely and promulgated to data submitters, who can then express their relevant data in these terms when submitting to the GDC. Data users may then select and filter analysis datasets within and across different disease cohorts, tumor studies, or other disparate groupings according to fields and values in these baseline metadata sets. Users should have relative confidence that, along the dimensions they selected, variability within their aggregated data cohorts or datasets will be low compared to other signals they may identify. That is, batch effects based on semantic differences between studies will be reduced, at least with respect to the baseline data elements.
Baseline data elements are also selected and defined to provide nominal "minimum data requirements" for submission to the GDC. Clinical data is arguably the most important metadata that can accompany cancer genomic data. One can imagine that the larger the baseline set of clinical elements, the greater the potential for clinically relevant results to emerge from tumor-normal genomic comparisons. On the other hand, data submissions will decrease with increasing numbers of required elements, and semantic differences among projects (and the resulting batch effects) are likely to become larger as well. Both eventualities will reduce power to observe clinically relevant effects across studies.
With this tradeoff in mind, the GDC Data Model Working Group developed a strategy to select an initial set of common data elements that would apply across many tumor types, possess well-established semantic information, and pass expert subjective review. We established working policies for consistently calculating and representing the values of certain terms (particularly time-related terms). We also developed a working process for accepting data that may not be complete with respect to the minimal clinical set proposed. This list of elements forms the set of baseline clinical elements for which the GDC will explicitly request data from submitters. This is also the set of clinical elements the GDC will index across all projects to enable GDC-wide clinical search and data filtering.
Methods
Clinical Data Elements
The Center for Cancer Genomics provided an initial set of three absolute requirements: Age, Diagnosis, and Gender of study participants. A policy was established that a case may not be defined in the GDC system without values for each of these elements.
We determined the intersection of clinical data elements (CDEs) common to all TCGA and TARGET projects, and added the resulting eight elements to the GDC baseline set of CDEs. These elements included:
- Gender
- Race
- Ethnicity
- Vital Status
- Year of Diagnosis
- Age at Diagnosis
- Year of Last Followup/Contact, and
- Survival Time.
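The intersection step described above can be sketched as a set operation. This is a minimal illustration only: the project names and element lists below are hypothetical placeholders, not the actual TCGA or TARGET element inventories.

```python
# Hypothetical per-project element sets (assumption for illustration only);
# real projects define many more elements.
project_elements = {
    "project_a": {"Gender", "Race", "Vital Status", "Tumor Stage"},
    "project_b": {"Gender", "Race", "Vital Status", "Histology"},
    "project_c": {"Gender", "Race", "Vital Status", "Tumor Stage", "Histology"},
}


def common_elements(elements_by_project):
    """Return the elements that appear in every project."""
    sets = list(elements_by_project.values())
    if not sets:
        return set()
    return set.intersection(*sets)


shared = common_elements(project_elements)
print(sorted(shared))
```

Elements used by only some projects (here "Tumor Stage" and "Histology") drop out, which is why the intersection yields a small but universally available set.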
To help manage CDEs in the GDC data model and to provide a framework for adding additional elements, we then established the following clinical element categories:
- Demographics
- Diagnosis
- Family History
- Exposure, and
- Treatment
and set an initial goal to identify 5-10 terms in each category to add to the GDC baseline clinical element set. We used terms required across studies in either the TARGET or TCGA programs as an initial pool of elements. These terms have the advantage of having standard definitions and value sets defined in the NCI Cancer Data Standards Registry and Repository (caDSR). Based on CCG Program Office experience in handling requests for clinical data from these programs, we narrowed this set to those elements with the highest apparent value to research working groups.
We vetted the resulting set of elements with CCG clinical and clinical data experts, incorporated their feedback, and sent the final draft set to two external experts (one within NCI, one outside NCI). After incorporating external feedback, we had identified a baseline GDC clinical data set of 39 elements.
Element Definitions and Values
There are numerous complications involved in harmonizing clinical data items across different studies. Certain terms reflect identical concepts, but term values may be expressed in different units of measure; for example, TARGET studies record Age at Diagnosis in days, while TCGA studies record this information in years. Other terms, such as Tumor Stage, have values dependent on an underlying scale or model that may not be reflected directly in the data as submitted. Absolute dates (i.e., year, month, and day) of clinical events are considered Personally Identifiable Information, and so must be converted into periods of time relative to some index date before storing the data at the GDC. This index date must be selected and consistently adhered to when calculating and storing the relevant time intervals.
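As a small illustration of the unit problem, the sketch below normalizes Age at Diagnosis to a single unit. The choice of days as the common unit, the conversion factor of 365.25 days per year, and the function name are all assumptions made for this example; the report does not specify them.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years


def age_at_diagnosis_in_days(value, unit):
    """Convert an age recorded in 'days' or 'years' to whole days."""
    if unit == "days":
        return int(round(value))
    if unit == "years":
        return int(round(value * DAYS_PER_YEAR))
    raise ValueError(f"unsupported unit: {unit!r}")


# A value recorded in years and one recorded in days end up comparable.
print(age_at_diagnosis_in_days(45, "years"))
print(age_at_diagnosis_in_days(1200, "days"))
```

Note that converting years to days cannot recover precision that was never recorded, so values originating in years remain accurate only to within a year.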
For the terms we selected, we associated definitions and accepted values and/or data types with each term, drawing on established public clinical data authorities (the caDSR CDE Browser and the NCI Thesaurus) where possible. These definitions and accepted values are expressed in a computable format (JSON Schema-valid JSON) that the system can reference to perform input data validation and to provide users with standard semantic information through the GDC interfaces. The set of JSON files forms a "glossary" for the GDC that represents the latest GDC-adopted semantic information for its clinical data elements.
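To make the idea of a computable glossary entry concrete, here is a rough sketch of what one entry and a validation check might look like. The term name, description, and accepted values are hypothetical, and the checker handles only the `type` and `enum` keywords; it is not a full JSON Schema validator and is not the GDC's implementation.

```python
import json

# Hypothetical glossary entry (assumption): name and values are illustrative.
VITAL_STATUS_SCHEMA = json.loads("""
{
  "title": "vital_status",
  "description": "Placeholder definition text.",
  "type": "string",
  "enum": ["alive", "dead", "unknown"]
}
""")

# Mapping from a few JSON Schema type names to Python types.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_value(value, schema):
    """Return a list of error messages; an empty list means the value passed."""
    errors = []
    expected = schema.get("type")
    if expected in JSON_TYPES:
        type_ok = isinstance(value, JSON_TYPES[expected])
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and expected != "boolean":
            type_ok = False
        if not type_ok:
            errors.append(f"expected type {expected}, got {type(value).__name__}")
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{value!r} is not an accepted value")
    return errors


print(validate_value("alive", VITAL_STATUS_SCHEMA))
print(validate_value("deceased", VITAL_STATUS_SCHEMA))
```

Keeping the definition, data type, and accepted values in one document lets the same file drive both validation and the semantic information shown to users.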
We determined that all absolute dates would be represented in the GDC as intervals of time from the date of initial diagnosis. The actual date of diagnosis would never be stored at the GDC.
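The conversion from absolute dates to intervals relative to the date of initial diagnosis might look like the following sketch. The field names, the output naming scheme, and the use of days as the interval unit are assumptions for illustration.

```python
from datetime import date


def to_relative_intervals(record, index_field="date_of_diagnosis"):
    """Replace every absolute date with days elapsed since the index date.

    The index date itself is dropped so that it is never stored.
    """
    index_date = record[index_field]
    result = {}
    for field, value in record.items():
        if field == index_field:
            continue  # never persist the actual date of diagnosis
        if isinstance(value, date):
            result[field + "_days_from_diagnosis"] = (value - index_date).days
        else:
            result[field] = value  # non-date fields pass through unchanged
    return result


# Hypothetical submitted record (assumption).
submitted = {
    "date_of_diagnosis": date(2010, 3, 15),
    "date_of_last_followup": date(2012, 3, 15),
    "gender": "female",
}
print(to_relative_intervals(submitted))
```

Events that precede diagnosis come out as negative intervals, which preserves their ordering relative to the index date without revealing any calendar date.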
For certain elements, additional accepted values (such as "Not Able to Collect" for terms that in some jurisdictions may be unlawful to collect from patients) were added to the standardized set of accepted values. These additions are persisted in the JSON files referred to above.
Required, Preferred, and Optional Elements
Each term was denoted either "required", "preferred", or "optional". The GDC intends to encourage submission of clinical and molecular data from individual investigators and small collaborations who may not have access to extensive clinical data on their cases. Because of this, only three terms (Age, Gender, and Diagnosis) are absolutely required to have actual values in submissions. The remaining "required" fields must be present in submission material, but the value "NA" will be valid. Terms denoted "preferred" are strongly encouraged; terms denoted "optional" need not be present in submission material.
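The submission rules above can be expressed as a simple check. In this sketch, only the three strictly required terms come from the text; the other term names and their assigned levels are hypothetical, as is the treatment of empty strings and missing values as equivalent to "NA".

```python
# From the text: these three terms must carry actual values.
STRICTLY_REQUIRED = {"age", "gender", "diagnosis"}

# Hypothetical level assignments (assumption) for a handful of terms.
REQUIREMENT_LEVEL = {
    "age": "required",
    "gender": "required",
    "diagnosis": "required",
    "vital_status": "required",
    "race": "preferred",
    "exposure_history": "optional",
}


def check_submission(record):
    """Return a list of problems with a submitted clinical record."""
    errors = []
    for term, level in REQUIREMENT_LEVEL.items():
        if level != "required":
            continue  # preferred and optional terms may be absent
        if term not in record:
            errors.append(f"{term}: required field is missing")
        elif term in STRICTLY_REQUIRED and record[term] in ("NA", "", None):
            errors.append(f"{term}: an actual value is required")
    return errors


print(check_submission(
    {"age": 61, "gender": "female", "diagnosis": "example", "vital_status": "NA"}
))
print(check_submission({"age": "NA", "gender": "female", "diagnosis": "example"}))
```

The first record passes because "NA" is acceptable for a required field outside the strict three; the second fails on both the "NA" age and the absent `vital_status` field.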
Discussion and Next Steps
We have established a baseline set of clinical data elements, with definitions, accepted values, and submission and persistence protocols, to be used for GDC-wide searching and data filtering and to inform clinical data submission and validation at the GDC. The GDC Data Model Working Group will extend and otherwise modify this baseline set as necessary over time, with assistance and advice from NCI and external clinical experts, in a manner similar to that described in this report.
Additional sets of elements are needed. In particular, the means to express longitudinal follow-up data will be devised and implemented, and disease-specific vocabularies will also be developed.
Additional standard operating procedures (SOPs) for updating the clinical element glossary and for curating submitted data with respect to the glossary are also required and will be developed.
GDC Data Model Working Group
- Allison Heath, PhD
- Josh Miller
- Junjun Zhang, PhD
- Greg Korzeniewski, PhD
- Sharon Gaheen
- Himanso Sahni
- Mark Jensen, PhD
Internal Advisors
- Louis B. Staudt, MD, PhD
- Jean-Claude Zenklusen, PhD
- Daniela Gerhard, PhD
- Zhining Wang, PhD
- Martin Ferguson, PhD
External Advisors
- Stephen Chanock, MD
- Douglas Levine, MD, PhD