Selecting Common Cross-Study Clinical Data Elements

GDC Data Model Working Group

Rationale

A major emphasis in the GDC design is to facilitate cross-study data search and aggregation. To do this effectively, sets of baseline metadata elements, whether associated with clinical, biospecimen, or molecular data, must be chosen and assigned standard meanings. These baseline data elements must then be described precisely and promulgated to data submitters, who can then express their relevant data in these terms when submitting to the GDC. Data users may then select and filter analysis datasets within and across different disease cohorts, tumor studies, or other disparate groupings according to fields and values in these baseline metadata sets. Users should have relative confidence that, along the dimensions they selected, variability within their aggregated data cohorts or datasets will be low compared to other signals they may identify. That is, batch effects based on semantic differences between studies will be reduced, at least with respect to the baseline data elements.

Baseline data elements are also selected and defined to provide nominal "minimum data requirements" for submission to the GDC. Clinical data is arguably the most important metadata that can accompany cancer genomic data. One can imagine that the larger the baseline set of clinical elements, the greater the potential for clinically relevant results to emerge from tumor-normal genomic comparisons. On the other hand, data submissions will decrease with increasing numbers of required elements, and semantic differences among projects (and the resulting batch effects) are likely to become larger as well. Both eventualities will reduce power to observe clinically relevant effects across studies.

With this tradeoff in mind, the GDC Data Model Working Group developed a strategy to select an initial set of common data elements that would apply across many tumor types, possess well-established semantic information, and pass expert subjective review. We established working policies for consistently calculating and representing the values of certain terms (particularly time-related terms). We also developed a working process for accepting data that may not be complete with respect to the minimal clinical set proposed. This list of elements forms the set of baseline clinical elements for which the GDC will explicitly request data from submitters. This is also the set of clinical elements the GDC will index across all projects to enable GDC-wide clinical search and data filtering.

Methods

Clinical Data Elements

The Center for Cancer Genomics provided an initial set of three absolute requirements: Age, Diagnosis, and Gender of study participants. A policy was established that a case may not be defined in the GDC system without values for each of these elements.

We determined the intersection of clinical data elements (CDE) common across all TCGA projects and TARGET projects, and added the resulting eight elements to the GDC baseline set of CDEs. These elements included:

  • Gender
  • Race
  • Ethnicity
  • Vital Status
  • Year of Diagnosis
  • Age at Diagnosis
  • Year of Last Followup/Contact, and
  • Survival Time.

To help manage CDEs in the GDC data model and to provide a framework for adding additional elements, we then established the following clinical element categories:

  • Demographics
  • Diagnosis
  • Family History
  • Exposure, and
  • Treatment

and set an initial goal to identify 5-10 terms in each category to add to the GDC baseline clinical element set. We used terms required across studies in either TARGET or TCGA programs as an initial pool of elements. These terms have the advantage of having standard definitions and value sets defined in the NCI Cancer Data Standards Registry and Repository (caDSR). Based on CCG Program Office experience in handling requests for clinical data of these programs, we narrowed this set to those elements with the highest apparent value to research working groups.

We vetted the resulting set of elements with CCG clinical and clinical data experts, incorporated their feedback, and sent the final draft set to two external experts (one within NCI, one outside NCI). After incorporating external feedback, we had identified a baseline GDC clinical data set of 39 elements.

Element Definitions and Values

There are numerous complications involved in harmonizing clinical data items across different studies. Certain terms reflect identical concepts, but term values may be expressed in different units of measure; for example, TARGET studies record Age at Diagnosis in days, while TCGA studies record this information in years. Other terms, such as Tumor Stage, have values dependent on an underlying scale or model that may not be reflected directly in the data as submitted. Absolute dates (i.e., year, month, and day) of clinical events are considered Personally Identifiable Information, and so must be converted into periods of time relative to some index date before storing the data at the GDC. This index date must be selected and consistently adhered to when calculating and storing the relevant time intervals.
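The unit mismatch described above can be illustrated with a minimal sketch. It assumes a hypothetical helper, age_at_diagnosis_years, and a conversion factor of 365.25 days per year; neither the function name nor the factor is taken from the GDC system.

```python
# Assumption: 365.25 days per year is used for illustration only;
# the conversion factor actually adopted by the GDC is not stated here.
DAYS_PER_YEAR = 365.25


def age_at_diagnosis_years(value: float, unit: str) -> float:
    """Return Age at Diagnosis in years, whatever unit the study recorded.

    `unit` is "days" (as recorded by TARGET studies) or "years"
    (as recorded by TCGA studies).
    """
    if unit == "years":
        return float(value)
    if unit == "days":
        return value / DAYS_PER_YEAR
    raise ValueError(f"unsupported unit for age at diagnosis: {unit!r}")


# A TARGET-style value in days and a TCGA-style value in years
# end up on the same scale.
print(age_at_diagnosis_years(7305, "days"))
print(age_at_diagnosis_years(20, "years"))
```

Rejecting unknown units, rather than guessing, keeps a semantic difference between studies from silently entering the harmonized data.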

For the terms we selected, we associated definitions and accepted values and/or data types with each term, using established public clinical data authorities (the caDSR CDEBrowser and the NCI Thesaurus) where possible. These definitions and accepted values are expressed in computable formats (JSON Schema-valid JSON) that the system can reference to perform input data validation and to provide users with standard semantic information through the GDC interfaces. The set of JSON files forms a "glossary" for the GDC that represents the latest GDC-adopted semantic information for its clinical data elements.
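As an illustration of how such a glossary could drive input validation, the sketch below defines two entries and checks values against them. The key names (description, enum, type, cde_id, source) and the validate function are assumptions made for this example, not the layout of the actual GDC JSON files. A production system would more likely use a full JSON Schema validator; only the "enum" and "type": "number" keywords are handled here.

```python
import json

# Hypothetical glossary entries; definitions and values are copied from the
# baseline table, but the JSON layout itself is an assumption.
GLOSSARY_JSON = """
{
  "vital_status": {
    "description": "The state or condition of being living or deceased; also includes the case where the vital status is unknown.",
    "enum": ["alive", "dead", "lost to follow-up", "unknown"],
    "cde_id": "C25717",
    "source": "NCIt"
  },
  "height": {
    "description": "The height of the patient in centimeters.",
    "type": "number",
    "cde_id": "649",
    "source": "caDSR"
  }
}
"""

GLOSSARY = json.loads(GLOSSARY_JSON)


def validate(term: str, value) -> list:
    """Return a list of error messages; an empty list means the value is valid."""
    entry = GLOSSARY.get(term)
    if entry is None:
        return [f"{term}: not a glossary term"]
    errors = []
    if "enum" in entry and value not in entry["enum"]:
        errors.append(f"{term}: {value!r} is not an accepted value")
    if entry.get("type") == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{term}: {value!r} is not a number")
    return errors


print(validate("vital_status", "alive"))
print(validate("vital_status", "deceased"))
print(validate("height", "tall"))
```

Because the same entry carries both the definition and the accepted values, one file can serve validation and the semantic information shown to users.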

We determined that all absolute dates would be represented in the GDC as intervals of time from the date of initial diagnosis. The actual date of diagnosis would never be stored at the GDC.
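A minimal sketch of this conversion follows. The field-naming scheme (date_of_* in, days_to_* out), the function names, and the sign convention (events before diagnosis yield negative values) are assumptions for illustration; the report does not specify them.

```python
from datetime import date


def days_from_diagnosis(event: date, diagnosis: date) -> int:
    """Signed number of days from the index date (diagnosis) to the event.

    Assumption: events that precede diagnosis, such as birth, are negative.
    """
    return (event - diagnosis).days


def to_relative_intervals(record: dict, index_field: str = "date_of_diagnosis") -> dict:
    """Replace every absolute date in `record` with an interval in days.

    The index date itself is dropped, so the output contains no absolute dates.
    """
    index = record[index_field]
    out = {}
    for key, value in record.items():
        if key == index_field:
            continue  # the actual date of diagnosis is never stored
        if isinstance(value, date):
            prefix = "date_of_"
            name = key[len(prefix):] if key.startswith(prefix) else key
            out["days_to_" + name] = days_from_diagnosis(value, index)
        else:
            out[key] = value
    return out


submitted = {
    "date_of_diagnosis": date(2010, 3, 15),
    "date_of_birth": date(1950, 3, 15),
    "date_of_last_followup": date(2011, 3, 15),
    "gender": "Female",
}
print(to_relative_intervals(submitted))
```

Using a single index date for every interval is what keeps the stored values comparable across cases and studies.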

For certain elements, additional accepted values (such as "Not Able to Collect" for terms that in some jurisdictions may be unlawful to collect from patients) were added to the standardized set of accepted values. These additions are persisted in the JSON files referred to above.

Required, Preferred, and Optional Elements

Each term was denoted "required", "preferred", or "optional". The GDC intends to encourage submission of clinical and molecular data from individual investigators and small collaborations who may not have access to extensive clinical data on their cases. Because of this, only three terms (Age, Gender, and Diagnosis) are absolutely required to have actual values in submissions. The remaining "required" fields must be present in submission material, but the value "NA" will be valid. Terms denoted "preferred" are strongly encouraged; terms denoted "optional" need not be present in submission material.
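These rules can be sketched as a simple submission check. The field keys, the small LEVELS mapping, and the check_submission function are hypothetical and cover only a handful of terms; treating a missing "preferred" term as a warning rather than an error is also an assumption.

```python
# Hypothetical keys for the three terms that must carry actual values.
ABSOLUTELY_REQUIRED = {"age", "gender", "diagnosis"}

# Illustrative subset of terms and their requirement levels.
LEVELS = {
    "age": "required",
    "gender": "required",
    "diagnosis": "required",
    "vital_status": "required",
    "days_to_birth": "preferred",
    "height": "optional",
}


def check_submission(record: dict):
    """Return (errors, warnings) for one submitted case record."""
    errors, warnings = [], []
    for term, level in LEVELS.items():
        present = term in record
        value = record.get(term)
        if level == "required":
            if not present:
                errors.append(f"{term}: required field is missing")
            elif term in ABSOLUTELY_REQUIRED and value in ("NA", None, ""):
                errors.append(f"{term}: an actual value is required")
            # Any other required field may carry the value "NA".
        elif level == "preferred" and not present:
            warnings.append(f"{term}: preferred field is missing")
        # Optional terms need not be present, so nothing is reported.
    return errors, warnings


errors, warnings = check_submission(
    {"age": 54, "gender": "Female", "diagnosis": "C50.9", "vital_status": "NA"}
)
print(errors, warnings)
```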

Results

The initial baseline clinical set is given in the following table.

Clinical Data Elements

Term | Category | Definition | Accepted Values/Type | CDE (Collection) | Required / Preferred / Optional
Age At Diagnosis | Diagnosis | The numerical response indicating the age of a person, expressed in years, when they were diagnosed with a disease or disorder. | number (units: years) | 4828691 (caDSR) | Optional
Alcohol History | Exposure | A response to a question that asks whether the participant has consumed at least 12 drinks of any kind of alcoholic beverage in their lifetime. | 'Don't Know/Not Sure', 'No', 'Yes' | 2201918 (caDSR) | Optional
Alcohol Intensity | Exposure | Category to describe the patient's current level of alcohol use as self-reported by the patient. | 'None', 'Not Evaluated', 'Daily Drinker', 'Occasional Drinker', 'Social Drinker', 'Weekly Drinker' | 3457767 (caDSR) | Optional
BMI | Exposure | The body mass divided by the square of the body height expressed in units of kg/m². | number (units: kilograms per square meter) | 4973892 (caDSR) | Optional
Cigarettes Per Day | Exposure | The average number of cigarettes smoked per day. | number (units: cigarettes) | 2001716 (caDSR) | Optional
Classification Of Tumor | Diagnosis | The point in clinical disease progression at which a tumor sample was taken. | 'Primary', 'Metastasis', 'Recurrence', 'Other' | 808b7f0 (GDC) | Required
Days To Birth | Diagnosis | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days. | number (units: days) | 3008233 (caDSR/TCGA) | Preferred
Days To Last Followup | Diagnosis | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days. | number (units: days) | 3008273 (caDSR) | Required
Days To Last Known Disease Status | Diagnosis | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days. | number (units: days) | 3008273 (caBIG) | Required
Days To Recurrence | Diagnosis | Time interval from the date of disease recurrence to the date of initial pathologic diagnosis, represented as a calculated number of days. | number (units: days) | 3008295 (caDSR) | Required
Days To Treatment | Treatment | Number of days from date of initial pathologic diagnosis that treatment began. | number (units: days) | 56294eaf (GDC) | Optional
Ethnicity | Demographic | Classification based on social groupings that are characterized by a distinctive social and cultural tradition maintained from generation to generation, a common history and origin and a sense of identification with the group; members of the group have distinctive features in their way of life, shared experiences and often a common genetic heritage; these features may be reflected in their experience of health and disease. | 'hispanic or latino', 'not hispanic or latino', 'not reported', 'not allowed to collect' | 2192213 (caDSR) | Required
Gender | Demographic | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal roles. | 'Female', 'Male', 'Unknown', 'Unspecified' | 2200604 (caDSR) | Required
Height | Exposure | The height of the patient in centimeters. | number (units: centimeters) | 649 (caDSR) | Optional
Last Known Disease Status | Diagnosis | The state or condition of an individual's neoplasm at a particular point in time. | 'Biochemical evidence of disease without structural correlate', 'Distant met recurrence/progression', 'Loco-regional recurrence/progression', 'Tumor free', 'Unknown tumor status', 'With tumor' | 2759550 (caDSR) | Required
Morphology | Diagnosis | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries for coding the site (topography) and the histology (morphology) of neoplasms. The study of the structure of the cells and their arrangement to constitute tissues and, finally, the association among these to form organs. In pathology, the microscopic process of identifying normal and abnormal morphologic characteristics in tissues, by employing various cytochemical and immunocytochemical stains. A system of numbered categories for representation of data. | string (Codes as provided at ICD-O-3 Online) | 3226275 (caDSR) | Required
Primary Diagnosis | Diagnosis | The investigation, analysis and recognition of the presence and nature of disease, condition, or injury from expressed signs and symptoms; also, the scientific determination of any kind; the concise results of such an investigation. The term primary indicates that this is the initial diagnosis made or provided in the context of the associated study. | string (ICD-10 code) | C15220 (NCIt) | Required
Prior Malignancy | Diagnosis | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence. GDC: allow yes/no/unknown as accepted values. | 'Yes', 'No', 'Unknown' | 3382736 (caDSR) | Optional
Progression Or Recurrence | Diagnosis | Text indicator to represent the worsening or progression, recurrence of the cancer condition under investigation since the time of last contact with the subject. | 'Yes', 'No', 'Unevaluable', 'Unknown' | 3830556 (caDSR) | Required
Race | Demographic | An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is characterized by shared heredity, physical attributes and behavior, and in the case of humans, by common history, nationality, or geographic distribution. | 'white', 'american indian or alaska native', 'black or african american', 'asian', 'native hawaiian or other pacific islander', 'other', 'not reported', 'not allowed to collect' | C17049 (NCIt) | Required
Relationship Age At Diagnosis | Family History | (Applied to the case's relative) The numerical response indicating the age of a person, expressed in years, when they were diagnosed with a disease or disorder. | number (units: years) | 4828691 (caDSR) | Optional
Relationship Gender | Family History | (As applied to the case's relative) Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal roles. | 'Female', 'Male', 'Unknown', 'Unspecified' | 2200604 (caDSR) | Optional
Relationship Type | Family History | The subgroup that describes the state of connectedness between members of the unit of society organized around kinship ties. | 'Parent', 'Mother', 'Father', 'Child', 'Son', 'Daughter', 'Sibling', 'Brother', 'Sister', 'None', 'Don't Know' | 2690165 (caDSR) | Optional
Relative With Cancer History | Family History | Indicator to signify whether or not an individual's biological relative has been diagnosed with another type of cancer. (GDC definition: Indicator to signify whether or not an individual's biological relative has been diagnosed with any type of cancer.) | 'Yes', 'No', 'Unknown' | 3901752 (caDSR) | Optional
Site Of Resection Or Biopsy | Diagnosis | Term to represent the name of the organ resected that was contiguous to the primary disease site. GDC: Also includes liquid biopsy material. | string (Codes as provided at ICD-O-3 Online) | 3162811 (caDSR) | Preferred
Smoking History | Exposure | Category describing current smoking status and smoking history as self-reported by a patient. | 'Lifelong Non-Smoker', 'Current Smoker', 'Current Reformed Smoker for > 15 yrs', 'Current Reformed Smoker for < or = 15 yrs', 'Current Reformed Smoker Duration Not Specified' | 2181650 (caDSR) | Optional
Smoking Intensity | Exposure | Numeric computed value to represent lifetime tobacco exposure defined as number of cigarettes smoked per day x number of years smoked divided by 20. | number (units: pack-years) | 2955385 (caDSR) | Optional
Therapeutic Agent | Treatment | Any natural, endogenously-derived, synthetic or semi-synthetic compound with pharmacologic activity. A pharmacologic substance has one or more specific mechanism of action(s) through which it exerts one or more effect(s) on the human or animal body. They can be used to potentially prevent, diagnose, treat or relieve symptoms of a disease. Formulation specific agents and some combination agents are also classified as pharmacologic substances. | string (String value is the agent name as found in NCI Thesaurus, which indexes approved drugs via RxNorm as well as all experimental drugs.) | C1909 (NCIt) | Optional
Tissue Or Organ Of Origin | Diagnosis | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries for coding the site (topography) and the histology (morphology) of neoplasms. The description of an anatomical region or of a body part. Named locations of, or within, the body. A system of numbered categories for representation of data. | string (Codes as provided at ICD-O-3 Online) | 3226281 (caDSR) | Required
Treatment Intent Type | Treatment | Text term to identify the reason for the administration of a treatment regimen. | 'Adjuvant', 'Cancer control', 'Cure', 'Initial', 'Other', 'Palliative', 'Prevention', 'Primary', 'Progression', 'Progression after initial', 'Recurrence' | 2793511 (caDSR) | Optional
Treatment Or Therapy | Treatment | A yes/no/unknown/not applicable indicator related to the administration of therapeutic agents received before the body specimen was collected. | 'Yes', 'No', 'Not Applicable', 'Unknown' | 4231463 (caDSR) | Optional
Tumor Grade | Diagnosis | The degree of abnormality of cancer cells, a measure of differentiation, the extent to which cancer cells are similar in appearance and function to healthy cells of the same tissue type. The degree of differentiation often relates to the clinical behavior of the particular tumor. Based on the microscopic findings, tumor grade is commonly described by one of four degrees of severity. Histopathologic grade of a tumor may be used to plan treatment and estimate the future course, outcome, and overall prognosis of disease. Certain types of cancers, such as soft tissue sarcoma, primary brain tumors, lymphomas, and breast have special grading systems. | string (The accepted values for tumor_grade depend on the tumor site, type, and accepted grading system. These items should accompany the tumor_grade value as associated metadata.) | C18000 (NCIt) | Required
Tumor Stage | Diagnosis | The extent of a cancer in the body. Staging is usually based on the size of the tumor, whether lymph nodes contain cancer, and whether the cancer has spread from the original site to other parts of the body. | string (The accepted values for tumor_stage depend on the tumor site, type, and accepted staging system. These items should accompany the tumor_stage value as associated metadata.) | C16899 (NCIt) | Required
Vital Status | Diagnosis | The state or condition of being living or deceased; also includes the case where the vital status is unknown. | 'alive', 'dead', 'lost to follow-up', 'unknown' | C25717 (NCIt) | Required
Weight | Exposure | The weight of the patient measured in kilograms. | number (units: kilograms) | 651 (caDSR) | Optional
Year Of Birth | Demographic | Numeric value to represent the calendar year in which an individual was born. | number (units: year) | 2896954 (caDSR) | Required
Year Of Death | Demographic | Numeric value to represent the year of the death of an individual. | number (units: year) | 2897030 (caDSR) | Preferred
Years Smoked | Exposure | Numeric value (or unknown) to represent the number of years a person has been smoking. | number (units: years) | 3137957 (caDSR) | Required
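
Two of the elements in the table, BMI and Smoking Intensity, are computed from other elements. The sketch below works through both definitions as given in the table; the function names are hypothetical, and whether the GDC computes these values or accepts them as submitted is not stated in this report.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass divided by the square of body height, in kg per square meter.

    Height is taken in centimeters, matching the Height element's units.
    """
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def pack_years(cigarettes_per_day: float, years_smoked: float) -> float:
    """Cigarettes smoked per day times years smoked, divided by 20."""
    return cigarettes_per_day * years_smoked / 20.0


# 70 kg at 175 cm, and one pack (20 cigarettes) a day for 10 years.
print(round(bmi(70, 175), 1))
print(pack_years(20, 10))
```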

Discussion and Next Steps

We have established a baseline set of clinical data elements, with definitions, accepted values, and submission and persistence protocols to be used for GDC-wide searching and data filtering, and to inform clinical data submission and validation at GDC. This baseline set will be added to and otherwise modified as necessary over time by the GDC Data Model Working Group, with assistance and advice from NCI and external clinical experts in a manner similar to that described in this report.

Additional sets of elements are needed. In particular, the means to express longitudinal follow-up data will be devised and implemented, and disease-specific vocabularies will also be developed.

Additional standard operating procedures (SOPs) for updating the clinical element glossary and for curating submitted data with respect to the glossary are also required and will be developed.

GDC Data Model Working Group

  • Allison Heath, PhD
  • Josh Miller
  • Junjun Zhang, PhD
  • Greg Korzeniewski, PhD
  • Sharon Gaheen
  • Himanso Sahni
  • Mark Jensen, PhD

Internal Advisors

  • Louis B. Staudt, MD, PhD
  • Jean-Claude Zenklusen, PhD
  • Daniela Gerhard, PhD
  • Zhining Wang, PhD
  • Martin Ferguson, PhD

External Advisors

  • Stephen Chanock, MD
  • Douglas Levine, MD, PhD

(Created on: February 13th, 2017 • Last updated on: June 12th, 2017)