Biospecimen Data Standardization

Biospecimen data refers to information associated with the physical sample taken from a participant and its processing down to the aliquot level for sequencing experiments. This data falls into several key categories:

  • Standard Identifiers: project-unique identifiers and universally unique identifiers (UUIDs) that enable cases and samples to be referenced and linked to associated clinical and analytical data
  • Provenance: metadata that indicates the upstream sources of the sample (research program, research project, and donor individual) as well as the downstream products of sample processing (e.g., extracted DNA or RNA analyte)
  • Quality Control: metadata that expresses the values of quality control tests performed on biospecimens and analyzed products (e.g., percent tumor nuclei, RIN values, A260/A280 ratios)
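As a small illustration of the Standard Identifiers category, the sketch below checks whether a string is a well-formed UUID using Python's standard library. The function name `is_valid_uuid` and the sample values are hypothetical and are not part of any GDC tool.

```python
import uuid


def is_valid_uuid(value: str) -> bool:
    """Return True if value is a UUID in canonical hyphenated form."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts unhyphenated or brace-wrapped strings, so
    # compare against the canonical form to reject those variants.
    return str(parsed) == value.lower()


# A freshly generated UUID passes; a made-up project-style identifier does not.
print(is_valid_uuid(str(uuid.uuid4())))
print(is_valid_uuid("EXAMPLE-CASE-001"))
```

A check like this only confirms the format of an identifier; it says nothing about whether the UUID is actually registered to a case or sample.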

For major NCI CCG programs, biospecimen data is provided by a Biospecimen Core Resource (BCR) under contract to the NCI. Data is submitted in an established, schema-valid XML format. This data includes program and project identifiers, UUIDs, and the relationships between case, sample, and aliquot. UUIDs submitted by BCRs are typically adopted by the GDC.

For other submitters, data in the BCR XML format is accepted. However, the GDC also provides a simpler means of submitting a minimal set of biospecimen data, in which the data may be formatted as a JSON or tab-delimited (TSV) text file and submitted through the GDC Submission Portal.
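To make the two text formats concrete, here is a minimal sketch that serializes one sample record as both JSON and TSV. The identifiers and values are invented, and the flat column layout (including the `cases.submitter_id` link column) is an assumption made for illustration; the downloadable template for each entity is the authority on the actual fields and structure.

```python
import csv
import io
import json

# Hypothetical minimal sample record; every value here is made up.
record = {
    "type": "sample",
    "submitter_id": "EXAMPLE-SAMPLE-001",
    "cases.submitter_id": "EXAMPLE-CASE-001",  # assumed link to the parent case
    "sample_type": "Primary Tumor",
    "tissue_type": "Tumor",
}


def to_json(rec: dict) -> str:
    """Serialize a single record as a JSON array with one object."""
    return json.dumps([rec], indent=2)


def to_tsv(rec: dict) -> str:
    """Serialize a single record as a header row plus one data row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rec), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerow(rec)
    return buf.getvalue()


print(to_json(record))
print(to_tsv(record))
```

The TSV form carries one entity per row, which suits bulk submissions from spreadsheets, while JSON is easier to produce programmatically.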

The GDC Data Model uses a graph representation that places no technical limits on adjusting the entities and relationships. However, changes may affect quality control, reporting, accounting, and the user interface/experience. Therefore, major changes to the model needed to support new biospecimen information will undergo review by the GDC Data Model Change Control Board.
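The idea of a graph of linked biospecimen entities can be sketched with a small in-memory structure. This is only an illustration: the node identifiers are invented, and the chain case, sample, portion, analyte, aliquot is an assumed ordering used for the example rather than a statement of the full GDC Data Model.

```python
from typing import Dict, List, Optional

# Each node stores its entity type and the identifier of the upstream
# entity it was derived from. All identifiers are hypothetical.
NODES: Dict[str, dict] = {
    "case-1": {"type": "case", "parent": None},
    "sample-1": {"type": "sample", "parent": "case-1"},
    "portion-1": {"type": "portion", "parent": "sample-1"},
    "analyte-1": {"type": "analyte", "parent": "portion-1"},
    "aliquot-1": {"type": "aliquot", "parent": "analyte-1"},
}


def provenance(node_id: str) -> List[str]:
    """Walk parent links from a node up to its root and return the path."""
    chain: List[str] = []
    current: Optional[str] = node_id
    while current is not None:
        chain.append(current)
        current = NODES[current]["parent"]
    return chain


print(provenance("aliquot-1"))
# → ['aliquot-1', 'analyte-1', 'portion-1', 'sample-1', 'case-1']
```

Because each entity only stores a link to its parent, adding a new entity type means adding nodes and edges rather than restructuring existing records, which is the flexibility the paragraph above describes.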

Submitting Biospecimen Entities

Links to the dictionary entry for each biospecimen entity are listed below. Each entry contains information about each field and a downloadable template for submission.

Biospecimen Entity Field Information

| Term | Category | CDE | Required? |
|---|---|---|---|
| a260 a280 ratio | Analyte | 5432595 | No |
| adapter name | Read Group | --- | No |
| adapter sequence | Read Group | --- | No |
| aliquot quantity | Aliquot | --- | No |
| aliquot volume | Aliquot | --- | No |
| amount | Aliquot | --- | No |
| amount | Analyte | --- | No |
| analyte quantity | Analyte | --- | No |
| analyte type id | Aliquot | 5432508 | No |
| analyte type id | Analyte | 5432508 | No |
| analyte type | Aliquot | 2513915 | No |
| analyte type | Analyte | 2513915 | Yes |
| analyte volume | Analyte | --- | No |
| base caller name | Read Group | --- | No |
| base caller version | Read Group | --- | No |
| biospecimen anatomic site | Sample | 4742851 | No |
| biospecimen laterality | Sample | 2007875 | No |
| bone marrow malignant cells | Slide | --- | No |
| catalog reference | Sample | --- | No |
| chipseq antibody | Read Group | --- | No |
| chipseq target | Read Group | --- | No |
| composition | Sample | 5432591 | No |
| concentration | Aliquot | 5432594 | No |
| concentration | Analyte | 5432594 | No |
| consent type | Case | --- | No |
| creation datetime | Portion | 5432592 | No |
| current weight | Sample | 5432606 | No |
| days to collection | Sample | 3008340 | No |
| days to consent | Case | --- | No |
| days to lost to followup | Case | 6154721 | No |
| days to sample procurement | Sample | --- | No |
| days to sequencing | Read Group | --- | No |
| diagnosis pathologically confirmed | Sample | --- | No |
| disease type | Case | 6161017 | No |
| distance normal to tumor | Sample | 3088708 | No |
| distributor reference | Sample | --- | No |
| experiment name | Read Group | --- | Yes |
| experimental protocol type | Analyte | --- | No |
| flow cell barcode | Read Group | --- | No |
| fragment maximum length | Read Group | --- | No |
| fragment mean length | Read Group | --- | No |
| fragment minimum length | Read Group | --- | No |
| fragment standard deviation length | Read Group | --- | No |
| fragmentation enzyme | Read Group | --- | No |
| freezing method | Sample | 5432607 | No |
| growth rate | Sample | --- | No |
| includes spike ins | Read Group | --- | No |
| index date | Case | 6154722 | No |
| initial weight | Sample | 5432605 | No |
| instrument model | Read Group | 5432604 | No |
| intermediate dimension | Sample | --- | No |
| is ffpe | Portion | 4170557 | No |
| is ffpe | Sample | 4170557 | No |
| is paired end | Read Group | --- | Yes |
| lane number | Read Group | --- | No |
| library name | Read Group | --- | Yes |
| library preparation kit catalog number | Read Group | --- | No |
| library preparation kit name | Read Group | --- | No |
| library preparation kit vendor | Read Group | --- | No |
| library preparation kit version | Read Group | --- | No |
| library selection | Read Group | --- | Yes |
| library strand | Read Group | --- | No |
| library strategy | Read Group | --- | Yes |
| longest dimension | Sample | 5432602 | No |
| lost to followup | Case | 6161018 | No |
| method of sample procurement | Sample | --- | No |
| multiplex barcode | Read Group | --- | No |
| no matched normal low pass wgs | Aliquot | --- | No |
| no matched normal targeted sequencing | Aliquot | --- | No |
| no matched normal wgs | Aliquot | --- | No |
| no matched normal wxs | Aliquot | --- | No |
| normal tumor genotype snp match | Analyte | 4588156 | No |
| number expect cells | Read Group | --- | No |
| number proliferating cells | Slide | 5432636 | No |
| oct embedded | Sample | 5432538 | No |
| passage count | Sample | --- | No |
| pathology report uuid | Sample | --- | No |
| percent eosinophil infiltration | Slide | 2897700 | No |
| percent follicular component | Slide | --- | No |
| percent granulocyte infiltration | Slide | 2897705 | No |
| percent inflam infiltration | Slide | 2897695 | No |
| percent lymphocyte infiltration | Slide | 2897710 | No |
| percent monocyte infiltration | Slide | 5455535 | No |
| percent necrosis | Slide | 2841237 | No |
| percent neutrophil infiltration | Slide | 2841267 | No |
| percent normal cells | Slide | 2841233 | No |
| percent rhabdoid features | Slide | 6790120 | No |
| percent sarcomatoid features | Slide | 2429786 | No |
| percent stromal cells | Slide | 2841241 | No |
| percent tumor cells | Slide | 5432686 | No |
| percent tumor nuclei | Slide | 2841225 | No |
| platform | Read Group | --- | Yes |
| portion number | Portion | 5432711 | No |
| preservation method | Sample | 5432521 | No |
| primary site | Case | 6161019 | No |
| prostatic chips positive count | Slide | --- | No |
| prostatic chips total count | Slide | --- | No |
| prostatic involvement percent | Slide | --- | No |
| read group name | Read Group | --- | Yes |
| read length | Read Group | --- | Yes |
| ribosomal rna 28s 16s ratio | Analyte | --- | No |
| rin | Read Group | 5278775 | No |
| rna integrity number | Analyte | --- | No |
| sample ordinal | Sample | --- | No |
| sample type id | Sample | --- | No |
| sample type | Sample | 3111302 | Yes |
| section location | Slide | --- | Yes |
| selected normal low pass wgs | Aliquot | --- | No |
| selected normal targeted sequencing | Aliquot | --- | No |
| selected normal wgs | Aliquot | --- | No |
| selected normal wxs | Aliquot | --- | No |
| sequencing center | Read Group | --- | Yes |
| sequencing date | Read Group | --- | No |
| shortest dimension | Sample | 5432603 | No |
| single cell library | Read Group | --- | No |
| size selection range | Read Group | --- | No |
| source center | Aliquot | --- | No |
| spectrophotometer method | Analyte | 3008378 | No |
| spike ins concentration | Read Group | --- | No |
| spike ins fasta | Read Group | --- | No |
| target capture kit catalog number | Read Group | --- | No |
| target capture kit name | Read Group | --- | No |
| target capture kit target region | Read Group | --- | No |
| target capture kit vendor | Read Group | --- | No |
| target capture kit version | Read Group | --- | No |
| target capture kit | Read Group | --- | Yes |
| time between clamping and freezing | Sample | 5432611 | No |
| time between excision and freezing | Sample | 5432612 | No |
| tissue collection type | Sample | --- | No |
| tissue microarray coordinates | Slide | --- | No |
| tissue type | Sample | 5432687 | Yes |
| to trim adapter sequence | Read Group | --- | No |
| tumor code id | Sample | --- | No |
| tumor code | Sample | --- | No |
| tumor descriptor | Sample | 3288124 | No |
| weight | Portion | 5432593 | No |
| well number | Analyte | 5432613 | No |
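
The Required? column above lends itself to a simple pre-submission check. The sketch below lists only the terms marked Yes in the table, grouped by category, and reports which of them are absent from a record. The underscore-style field names, the entity type keys, and the function `missing_required` are assumptions for illustration; links to parent entities and any other constraints enforced by the dictionary are not covered.

```python
from typing import Dict, List

# Terms marked "Yes" in the table above, keyed by category.
# Underscored field names are an assumed spelling of the table terms.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "analyte": ["analyte_type"],
    "read_group": [
        "experiment_name",
        "is_paired_end",
        "library_name",
        "library_selection",
        "library_strategy",
        "platform",
        "read_group_name",
        "read_length",
        "sequencing_center",
        "target_capture_kit",
    ],
    "sample": ["sample_type", "tissue_type"],
    "slide": ["section_location"],
}


def missing_required(entity_type: str, record: dict) -> List[str]:
    """Return required fields that are absent or empty in the record."""
    required = REQUIRED_FIELDS.get(entity_type, [])
    # Only None and the empty string count as missing, so a legitimate
    # False (e.g. is_paired_end) or 0 is treated as present.
    return [name for name in required if record.get(name) in (None, "")]


print(missing_required("sample", {"sample_type": "Primary Tumor"}))
# → ['tissue_type']
```

Categories with no required terms in the table (Aliquot, Case, Portion) return an empty list under this check.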