
Combinatorial and Machine Learning Approaches for Improved Somatic Variant Calling from Formalin-Fixed Paraffin-Embedded Genome Sequence Data

Frontiers in Genetics, April 27, 2022; 10.3389/fgene.2022.834764

Formalin fixation of paraffin-embedded tissue samples is a well-established method for preserving tissue and is routinely used in clinical settings. Although formalin-fixed, paraffin-embedded (FFPE) tissues are deemed crucial for research and clinical applications, the fixation process results in molecular damage to nucleic acids, thus confounding their use in genome sequence analysis. Methods to improve genomic data quality from FFPE tissues have emerged, but there remains significant room for improvement.

Here, we use whole-genome sequencing (WGS) data from matched Fresh Frozen (FF) and FFPE tissue samples to optimize a sensitive and precise FFPE single nucleotide variant (SNV) calling approach. We present methods to reduce the prevalence of false-positive SNVs by applying combinatorial techniques to five publicly available variant callers. We also introduce FFPolish, a novel variant classification method that efficiently classifies FFPE-specific false-positive variants.
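As an illustration of what a combinatorial approach over several callers can look like, the sketch below keeps only SNVs reported by at least a minimum number of callers. This is a generic consensus-voting scheme, not necessarily the combination used in the paper; the caller names, the variant tuples, and the `min_callers` threshold are all hypothetical.

```python
from collections import Counter

def consensus_calls(calls_by_caller, min_callers):
    """Return variants reported by at least `min_callers` callers.

    `calls_by_caller` maps a caller name to a set of variants, each
    represented as a (chrom, pos, ref, alt) tuple.
    """
    counts = Counter()
    for variants in calls_by_caller.values():
        # set() guards against a caller listing the same variant twice
        counts.update(set(variants))
    return {variant for variant, n in counts.items() if n >= min_callers}

# Hypothetical call sets from three callers (names are placeholders).
calls_by_caller = {
    "caller_a": {("chr1", 100, "C", "T"), ("chr2", 200, "G", "A")},
    "caller_b": {("chr1", 100, "C", "T"), ("chr3", 300, "A", "G")},
    "caller_c": {("chr1", 100, "C", "T"), ("chr2", 200, "G", "A")},
}

# Keep variants supported by at least two of the three callers.
consensus = consensus_calls(calls_by_caller, min_callers=2)
```

Raising `min_callers` generally trades sensitivity for precision, since a variant must then be supported by more independent callers to be retained.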

Our combinatorial and statistical techniques improve precision and F1 scores compared with the results of each publicly available tool run individually.
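For readers unfamiliar with these metrics, the sketch below computes precision, recall, and F1 for a set of FFPE calls. It assumes that the calls from the matched FF sample are treated as the truth set, which is an assumption made for illustration; the variant tuples are hypothetical.

```python
def precision_recall_f1(called, truth):
    """Compute precision, recall and F1 for a call set against a truth set."""
    tp = len(called & truth)   # called and present in truth
    fp = len(called - truth)   # called but absent from truth
    fn = len(truth - called)   # in truth but not called
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical variants as (chrom, pos, ref, alt) tuples.
ff_truth = {("chr1", 100, "C", "T"), ("chr2", 200, "G", "A"), ("chr3", 300, "A", "G")}
ffpe_calls = {("chr1", 100, "C", "T"), ("chr2", 200, "G", "A"),
              ("chr4", 400, "C", "T"), ("chr5", 500, "G", "A")}

precision, recall, f1 = precision_recall_f1(ffpe_calls, ff_truth)
```

F1 is the harmonic mean of precision and recall, so removing false positives raises it only as long as true positives are not lost at a comparable rate.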

Keywords: Burkitt Lymphoma, Cervical Cancer, mRNA-Seq, miRNA-Seq, WGS, FFPE

To access controlled CGCI BLGSP and CGCI HTMCP-CC data, users must submit an application via dbGaP. To begin the application process, please view the information provided on the dbGaP Authorized Access Login Page under "dbGaP Data Download."

Supplemental Data

Additional Resources

Instructions for Data Download

Open Access Data

  1. Download the appropriate manifest file from the publication page
  2. Use the manifest file to download data using the GDC Data Transfer Tool (DTT) or the GDC API
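
The open-access steps above can be sketched for the GDC API route as follows. The sketch assumes the manifest is a tab-separated file whose header row includes an `id` column holding file UUIDs, and that each file is retrieved from the GDC API data endpoint by UUID. The manifest contents shown are placeholders, not real GDC files, and the sketch only builds the URLs rather than downloading anything.

```python
import csv
import io

# GDC API data endpoint; one file UUID is appended per request.
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"

def read_manifest_ids(manifest_text):
    """Extract file UUIDs from the text of a tab-separated manifest."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    return [row["id"] for row in reader if row.get("id")]

def data_urls(file_ids):
    """Build one download URL per file UUID."""
    return [f"{GDC_DATA_ENDPOINT}/{file_id}" for file_id in file_ids]

# Placeholder manifest (assumed column layout; UUIDs are not real files).
example_manifest = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "11111111-1111-1111-1111-111111111111\tsample_a.vcf.gz\t"
    "00000000000000000000000000000000\t1024\treleased\n"
    "22222222-2222-2222-2222-222222222222\tsample_b.vcf.gz\t"
    "00000000000000000000000000000000\t2048\treleased\n"
)

file_ids = read_manifest_ids(example_manifest)
urls = data_urls(file_ids)
```

Each resulting URL can then be fetched with any HTTP client. For large manifests the GDC DTT is usually the more convenient option, since it works directly from the manifest file.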

Controlled Access Data

  1. Download the appropriate manifest file from the publication page
  2. Download a token from the GDC Data Portal
  3. Use the manifest file and token to download data using the GDC DTT or the GDC API
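
For controlled-access files, the difference in the GDC API route is that each request carries the token. The sketch below assumes the token is passed in the `X-Auth-Token` request header and that files are requested by UUID from the GDC API data endpoint. The UUID and token values are placeholders, and the request is only constructed, not sent.

```python
import urllib.request

# GDC API data endpoint; one file UUID is appended per request.
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"

def build_controlled_request(file_id, token):
    """Build an authenticated request for one controlled-access file."""
    url = f"{GDC_DATA_ENDPOINT}/{file_id}"
    # Token text saved from the portal may end with a newline; strip it.
    return urllib.request.Request(url, headers={"X-Auth-Token": token.strip()})

# Placeholder values: substitute a UUID from the manifest and the
# contents of the token file downloaded from the GDC Data Portal.
request = build_controlled_request(
    "11111111-1111-1111-1111-111111111111",
    "PLACEHOLDER-TOKEN\n",
)
```

Passing `request` to `urllib.request.urlopen` would perform the download. Tokens should be kept out of source control and shared scripts, since they grant access to controlled data.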

For assistance, please contact the GDC Help Desk: support@nci-gdc.datacommons.io.