Validation of a Semiautomated Natural Language Processing-Based Procedure for Meta-Analysis of Cancer Susceptibility Gene Penetrance

Zhengyi Deng; Kanhua Yin; Yujia Bao; Victor Diego Armengol; Cathy Wang; Ankur Tiwari; Regina Barzilay; Giovanni Parmigiani; Danielle Braun; Kevin S Hughes

doi:10.1200/CCI.19.00043

JCO Clin Cancer Inform. 2019 Aug:3:1-9. doi: 10.1200/CCI.19.00043.

Authors

Zhengyi Deng¹, Kanhua Yin¹, Yujia Bao², Victor Diego Armengol¹, Cathy Wang³ ⁴, Ankur Tiwari¹, Regina Barzilay², Giovanni Parmigiani³ ⁴, Danielle Braun³ ⁴, Kevin S Hughes¹ ⁵

Affiliations

¹ Massachusetts General Hospital, Boston, MA.
² Massachusetts Institute of Technology, Boston, MA.
³ Harvard TH Chan School of Public Health, Boston, MA.
⁴ Dana-Farber Cancer Institute, Boston, MA.
⁵ Harvard Medical School, Boston, MA.

Abstract

Purpose: Quantifying the risk of cancer associated with pathogenic mutations in germline cancer susceptibility genes (that is, penetrance) enables the personalization of preventive management strategies. Conducting a meta-analysis is the best way to obtain robust risk estimates. We previously developed a natural language processing (NLP)-based abstract classifier that classifies abstracts as relevant to penetrance, prevalence of mutations, both, or neither. In this work, we evaluate the performance of this NLP-based procedure.

Materials and methods: We compared the semiautomated NLP-based procedure, which involves automated abstract classification and text mining, followed by human review of identified studies, with the traditional procedure that requires human review of all studies. Ten high-quality gene-cancer penetrance meta-analyses spanning 16 gene-cancer associations were used as the gold standard by which to evaluate the performance of our procedure. For each meta-analysis, we evaluated the number of abstracts that required human review (workload) and the ability to identify the studies that were included by the authors in their quantitative analysis (coverage).
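The two evaluation measures defined above can be written as a minimal sketch. The function names and the study identifiers are hypothetical, chosen only for illustration; the abstract does not describe how the authors computed these quantities in code.

```python
def workload(abstracts_for_review):
    """Number of distinct abstracts a human reviewer must read."""
    return len(set(abstracts_for_review))


def coverage(identified_studies, gold_standard_studies):
    """Fraction of gold-standard studies that the procedure identified."""
    gold = set(gold_standard_studies)
    if not gold:
        raise ValueError("gold standard must contain at least one study")
    return len(gold & set(identified_studies)) / len(gold)


# Hypothetical identifiers, not taken from the paper.
gold = ["s1", "s2", "s3", "s4"]
identified = ["s1", "s2", "x9"]

n_review = workload(identified)       # 3 abstracts to review
frac_found = coverage(identified, gold)  # 2 of 4 gold-standard studies
```

Studies outside the gold standard (here `x9`) add to the workload but do not affect coverage.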

Results: Compared with the traditional procedure, the semiautomated NLP-based procedure led to a lower workload across all 10 meta-analyses, with an overall 84% reduction (2,774 abstracts vs 16,941 abstracts) in the amount of human review required. Overall coverage was 93% (132 of 142 studies identified) before reviewing the references of identified studies. Reasons for the 10 missed studies included blank and poorly written abstracts. After reviewing references, nine of the previously missed studies were identified and coverage improved to 99% (141 of 142 studies).
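The reported percentages follow directly from the counts given in the Results. The short check below reproduces that arithmetic; the variable names are illustrative, and only the counts come from the abstract.

```python
# Counts reported in the Results.
traditional_abstracts = 16941    # abstracts reviewed under the traditional procedure
semiautomated_abstracts = 2774   # abstracts reviewed under the NLP-based procedure
gold_standard_studies = 142      # studies included in the 10 meta-analyses
found_before_references = 132    # identified before reviewing references
found_after_references = 141     # identified after reviewing references

workload_reduction = 1 - semiautomated_abstracts / traditional_abstracts
coverage_before = found_before_references / gold_standard_studies
coverage_after = found_after_references / gold_standard_studies

print(f"Workload reduction: {workload_reduction:.0%}")              # 84%
print(f"Coverage before reference review: {coverage_before:.0%}")  # 93%
print(f"Coverage after reference review: {coverage_after:.0%}")    # 99%
```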

Conclusion: We demonstrated that an NLP-based procedure can significantly reduce the review workload without compromising the ability to identify relevant studies. NLP algorithms have promising potential for reducing human efforts in the literature review process.

Publication types

Meta-Analysis
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Biomarkers, Tumor*
Computational Biology / methods
Genetic Predisposition to Disease*
Humans
Natural Language Processing*
Neoplasms / genetics*
Penetrance*
Reproducibility of Results
Workflow

Substances

Biomarkers, Tumor
