Biomedical Literature Mining for Repurposing Laboratory Tests

Methods Mol Biol. 2022:2496:91-109. doi: 10.1007/978-1-0716-2305-3_5.

Abstract

Epidemiological studies that identify biological markers of disease state are valuable, but they can be time-consuming and expensive, and they require extensive intuition and expertise. Furthermore, not all hypothesized markers will be borne out in a study, so high-quality initial hypotheses are crucial. In this chapter, we describe a high-throughput pipeline that produces a ranked list of high-quality hypothesized biomarkers for diseases. We review an example use of this approach in three steps: generating a large number of candidate disease biomarker hypotheses from machine learning models, filtering and ranking them by potential novelty using text mining, and corroborating the most promising hypotheses with further statistical modeling. The example applies the pipeline to a large electronic health record dataset and the PubMed corpus, and it finds several promising hypothesized laboratory tests with previously undocumented correlations to particular diseases.
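The filter-and-rank step described above can be illustrated with a minimal sketch. It is not the chapter's implementation: the `Candidate` fields, the `min_score` threshold, the use of literature co-mention counts as a novelty proxy, and all values below are assumptions made purely for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothesized (laboratory test, disease) association."""
    lab_test: str
    disease: str
    model_score: float  # hypothetical strength of association from an ML model
    co_mentions: int    # hypothetical count of abstracts mentioning both terms


def rank_by_novelty(candidates: List[Candidate],
                    min_score: float = 0.5) -> List[Candidate]:
    """Drop weak candidates, then put the least-documented pairs first.

    Assumption: fewer literature co-mentions means a more novel hypothesis.
    Ties are broken in favor of the higher model score.
    """
    kept = [c for c in candidates if c.model_score >= min_score]
    return sorted(kept, key=lambda c: (c.co_mentions, -c.model_score))


# Placeholder data only; these are not results from the chapter.
candidates = [
    Candidate("test_A", "disease_X", 0.9, 120),
    Candidate("test_B", "disease_X", 0.7, 0),
    Candidate("test_C", "disease_Y", 0.2, 0),
    Candidate("test_D", "disease_Y", 0.8, 3),
]

ranked = rank_by_novelty(candidates)
for c in ranked:
    print(c.lab_test, c.disease, c.co_mentions)
```

In this sketch, candidates that survive the score filter and are rarely co-mentioned in the literature rise to the top, where they would then be passed on for statistical corroboration.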

Keywords: Biomarker discovery; Electronic health records; Epidemiology; Machine learning; Text mining.

Publication types

  • Review
  • Research Support, Non-U.S. Gov't
  • Research Support, N.I.H., Extramural

MeSH terms

  • Data Mining*
  • Electronic Health Records
  • Machine Learning*
  • Models, Statistical
  • Publications