Developing Customizable Cancer Information Extraction Modules for Pathology Reports Using CLAMP

Ergin Soysal; Jeremy L Warner; Jingqi Wang; Min Jiang; Krysten Harvey; Sandeep Kumar Jain; Xiao Dong; Hsing-Yi Song; Harish Siddhanamatha; Liwei Wang; Qi Dai; Qingxia Chen; Xianglin Du; Cui Tao; Ping Yang; Joshua Charles Denny; Hongfang Liu; Hua Xu

doi:10.3233/SHTI190383

Stud Health Technol Inform. 2019 Aug 21:264:1041-1045. doi: 10.3233/SHTI190383.

Authors

Ergin Soysal¹, Jeremy L Warner² ³ ⁴, Jingqi Wang¹, Min Jiang¹, Krysten Harvey³, Sandeep Kumar Jain⁵, Xiao Dong¹, Hsing-Yi Song¹, Harish Siddhanamatha¹, Liwei Wang⁶, Qi Dai², Qingxia Chen⁷, Xianglin Du⁸, Cui Tao¹, Ping Yang⁶, Joshua Charles Denny² ³, Hongfang Liu⁶, Hua Xu¹

Affiliations

¹ School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas.
² Department of Medicine, Vanderbilt University, Nashville, Tennessee.
³ Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee.
⁴ Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, Tennessee.
⁵ Vanderbilt School of Medicine, Vanderbilt University, Nashville, Tennessee.
⁶ Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, Minnesota.
⁷ Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee.
⁸ School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas.

Abstract

Natural language processing (NLP) technologies have been successfully applied to cancer research by enabling automated extraction of phenotypic information from narratives in electronic health records (EHRs), such as pathology reports; however, developing customized NLP solutions requires substantial effort. To facilitate the adoption of NLP in cancer research, we have developed a set of customizable modules for extracting a comprehensive range of cancer-related information from pathology reports (e.g., tumor size, tumor stage, and biomarkers) by leveraging the existing CLAMP system, which provides user-friendly interfaces for building NLP solutions customized to individual needs. Evaluation using annotated data at Vanderbilt University Medical Center showed that CLAMP-Cancer could extract diverse types of cancer information with good F-measures (0.80-0.98). We then applied CLAMP-Cancer to an information extraction task at Mayo Clinic and showed that a customized NLP system could be built quickly, with performance comparable to that of an existing system at Mayo Clinic. CLAMP-Cancer is freely available for academic use.

Keywords: Electronic Health Records; Information Storage and Retrieval; Natural Language Processing.

MeSH terms

Electronic Health Records
Humans
Information Storage and Retrieval*
Natural Language Processing
Neoplasms*
Research Report
