Gene Mutation Classification through Text Evidence Facilitating Cancer Tumour Detection

Meenu Gupta; Hao Wu; Simrann Arora; Akash Gupta; Gopal Chaudhary; Qiaozhi Hua

doi:10.1155/2021/8689873

J Healthc Eng. 2021 Jul 27:2021:8689873. doi: 10.1155/2021/8689873. eCollection 2021.

Authors

Meenu Gupta¹, Hao Wu², Simrann Arora³, Akash Gupta³, Gopal Chaudhary³, Qiaozhi Hua⁴

Affiliations

¹ Department of Computer Science and Engineering, Chandigarh University, Ajitgarh, Punjab, India.
² Digital Zhejiang Technology Operations Co., Ltd., Hangzhou, China.
³ Bharati Vidyapeeth's College of Engineering, New Delhi, India.
⁴ Computer School, Hubei University of Arts and Science, Xiangyang 441000, China.

Abstract

A cancer tumour can contain thousands of genetic mutations. Despite advances in technology, the task of distinguishing driver mutations, which contribute to tumour growth, from passenger (neutral) mutations is still done manually. This is a time-consuming process in which pathologists interpret every genetic mutation by hand from the clinical evidence. Each piece of clinical evidence belongs to one of nine classes, but the criterion of classification is still unknown. The main aim of this research is to propose a multiclass classifier that classifies genetic mutations based on clinical evidence (i.e., the text description of these genetic mutations) using Natural Language Processing (NLP) techniques. The dataset for this research is taken from Kaggle and is provided by the Memorial Sloan Kettering Cancer Center (MSKCC); it was contributed by world-class researchers and oncologists. Three text transformation models, namely, CountVectorizer, TfidfVectorizer, and Word2Vec, are used to convert the text into numerical features (token counts, TF-IDF weights, and word embeddings, respectively). Three machine learning classification models, namely, Logistic Regression (LR), Random Forest (RF), and XGBoost (XGB), along with a Recurrent Neural Network (RNN) deep learning model, are applied to the sparse matrix (keyword count representation) of the text descriptions. The accuracy of each proposed classifier is evaluated using the confusion matrix. The empirical results show that the RNN model performed better than the other proposed classifiers, with the highest accuracy of 70%.
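
As a rough illustration of two steps named in the abstract (turning text descriptions into a matrix of token counts, and reading accuracy off a confusion matrix), the following minimal sketch uses only the Python standard library. It is not the authors' pipeline: the two example sentences, the label lists, and the helper names `tokenize`, `count_matrix`, `confusion_matrix`, and `accuracy` are all hypothetical. The tokenizer only loosely mimics a count-based vectorizer by lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def count_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            row[index[tok]] = n
        rows.append(row)
    return vocab, rows


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes (labels 1..n_classes)."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t - 1][p - 1] += 1
    return cm


def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


# Made-up text snippets standing in for clinical evidence (not from the dataset).
docs = [
    "The mutation activates kinase signalling",
    "The mutation is likely neutral",
]
vocab, rows = count_matrix(docs)

# Made-up true and predicted labels over nine classes (not results from the paper).
y_true = [1, 2, 2, 7, 7, 4, 9, 1]
y_pred = [1, 2, 7, 7, 7, 4, 9, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=9)

print(vocab)
print(rows)
print(accuracy(cm))
```

In this toy setup, off-diagonal entries of `cm` show which classes are confused with one another, which a single accuracy figure hides.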

Publication types

Research Support, Non-U.S. Gov't
Retracted Publication

MeSH terms

Humans
Machine Learning
Mutation / genetics
Natural Language Processing*
Neoplasms* / diagnosis
Neoplasms* / genetics
Neural Networks, Computer