Explainable artificial intelligence model for identifying COVID-19 gene biomarkers

Fatma Hilal Yagin; İpek Balikci Cicek; Abedalrhman Alkhateeb; Burak Yagin; Cemil Colak; Mohammad Azzeh; Sami Akbulut

doi:10.1016/j.compbiomed.2023.106619

Comput Biol Med. 2023 Mar:154:106619. doi: 10.1016/j.compbiomed.2023.106619. Epub 2023 Feb 1.

Authors

Fatma Hilal Yagin¹, İpek Balikci Cicek², Abedalrhman Alkhateeb³, Burak Yagin⁴, Cemil Colak⁵, Mohammad Azzeh⁶, Sami Akbulut⁷

Affiliations

¹ Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, 44280, Malatya, Turkey. Electronic address: hilal.yagin@inonu.edu.tr.
² Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, 44280, Malatya, Turkey. Electronic address: ipek.balikci@inonu.edu.tr.
³ Software Engineering Department, King Hussein School for Computing Sciences, Amman, Jordan. Electronic address: a.lkhateeb@psut.edu.jo.
⁴ Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, 44280, Malatya, Turkey. Electronic address: burak.yagin@inonu.edu.tr.
⁵ Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, 44280, Malatya, Turkey. Electronic address: cemil.colak@inonu.edu.tr.
⁶ Data Science Department, King Hussein School for Computing Sciences, Amman, Jordan. Electronic address: m.azzeh@psut.edu.jo.
⁷ Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, 44280, Malatya, Turkey; Inonu University, Faculty of Medicine, Department of Surgery, 44280, Malatya, Turkey; Inonu University, Faculty of Medicine, Department of Public Health, 44280, Malatya, Turkey. Electronic address: akbulutsami@gmail.com.

Abstract

Aim: COVID-19 has revealed the need for fast and reliable methods to assist clinicians in diagnosing the disease. This article presents a model that applies machine-learning-based explainable artificial intelligence (XAI) methods to COVID-19 metagenomic next-generation sequencing (mNGS) samples.

Methods: The data set comprised expression levels of 15,979 genes from 234 patients, of whom 141 (60.3%) were COVID-19 negative and 93 (39.7%) were COVID-19 positive. The least absolute shrinkage and selection operator (LASSO) method was applied to select genes associated with COVID-19. The Support Vector Machine-Synthetic Minority Oversampling Technique (SVM-SMOTE) was used to handle the class imbalance problem. Logistic regression (LR), SVM, random forest (RF), and extreme gradient boosting (XGBoost) models were constructed to predict COVID-19. An explainable approach based on local interpretable model-agnostic explanations (LIME) and SHapley Additive exPlanations (SHAP) was applied to identify COVID-19-associated candidate biomarker genes and improve the final model's interpretability.
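
The oversampling step can be illustrated with a minimal sketch. It shows the interpolation idea behind plain SMOTE, not the SVM-SMOTE variant named above, which additionally uses support vectors to decide where to generate samples. The function name smote_like_oversample, the two-feature toy data, and the 1:1 target ratio are assumptions made for illustration; the abstract does not state the target ratio or the sampling settings.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (plain SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Class counts reported in the abstract.
n_negative, n_positive = 141, 93
# Assumption: oversample the positive class up to a 1:1 ratio.
n_needed = n_negative - n_positive

# Hypothetical two-gene expression values standing in for the minority class.
toy_rng = random.Random(1)
toy_minority = [[toy_rng.gauss(5, 1), toy_rng.gauss(2, 1)]
                for _ in range(n_positive)]

synthetic = smote_like_oversample(toy_minority, n_needed)
balanced_positive = len(toy_minority) + len(synthetic)
```

Each synthetic point lies on a line segment between two real minority samples, so no value falls outside the observed range of the minority class. In practice, oversampling is normally applied to the training split only, so that evaluation data contain no synthetic samples.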

Results: For the diagnosis of COVID-19, the XGBoost model (accuracy: 0.930) outperformed the RF (accuracy: 0.912), SVM (accuracy: 0.877), and LR (accuracy: 0.912) models. According to the SHAP analysis, the three most important genes associated with COVID-19 were IFI27, LGR6, and FAM83A. The LIME results showed that high IFI27 expression in particular increased the predicted probability of the positive class.
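
How a Shapley-based method ranks features can be illustrated with a small worked example. The sketch below computes exact Shapley values for a toy three-feature scoring function by averaging each feature's marginal contribution over all feature orderings. The feature names gene_a, gene_b, and gene_c, the weights in toy_score, and the all-zero baseline are invented for illustration and are not taken from the study. The SHAP library typically relies on faster estimators and a background data set rather than this brute-force enumeration.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over all feature orderings, with absent features set to the baseline."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # switch this feature "on"
            score = predict(current)
            totals[name] += score - previous
            previous = score
    return {name: total / len(orderings) for name, total in totals.items()}

# Hypothetical scoring function; the weights are invented for illustration.
def toy_score(x):
    return 0.6 * x["gene_a"] + 0.3 * x["gene_b"] * x["gene_c"] + 0.1 * x["gene_c"]

baseline = {"gene_a": 0.0, "gene_b": 0.0, "gene_c": 0.0}
patient = {"gene_a": 2.0, "gene_b": 1.0, "gene_c": 1.0}

phi = shapley_values(toy_score, patient, baseline)
# Rank features by the magnitude of their contribution to this prediction.
ranking = sorted(phi, key=lambda g: abs(phi[g]), reverse=True)
```

The contributions sum to the difference between the score for the patient and the score for the baseline, which is the property that lets each value be read as one feature's share of a single prediction. In this toy case the interaction term between gene_b and gene_c is split equally between the two features.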

Conclusions: The proposed model (XGBoost) was able to predict COVID-19 successfully. The results show that machine learning combined with LIME and SHAP can explain the biomarker prediction for COVID-19 and provide clinicians with an intuitive understanding and interpretability of the impact of risk factors in the model.

Keywords: COVID-19; Explainable artificial intelligence; LIME; SHAP; XGBoost.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Artificial Intelligence*
COVID-19* / diagnosis
COVID-19* / genetics
Genetic Markers
Humans
Neoplasm Proteins
Risk Factors

Substances

lime
Genetic Markers
FAM83A protein, human
Neoplasm Proteins