Performance Comparison of Machine Learning Approaches on Hepatitis C Prediction Employing Data Mining Techniques

Bioengineering (Basel). 2023 Apr 17;10(4):481. doi: 10.3390/bioengineering10040481.

Abstract

Hepatitis C is a liver infection caused by the hepatitis C virus (HCV). Because symptoms appear late, early diagnosis of this disease is difficult. Efficient prediction can save patients from permanent liver damage. The main objective of this study is to employ various machine learning techniques to predict this disease from common and affordable blood test data, so that patients can be diagnosed and treated in the early stages. In this study, six machine learning algorithms (support vector machine (SVM), K-nearest neighbors (KNN), logistic regression, decision tree, extreme gradient boosting (XGBoost), and artificial neural networks (ANN)) were applied to two datasets. The performance of these techniques was compared in terms of confusion matrix, precision, recall, F1 score, accuracy, receiver operating characteristic (ROC) curve, and area under the curve (AUC) to identify a method appropriate for predicting this disease. The analysis on the NHANES and UCI datasets revealed that SVM and XGBoost, which achieved the highest accuracy and AUC among the tested models (>80%), can be effective tools for medical professionals using routine and affordable blood test data to predict hepatitis C.
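The evaluation measures named above can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the function names (`confusion_counts`, `classification_metrics`, `roc_auc`), the 0.5 decision threshold, and the toy labels and scores are all assumptions made up for illustration, and none of the numbers come from the NHANES or UCI datasets. AUC is computed here through its rank interpretation, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count the four cells of a binary confusion matrix (1 = positive)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Derive precision, recall, F1 score, and accuracy from the confusion counts."""
    c = confusion_counts(y_true, y_pred)
    tp, tn, fp, fn = c["tp"], c["tn"], c["fp"], c["fn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not taken from the study).
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.65, 0.4, 0.55, 0.1, 0.8, 0.3]
# Assumed decision threshold of 0.5 to turn scores into class predictions.
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]

if __name__ == "__main__":
    print(confusion_counts(toy_labels, toy_preds))
    print(classification_metrics(toy_labels, toy_preds))
    print(roc_auc(toy_labels, toy_scores))
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason studies of this kind commonly report both.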

Keywords: AUC; HCV; XGBoost; data mining; decision tree; hepatitis C virus; machine learning techniques; performance measurements.

Grants and funding

This work was funded by the Ministry of Science and Technology, Taiwan, Grant Nos. MOST 111-2221-E-027-132, 111-2119-M-027-001, 110-2622-E-027-02, and 111-2221-E-027-134.