An explainable machine learning-driven proposal of pulmonary fibrosis biomarkers

Comput Struct Biotechnol J. 2023;21:2305-2315. doi: 10.1016/j.csbj.2023.03.043. Epub 2023 Mar 25.

Abstract

Pulmonary fibrosing diseases are at the epicenter of biomedical research, owing both to their increasing prevalence and to their association with SARS-CoV-2 infections. Research on idiopathic pulmonary fibrosis, the most lethal of the interstitial lung diseases, is in need of new biomarkers and potential disease targets, a goal that machine learning techniques could help reach sooner. In this study, we used Shapley values to explain the decisions of an ensemble learning model trained to classify samples as either pulmonary fibrosis or steady state, based on the expression values of deregulated genes. This process yielded a full and a concise (laconic) set of features, each capable of separating the phenotypes at least as well as previously published marker sets. Indicatively, we achieved a maximum increase of 6% in specificity and 5% in Matthews correlation coefficient. Evaluation on an additional independent dataset showed that our feature set has greater generalization potential than the other marker sets. Ultimately, the proposed gene lists are expected to serve not only as new sets of diagnostic markers, but also as a target pool for future research initiatives.
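To make the role of Shapley values in feature selection more concrete, the following is a minimal sketch of an exact Shapley computation for a single sample. It is not the authors' pipeline: the scoring function `toy_score`, the three placeholder "gene" inputs, and the all-zero baseline are hypothetical and chosen only for illustration. Exact enumeration is feasible here because there are only three features; explaining a real ensemble model over many genes would normally rely on an approximation.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, sample, baseline):
    """Exact Shapley values for one sample.

    Features absent from a coalition are replaced by their baseline value.
    The cost grows exponentially with the number of features, so this is
    only practical for very small feature sets.
    """
    n = len(sample)
    features = range(n)

    def value(coalition):
        # Model output when only the features in `coalition` take the
        # sample's values and all others stay at the baseline.
        x = [sample[i] if i in coalition else baseline[i] for i in features]
        return predict(x)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_score(x):
    """Hypothetical stand-in for a classifier score (an assumption, not the
    model from the study): one additive gene and one interacting pair."""
    g1, g2, g3 = x
    return 2.0 * g1 + g2 * g3


sample = [1.0, 2.0, 3.0]    # placeholder expression values
baseline = [0.0, 0.0, 0.0]  # placeholder reference expression

phi = shapley_values(toy_score, sample, baseline)

# Rank features by absolute contribution, as one would when shortlisting
# candidate markers from per-feature attributions.
ranking = sorted(range(len(phi)), key=lambda i: abs(phi[i]), reverse=True)
```

In this toy case the attributions sum to the difference between the score of the sample and the score of the baseline, which is the efficiency property of Shapley values. The interaction term is split equally between the two features that take part in it.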

Keywords: Diagnostic biomarkers; Idiopathic pulmonary fibrosis (IPF); Machine learning; Omics data.