Differentiating between liver diseases by applying multiclass machine learning approaches to transcriptomics of liver tissue or blood-based samples

JHEP Rep. 2022 Aug 18;4(10):100560. doi: 10.1016/j.jhepr.2022.100560. eCollection 2022 Oct.

Abstract

Background & aims: Liver disease carries a significant healthcare burden, and its diagnosis frequently requires a combination of blood tests, imaging, and invasive liver biopsy. Distinguishing between inflammatory liver diseases, which may have similar clinical presentations, is particularly challenging. In this study, we implemented a machine learning pipeline for the identification of diagnostic gene expression biomarkers across several alcohol-associated and non-alcohol-associated liver diseases, using either liver tissue or blood-based samples.

Methods: We collected peripheral blood mononuclear cells (PBMCs) and liver tissue samples from participants with alcohol-associated hepatitis (AH), alcohol-associated cirrhosis (AC), non-alcohol-associated fatty liver disease, or chronic HCV infection, and from healthy controls. We performed RNA sequencing (RNA-seq) on 137 PBMC samples and 67 liver tissue samples. Using the gene expression data, we implemented a machine learning feature selection and classification pipeline to identify diagnostic biomarkers that distinguish between the liver disease groups. The liver tissue results were validated using an independent public RNA-seq dataset. The biomarkers were computationally validated for biological relevance using pathway analysis tools.
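As an illustration of what a feature selection and classification pipeline of this kind can look like, the following minimal sketch ranks genes by information gain (IG) and classifies samples with k-nearest neighbors (kNN), two techniques named in the keyword list. It is not the study's actual pipeline: the abstract does not give those details, so the median binarization, the leave-one-out evaluation, the function names, and the toy expression values are all assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """IG of one gene after splitting its expression at the (upper) median.
    Assumption: a simple median split; the study's discretization is not stated."""
    median = sorted(values)[len(values) // 2]
    high = [y for v, y in zip(values, labels) if v >= median]
    low = [y for v, y in zip(values, labels) if v < median]
    conditional = sum(len(part) / len(labels) * entropy(part)
                      for part in (high, low) if part)
    return entropy(labels) - conditional

def top_genes(expr, labels, n_genes):
    """Return the n_genes gene names with the highest information gain."""
    scores = {gene: information_gain(vals, labels) for gene, vals in expr.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_genes]

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(expr, labels, n_genes, k=3):
    """Leave-one-out accuracy; genes are re-selected inside each fold so the
    held-out sample never influences feature selection."""
    n = len(labels)
    hits = 0
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        train_expr = {g: [v[j] for j in keep] for g, v in expr.items()}
        train_y = [labels[j] for j in keep]
        genes = top_genes(train_expr, train_y, n_genes)
        train_x = [[expr[g][j] for g in genes] for j in keep]
        query = [expr[g][i] for g in genes]
        hits += knn_predict(train_x, train_y, query, k) == labels[i]
    return hits / n

# Toy data: invented expression values, NOT taken from the study.
labels = ["AH"] * 3 + ["AC"] * 3 + ["healthy"] * 3
expr = {
    "gene_A": [9.1, 8.7, 9.4, 5.2, 4.9, 5.5, 1.0, 1.3, 0.8],
    "gene_B": [7.5, 7.9, 7.2, 7.7, 7.4, 8.0, 2.1, 1.8, 2.4],
    "gene_C": [3.0, 5.1, 4.2, 4.8, 3.3, 5.0, 4.1, 3.2, 4.9],
    "gene_D": [2.2, 2.0, 2.1, 2.3, 1.9, 2.2, 2.0, 2.1, 2.3],
}

if __name__ == "__main__":
    print("selected genes:", top_genes(expr, labels, 2))
    print("leave-one-out accuracy:", loo_accuracy(expr, labels, n_genes=2))
```

Re-selecting genes inside each fold is a deliberate choice in this sketch: ranking genes on all samples before cross-validation would leak information from the held-out sample and inflate the accuracy estimate.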

Results: Using liver tissue RNA-seq data and a set of 33 genes, we distinguished between AH, AC, and healthy conditions with overall accuracies of 90% in our dataset and 82% in the independent dataset. Distinguishing 4 liver conditions and healthy controls yielded 91% overall accuracy in our liver tissue dataset with 39 genes, and 75% overall accuracy in our PBMC dataset with 75 genes.
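For readers unfamiliar with the metric, overall accuracy in a multiclass setting is the fraction of all samples assigned to their true class, i.e. the diagonal of the confusion matrix divided by the total count. The short sketch below shows that arithmetic together with per-class recall. The confusion matrix is an invented toy example and does not reproduce any result from the study; the function names are hypothetical.

```python
def overall_accuracy(confusion):
    """Correctly classified samples (diagonal) divided by all samples."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

def per_class_recall(confusion, classes):
    """Fraction of each true class (one row per class) predicted correctly."""
    return {name: confusion[i][i] / sum(confusion[i])
            for i, name in enumerate(classes)}

# Rows are true classes, columns are predicted classes.
# Toy counts for illustration only, NOT data from the study.
classes = ["AH", "AC", "healthy"]
toy_confusion = [
    [8, 2, 0],   # true AH
    [1, 8, 1],   # true AC
    [0, 0, 10],  # true healthy
]

if __name__ == "__main__":
    print("overall accuracy:", overall_accuracy(toy_confusion))
    print("per-class recall:", per_class_recall(toy_confusion, classes))
```

Reporting per-class recall alongside overall accuracy is useful when classes are unbalanced, since a high overall figure can hide poor performance on a small disease group.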

Conclusions: Our machine learning pipeline was effective at identifying a small set of diagnostic gene biomarkers and classifying several liver diseases using RNA-seq data from liver tissue and PBMCs. The methodologies implemented and genes identified in this study may facilitate future efforts toward a liquid biopsy diagnostic for liver diseases.

Lay summary: Distinguishing between inflammatory liver diseases without multiple tests can be challenging due to their clinically similar characteristics. To lay the groundwork for the development of a non-invasive blood-based diagnostic across a range of liver diseases, we compared samples from participants with alcohol-associated hepatitis, alcohol-associated cirrhosis, chronic hepatitis C infection, and non-alcohol-associated fatty liver disease. We used a machine learning computational approach to demonstrate that gene expression data generated from either liver tissue or blood samples can be used to discover a small set of gene biomarkers for effective diagnosis of these liver diseases.

Keywords: Classification; RNA sequencing; alcohol-associated liver disease; biomarker discovery.

Abbreviations: AC, alcohol-associated cirrhosis; AH, alcohol-associated hepatitis; AKR1B10, aldo-keto reductase family 1 member B10; BTM, blood transcription module; DE, differential expression; FPKM, fragments per kilobase of exon model per million reads mapped; GSEA, gene set-enrichment analysis; IG, information gain; IPA, Ingenuity Pathway Analysis; kNN, k-nearest neighbors; LR, logistic regression; LTCDS, liver tissue cell distribution system; LV, liver tissue; ML, machine learning; MMP, matrix metalloproteases; NAFLD, non-alcohol-associated fatty liver disease; PBMCs, peripheral blood mononuclear cells; RNA-seq, RNA sequencing; SCAHC, Southern California Alcoholic Hepatitis Consortium; SVM, support vector machine; TNF, tumor necrosis factor.