Integration of Multimodal Data from Disparate Sources for Identifying Disease Subtypes

Kaiyue Zhou; Bhagya Shree Kottoori; Seeya Awadhut Munj; Zhewei Zhang; Sorin Draghici; Suzan Arslanturk

doi:10.3390/biology11030360

Biology (Basel). 2022 Feb 24;11(3):360. doi: 10.3390/biology11030360.

Authors

Kaiyue Zhou¹,², Bhagya Shree Kottoori¹, Seeya Awadhut Munj¹, Zhewei Zhang², Sorin Draghici¹,³, Suzan Arslanturk¹

Affiliations

¹ Department of Computer Science, Wayne State University, Detroit, MI 48201, USA.
² Department of Electronic Engineering, Tsinghua University, Beijing 100084, China.
³ Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI 48201, USA.

Abstract

Studies over the past decade have generated a wealth of molecular data that can be leveraged to better understand cancer risk, progression, and outcomes. However, because the disease is heterogeneous, analyzing data from a single modality is not enough to understand progression risk or to differentiate long- and short-term survivors. Using a scientifically developed and tested deep-learning approach that leverages aggregate information collected from multiple repositories with multiple modalities (e.g., mRNA, DNA methylation, miRNA) could lead to a more accurate and robust prediction of disease progression. Here, we propose an autoencoder-based multimodal data fusion system in which a fusion encoder flexibly integrates the collective information available through multiple studies with partially coupled data. Our results on a fully controlled simulation-based study show that inferring the missing data through the proposed data fusion pipeline yields a predictor that is superior to baseline predictors with missing modalities. Results further show that short- and long-term survivors of glioblastoma multiforme, acute myeloid leukemia, and pancreatic adenocarcinoma can be successfully differentiated with AUCs of 0.94, 0.75, and 0.96, respectively.
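
The idea of inferring a missing modality from partially coupled data can be illustrated with a minimal sketch. This is not the authors' autoencoder: a ridge-regression map stands in for the learned fusion encoder, the data are synthetic, and every name below (fit_linear_imputer, impute_missing, fuse) is hypothetical. It only shows the general pattern of learning a cross-modality mapping on coupled samples, imputing the missing modality for the remaining samples, and concatenating the result into a fused feature matrix.

```python
import numpy as np


def fit_linear_imputer(x_coupled, y_coupled, ridge=1e-3):
    """Fit a ridge-regression map from modality X to modality Y.

    Only samples that have BOTH modalities (the coupled ones) are used.
    This linear map is an assumed stand-in for a learned encoder/decoder.
    """
    x_mean = x_coupled.mean(axis=0)
    y_mean = y_coupled.mean(axis=0)
    xc = x_coupled - x_mean
    yc = y_coupled - y_mean
    gram = xc.T @ xc + ridge * np.eye(xc.shape[1])
    weights = np.linalg.solve(gram, xc.T @ yc)
    return x_mean, y_mean, weights


def impute_missing(x_only, params):
    """Predict the missing modality Y for samples that only have X."""
    x_mean, y_mean, weights = params
    return (x_only - x_mean) @ weights + y_mean


def fuse(x, y):
    """Concatenate two modalities feature-wise into one matrix."""
    return np.concatenate([x, y], axis=1)


# Synthetic data: both modalities are noisy views of a shared latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
x_all = latent @ rng.normal(size=(5, 20)) + 0.01 * rng.normal(size=(200, 20))
y_all = latent @ rng.normal(size=(5, 30)) + 0.01 * rng.normal(size=(200, 30))

# The first 150 samples are coupled; the last 50 are missing modality Y.
coupled = slice(0, 150)
uncoupled = slice(150, 200)

params = fit_linear_imputer(x_all[coupled], y_all[coupled])
y_imputed = impute_missing(x_all[uncoupled], params)
fused = fuse(x_all[uncoupled], y_imputed)

# Compare the imputed modality with the held-back ground truth.
relative_error = (
    np.linalg.norm(y_imputed - y_all[uncoupled])
    / np.linalg.norm(y_all[uncoupled])
)
print(fused.shape, relative_error)
```

In this toy setting the held-back modality is recoverable because both modalities depend on the same latent factors; real omics data would need a nonlinear model and proper validation.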

Keywords: cancer progression; deep learning; imputation; multimodal data fusion.
