A Pipeline for Integrated Theory and Data-Driven Modeling of Biomedical Data

Vineet K Raghu; Xiaoyu Ge; Arun Balajiee; Daniel J Shirer; Isha Das; Panayiotis V Benos; Panos K Chrysanthis

doi:10.1109/TCBB.2020.3019237

IEEE/ACM Trans Comput Biol Bioinform. 2021 May-Jun;18(3):811-822. doi: 10.1109/TCBB.2020.3019237. Epub 2021 Jun 3.

Abstract

Genome sequencing technologies have the potential to transform clinical decision making and biomedical research by enabling high-throughput measurements of the genome at a granular level. However, to truly understand mechanisms of disease and predict the effects of medical interventions, high-throughput data must be integrated with demographic, phenotypic, environmental, and behavioral data from individuals. Further, effective knowledge discovery methods must infer relationships between these data types. We recently proposed a pipeline (CausalMGM) to achieve this. CausalMGM uses probabilistic graphical models to infer the relationships between variables in the data; however, CausalMGM's graphical structure learning algorithm can only handle small datasets efficiently. We propose a new methodology (piPref-Div) that selects the most informative variables for CausalMGM, enabling it to scale. We validate the efficacy of piPref-Div against other feature selection methods and demonstrate how the use of the full pipeline improves breast cancer outcome prediction and provides biologically interpretable views of gene expression data.
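The idea of selecting a small set of informative variables before structure learning can be illustrated with a generic sketch. The code below is not piPref-Div, whose details are not given in this abstract. It is a simple greedy filter that assumes relevance and redundancy are both measured by absolute Pearson correlation. The names `select_features`, `max_redundancy`, and the toy gene data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(features, target, k, max_redundancy=0.9):
    """Greedily pick up to k variables that correlate with the target
    while skipping any variable too correlated with one already chosen.

    features: dict mapping variable name -> list of values (hypothetical layout).
    """
    # Rank variables by relevance to the target, most relevant first.
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    selected = []
    for name in ranked:
        if len(selected) == k:
            break
        # Keep the candidate only if it is not redundant with earlier picks.
        if all(
            abs(pearson(features[name], features[s])) < max_redundancy
            for s in selected
        ):
            selected.append(name)
    return selected


# Toy data (made up for illustration): gene_a_copy is a rescaled gene_a,
# gene_c is constant and therefore uninformative.
target = [1, 2, 3, 4, 5, 6]
features = {
    "gene_a": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    "gene_a_copy": [2.2, 4.0, 5.8, 8.4, 10.2, 11.8],
    "gene_b": [3, 1, 4, 2, 6, 5],
    "gene_c": [5, 5, 5, 5, 5, 5],
}
selected = select_features(features, target, k=2)
print(selected)
```

In this sketch, only one of the two near-identical variables survives, so the reduced variable set handed to a downstream structure learner stays both relevant and diverse. How the actual pipeline scores relevance and diversity may differ.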

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Algorithms
Breast Neoplasms / genetics
Computational Biology / methods*
Computer Graphics*
Databases, Factual*
Female
Gene Expression Profiling / methods*
Humans
Phenotype
