Parallel Protein Community Detection in Large-scale PPI Networks Based on Multi-source Learning

IEEE/ACM Trans Comput Biol Bioinform. 2018 Aug 31. doi: 10.1109/TCBB.2018.2868088. Online ahead of print.

Abstract

Protein interactions are a fundamental building block of almost every life activity. Identifying protein communities in Protein-Protein Interaction (PPI) networks is essential for understanding the principles of cellular organization and exploring the causes of various diseases. To identify reliable, biologically significant protein communities and to improve the performance of community detection methods on large-scale PPI networks, it is critical to integrate multiple data resources. In this paper, we propose a Multi-source Learning based Protein Community Detection (MLPCD) algorithm that integrates Gene Expression Data (GED), together with a parallel solution of MLPCD built on cloud computing technology. GED collected under different conditions is integrated with the original PPI network to reconstruct a Weighted-PPI (WPPI) network. To flexibly identify protein communities of different scales, we define community modularity and functional cohesion measurements and use them to detect protein communities in the WPPI network. In addition, we compare the detected communities with known protein complexes and evaluate the functional enrichment of the protein functional modules using Gene Ontology annotations. We implement a parallel version of MLPCD on the Apache Spark platform to improve the performance of the algorithm. Extensive experimental results indicate that MLPCD outperforms related algorithms in terms of both accuracy and performance.
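The abstract does not state how gene expression data is turned into edge weights. As a minimal illustrative sketch only, the following assumes each PPI edge is weighted by the absolute Pearson correlation between the expression profiles of its two proteins, a common choice that may differ from the weighting MLPCD actually uses. The names `pearson` and `build_wppi`, the protein identifiers, and the expression values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant profile: correlation undefined, treated as 0 here
    return cov / sqrt(vx * vy)


def build_wppi(ppi_edges, expression):
    """Weight each PPI edge by |Pearson correlation| of its endpoints' profiles.

    Assumption (not from the paper): edges whose endpoints lack
    expression data are dropped rather than given a default weight.
    """
    weighted = {}
    for u, v in ppi_edges:
        if u in expression and v in expression:
            weighted[(u, v)] = abs(pearson(expression[u], expression[v]))
    return weighted


# Toy data: hypothetical proteins, each measured under four conditions.
ppi_edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P4")]
expression = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.0, 6.0, 8.0],
    "P3": [1.0, 3.0, 2.0, 4.0],
}
wppi = build_wppi(ppi_edges, expression)
for edge, weight in wppi.items():
    print(edge, round(weight, 3))
```

In this toy example the edge ("P1", "P4") is dropped because P4 has no expression profile. A community detection step would then operate on the weighted edges in `wppi` instead of the unweighted PPI network.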