HCMB: A stable and efficient algorithm for processing the normalization of highly sparse Hi-C contact data

Honglong Wu; Xuebin Wang; Mengtian Chu; Dongfang Li; Lixin Cheng; Ke Zhou

doi:10.1016/j.csbj.2021.04.064

Comput Struct Biotechnol J. 2021 Apr 27:19:2637-2645. doi: 10.1016/j.csbj.2021.04.064. eCollection 2021.

Authors

Honglong Wu¹,², Xuebin Wang², Mengtian Chu², Dongfang Li¹,², Lixin Cheng³, Ke Zhou¹

Affiliations

¹ Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, Hubei 430000, China.
² BGI PathoGenesis Pharmaceutical Technology, BGI-Shenzhen, Shenzhen 518083, China.
³ Shenzhen People's Hospital, First Affiliated Hospital of Southern University of Science and Technology, Second Clinical Medicine College of Jinan University, Shenzhen 518020, China.

Abstract

High-throughput genome-wide chromosome conformation capture (Hi-C) has become an important tool for studying chromosomal interactions, from which one can extract biologically meaningful information including the P(s) curve, topologically associated domains, A/B compartments, and other biologically relevant signals. Normalization is a critical pre-processing step for downstream analyses: it removes systematic and technical biases from chromatin contact matrices that arise from differences in mappability, GC content, and restriction fragment length. High sparsity in particular poses a major challenge for this correction, which calls for a stable and efficient method for Hi-C data normalization. Several matrix balancing methods, such as the Knight-Ruiz (KR) algorithm, have recently been developed to normalize Hi-C data, but KR fails to normalize contact matrices with high sparsity. Here, we present Hi-C Matrix Balancing (HCMB), an algorithm based on an iterative solution of equations combined with a linear search and projection strategy, to normalize raw Hi-C interaction data. Results on both simulated and experimental data demonstrate that HCMB is robust and efficient in normalizing Hi-C data and that it preserves biologically relevant Hi-C features even at very high sparsity. HCMB is implemented in Python and is freely available to non-commercial users on GitHub: https://github.com/HUST-DataMan/HCMB.

Keywords: Doubly stochastic matrix; Hi-C; Matrix balancing; Normalization; Sparsity.
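
To make the matrix balancing problem named in the abstract concrete, the sketch below scales a symmetric, non-negative contact matrix A so that every non-empty row and column sums to 1, i.e. it looks for a vector x such that diag(x) A diag(x) is doubly stochastic. This is not the HCMB algorithm and not the KR algorithm; it is a generic damped fixed-point iteration, chosen only for illustration. The function name `balance_symmetric`, its parameters, and the choice to skip all-zero rows are assumptions introduced here and do not come from the paper or its repository.

```python
import numpy as np


def balance_symmetric(A, tol=1e-10, max_iter=1000):
    """Illustrative symmetric matrix balancing (hypothetical helper, not HCMB).

    Finds x such that diag(x) @ A @ diag(x) has unit row sums on every
    row that contains at least one non-zero entry.
    """
    A = np.asarray(A, dtype=float)
    x = np.ones(A.shape[0])
    # Assumption: rows with no contacts cannot be balanced and are skipped.
    keep = A.sum(axis=1) > 0
    if not keep.any():
        return A.copy(), x
    for _ in range(max_iter):
        # Row sums of the currently scaled matrix diag(x) A diag(x).
        row_sums = x * (A @ x)
        if np.max(np.abs(row_sums[keep] - 1.0)) < tol:
            break
        # Damped update: divide each scaling factor by sqrt(row sum).
        x[keep] /= np.sqrt(row_sums[keep])
    balanced = x[:, None] * A * x[None, :]
    return balanced, x


# Small symmetric example with one empty row/column (a toy stand-in for sparsity).
A = np.array([
    [4.0, 1.0, 2.0, 0.0],
    [1.0, 3.0, 1.0, 0.0],
    [2.0, 1.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
balanced, x = balance_symmetric(A)
print(balanced.sum(axis=1))
```

On real Hi-C matrices the difficulty described in the abstract is that many rows are nearly empty, so a simple iteration like this one may converge slowly or not at all; that regime is what the paper's method is designed to address.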