Single-cell specific and interpretable machine learning models for sparse scChIP-seq data imputation

Steffen Albrecht; Tommaso Andreani; Miguel A Andrade-Navarro; Jean Fred Fontaine

doi:10.1371/journal.pone.0270043

PLoS One. 2022 Jul 1;17(7):e0270043. doi: 10.1371/journal.pone.0270043. eCollection 2022.

Authors

Steffen Albrecht¹, Tommaso Andreani¹ ², Miguel A Andrade-Navarro¹, Jean Fred Fontaine¹

Affiliations

¹ Institute of Organismic and Molecular Evolution (iOME), Faculty of Biology, Johannes Gutenberg University Mainz, Mainz, Germany.
² Institute of Molecular Biology, Mainz, Germany.

Abstract

Motivation: Single-cell Chromatin ImmunoPrecipitation DNA-Sequencing (scChIP-seq) analysis is challenging due to data sparsity. A high degree of sparsity in biological high-throughput single-cell data is generally handled with imputation methods that complete the data, but methods specific to scChIP-seq are lacking. We present SIMPA, a scChIP-seq data imputation method that leverages predictive information within bulk data from the ENCODE project to impute missing protein-DNA interacting regions of target histone marks or transcription factors.

Results: Imputations using machine learning models trained for each single cell, each ChIP protein target, and each genomic region accurately preserve cell type clustering and improve pathway-related gene identification on real human data. Results on bulk data simulating single cells show that the imputations are single-cell specific, as the imputed profiles are closer to the simulated cell than to other cells related to the same ChIP protein target and the same cell type. Simulations also show that 100 input genomic regions are enough to train single-cell specific models for the imputation of thousands of undetected regions. Furthermore, SIMPA enables the interpretation of machine learning models by revealing the interaction sites of a given single cell that are most important for the imputation model trained for a specific genomic region. The corresponding feature importance values derived from promoter-interaction profiles of H3K4me3, an activating histone mark, highly correlate with co-expression of genes that are present within the cell-type-specific pathways in two real human and mouse datasets. SIMPA's interpretable imputation method allows users to gain a deep understanding of individual cells and, consequently, of sparse scChIP-seq datasets.
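The idea of one model per single cell and per genomic region can be illustrated with a minimal sketch. This is not SIMPA's implementation: the abstract does not name the model family, so a small Bernoulli naive Bayes classifier is swapped in as a stand-in, and the function name `impute_region`, the region identifiers, and the toy bulk profiles are all hypothetical. It assumes that the regions detected in the single cell serve as features, that bulk experiments serve as training instances, and that the model is queried with every feature set to "present", since the cell has all of its own regions. The per-feature log-ratios play the role of feature importance values.

```python
from math import exp, log


def impute_region(bulk, cell_regions, target):
    """Score one candidate region for one single cell (illustrative stand-in).

    bulk         -- list of sets; each set holds the region ids bound in one
                    bulk experiment (hypothetical toy format)
    cell_regions -- region ids detected in the single cell, used as features
    target       -- region id whose presence in the cell is to be imputed

    Returns (probability, importance), where importance maps each feature to
    its log-ratio contribution under a Laplace-smoothed naive Bayes model.
    """
    # Label each bulk experiment: does it contain the target region?
    labels = [target in regions for regions in bulk]
    n_pos = sum(labels)
    n_neg = len(bulk) - n_pos

    # Smoothed prior log-odds of the target being present.
    log_odds = log((n_pos + 1) / (n_neg + 1))

    importance = {}
    for feature in cell_regions:
        pos_hits = sum(1 for regions, y in zip(bulk, labels)
                       if y and feature in regions)
        neg_hits = sum(1 for regions, y in zip(bulk, labels)
                       if not y and feature in regions)
        p_pos = (pos_hits + 1) / (n_pos + 2)  # P(feature present | target present)
        p_neg = (neg_hits + 1) / (n_neg + 2)  # P(feature present | target absent)
        # The cell has every one of its own regions, so each feature is "present".
        importance[feature] = log(p_pos / p_neg)
        log_odds += importance[feature]

    return 1.0 / (1.0 + exp(-log_odds)), importance


# Hypothetical bulk profiles (one set of bound regions per experiment).
bulk = [
    {"r1", "r2", "r3"},
    {"r1", "r2", "r3"},
    {"r1", "r3"},
    {"r4"},
    {"r2", "r4"},
    {"r4", "r5"},
]
cell = ["r1", "r2"]  # sparse single-cell profile

# Candidate regions: seen in bulk data but not detected in the cell.
candidates = sorted(set().union(*bulk) - set(cell))
scores = {region: impute_region(bulk, cell, region)[0] for region in candidates}

for region in sorted(scores, key=scores.get, reverse=True):
    print(region, round(scores[region], 3))
```

Because each candidate region gets its own model built from the cell's own detected regions, two cells with different sparse profiles would receive different imputed scores from the same bulk data, which is the sense in which the abstract calls the imputation single-cell specific.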

Availability and implementation: Our interpretable imputation algorithm was implemented in Python and is available at https://github.com/salbrec/SIMPA.

MeSH terms

Animals
Cluster Analysis
DNA
Genomics*
Machine Learning*
Mice
Sequence Analysis, DNA / methods

Substances

DNA

Grants and funding

The author(s) received no specific funding for this work.