CNN-Peaks: ChIP-Seq peak detection pipeline using convolutional neural networks that imitate human visual inspection

Dongpin Oh; J Seth Strattan; Junho K Hur; José Bento; Alexander Eckehart Urban; Giltae Song; J Michael Cherry

doi:10.1038/s41598-020-64655-4

Sci Rep. 2020 May 13;10(1):7933. doi: 10.1038/s41598-020-64655-4.

Authors

Dongpin Oh¹, J Seth Strattan², Junho K Hur³, José Bento⁴, Alexander Eckehart Urban², Giltae Song⁵, J Michael Cherry²

Affiliations

¹ School of Computer Science and Engineering, Pusan National University, Busan, 46241, South Korea.
² Department of Genetics, Stanford University, Stanford, 94305, USA.
³ School of Medicine, Kyung Hee University, Seoul, 02447, South Korea.
⁴ Department of Computer Science, Boston College, Chestnut Hill, MA, 02467, USA.
⁵ School of Computer Science and Engineering, Pusan National University, Busan, 46241, South Korea. gsong@pusan.ac.kr.

Abstract

ChIP-seq is one of the core experimental resources available to understand genome-wide epigenetic interactions and identify the functional elements associated with diseases. The analysis of ChIP-seq data is important but poses a difficult computational challenge because of irregular noise and bias at various levels. Although many peak-calling methods have been developed, current computational tools still require, in some cases, manual inspection by researchers using data visualization. However, the huge volumes of ChIP-seq data make it almost impossible for human researchers to manually uncover all the peaks. Recently developed convolutional neural networks (CNNs), which are capable of achieving human-like classification accuracy, can be applied to this challenging problem. In this study, we design a novel supervised learning approach for identifying ChIP-seq peaks using CNNs, and integrate it into a software pipeline called CNN-Peaks. As training data for our model, we use data labeled by human researchers, who annotate the presence or absence of peaks in selected genomic segments. The trained model is then applied to predict peaks in previously unseen genomic segments from multiple ChIP-seq datasets, including benchmark datasets commonly used to validate peak-calling methods. We observe performance superior to that of previous methods.
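
The supervised setup described above can be illustrated with a minimal sketch. It is not the CNN-Peaks implementation: the label tuple layout, the bin size, the function names, and the hand-picked kernel are all assumptions made for illustration, since the abstract does not specify the data format or the network architecture. The sketch shows how human annotations of peak presence or absence could be turned into per-bin training targets aligned with a binned read-depth signal, and what a single 1D convolution (the basic operation a CNN layer learns) does to that signal.

```python
from typing import List, Tuple

# Hypothetical label record: (start, end, has_peak), end-exclusive,
# in base-pair coordinates relative to the segment start.
Label = Tuple[int, int, bool]


def bin_coverage(depth: List[int], bin_size: int) -> List[float]:
    """Average per-base read depth into fixed-width bins."""
    bins = []
    for i in range(0, len(depth), bin_size):
        window = depth[i:i + bin_size]
        bins.append(sum(window) / len(window))
    return bins


def bin_labels(labels: List[Label], n_bins: int, bin_size: int) -> List[int]:
    """Per-bin targets: 1 = peak, 0 = no peak, -1 = unlabeled (to be masked out)."""
    targets = [-1] * n_bins
    for start, end, has_peak in labels:
        first = start // bin_size
        last = min(n_bins, -(-end // bin_size))  # ceiling division
        for b in range(first, last):
            targets[b] = 1 if has_peak else 0
    return targets


def conv1d(signal: List[float], kernel: List[float]) -> List[float]:
    """Valid-mode 1D cross-correlation; a CNN would learn the kernel weights."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Toy segment: 30 bases with an enriched region at positions 10-19.
depth = [0] * 10 + [5] * 10 + [0] * 10
# Assumed annotations: first 10 bases "no peak", next 10 "peak", rest unlabeled.
labels = [(0, 10, False), (10, 20, True)]

coverage = bin_coverage(depth, bin_size=5)
targets = bin_labels(labels, n_bins=len(coverage), bin_size=5)
response = conv1d(coverage, [-1.0, 2.0, -1.0])  # hand-picked kernel, not learned
```

In an actual training loop, bins marked -1 would be excluded from the loss, so the model is fitted only on regions a researcher has inspected.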

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Binding Sites
Chromatin Immunoprecipitation Sequencing* / methods
Computational Biology / methods*
Databases, Nucleic Acid
Epigenesis, Genetic
Epigenomics / methods
Histones / metabolism
Humans
Neural Networks, Computer*
Nucleotide Motifs
Protein Binding
Software*
Transcription Initiation Site

Substances

Histones