Using combined evidence from replicates to evaluate ChIP-seq peaks

Bioinformatics. 2015 Sep 1;31(17):2761-9. doi: 10.1093/bioinformatics/btv293. Epub 2015 May 7.

Abstract

Motivation: Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) detects genome-wide DNA-protein interactions and chromatin modifications, returning enriched regions (ERs), usually associated with a significance score. Moderately significant interactions can correspond to true, weak interactions, or to false positives; replicates of a ChIP-seq experiment can provide co-localised evidence to decide between the two cases. We designed a general methodological framework to rigorously combine the evidence of ERs in ChIP-seq replicates, with the option to set a significance threshold on the repeated evidence and a minimum number of samples bearing this evidence.
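
The idea of combining co-localised evidence across replicates can be illustrated with a minimal sketch. The abstract does not name the statistical rule used to combine evidence, so this sketch assumes Fisher's combined probability test on the p-values of overlapping ERs, treating replicates as independent. The function names, the threshold of 1e-8 and the minimum of 2 replicates are hypothetical choices for illustration, not values taken from the paper.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method (an assumed rule).

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the survival function has a closed form:
        P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term = 1.0   # (x/2)^i / i! for i = 0
    total = 1.0  # running sum of the series
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def confirm_region(pvalues, combined_threshold=1e-8, min_replicates=2):
    """Decide whether overlapping ERs jointly pass both user-set criteria.

    pvalues: p-values of the ERs that overlap this region, one per
    replicate in which an ER was called (hypothetical input layout).
    """
    if len(pvalues) < min_replicates:
        return False
    return fisher_combined_pvalue(pvalues) <= combined_threshold


# Two moderately significant ERs (p = 1e-5 each) at the same locus
print(confirm_region([1e-5, 1e-5]))  # combined evidence passes 1e-8
print(confirm_region([1e-5]))        # too few replicates carry evidence
```

In this sketch, neither ER alone would pass a 1e-8 threshold, but their combined p-value does, which is the sense in which a moderately significant ER can be 'rescued' by a replicate.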

Results: We applied our method to Myc transcription factor ChIP-seq datasets in K562 cells available in the ENCODE project. Using replicates, we could obtain up to 3 times as many ERs as in single-sample analysis with an equivalent significance threshold. We validated the 'rescued' ERs by checking for overlap with open chromatin regions and for enrichment of the motif that Myc binds with strongest affinity. We compared our results with alternative methods (IDR and jMOSAiCS), obtaining more validated peaks than the former and fewer peaks than the latter, but with better validation.

Availability and implementation: An implementation of the proposed method and its source code under GPLv3 license are freely available at http://www.bioinformatics.deib.polimi.it/MSPC/ and http://mspc.codeplex.com/, respectively.

Contact: marco.morelli@iit.it

Supplementary information: Supplementary Material is available at Bioinformatics online.

Publication types

  • Evaluation Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Chromatin / genetics
  • Chromatin / metabolism*
  • Chromatin Immunoprecipitation / methods*
  • Computational Biology / methods
  • Data Interpretation, Statistical
  • Gene Expression Regulation
  • Genome, Human*
  • High-Throughput Nucleotide Sequencing*
  • Humans
  • K562 Cells
  • Nucleotide Motifs / genetics
  • Protein Binding
  • Protein Structure, Tertiary
  • Proto-Oncogene Proteins c-myc / genetics
  • Proto-Oncogene Proteins c-myc / metabolism
  • Quality Control
  • Reproducibility of Results
  • Sequence Analysis, DNA
  • Software
  • Transcription Factors / metabolism*
  • Ubiquitin-Protein Ligases / genetics
  • Ubiquitin-Protein Ligases / metabolism*

Substances

  • Chromatin
  • MYC protein, human
  • Proto-Oncogene Proteins c-myc
  • Transcription Factors
  • STUB1 protein, human
  • Ubiquitin-Protein Ligases