MetaProb 2: Metagenomic Reads Binning Based on Assembly Using Minimizers and K-Mers Statistics

J Comput Biol. 2021 Nov;28(11):1052-1062. doi: 10.1089/cmb.2021.0270. Epub 2021 Aug 26.

Abstract

Current technologies allow the sequencing of microbial communities directly from the environment without prior culturing. One of the major problems when analyzing a microbial sample is to taxonomically annotate its reads to identify the species it contains. The major difficulties of taxonomic analysis are the lack of taxonomically related genomes in existing reference databases, the uneven abundance ratio of species, and sequencing errors. Microbial communities can be studied with reads clustering, a process referred to as genome binning. In this study, we present MetaProb 2, an unsupervised genome binning method based on reads assembly and probabilistic k-mers statistics. The novelties of MetaProb 2 are the use of minimizers to efficiently assemble reads into unitigs and a community detection algorithm based on graph modularity to cluster unitigs and to detect representative unitigs. The effectiveness of MetaProb 2 is demonstrated on both simulated and real datasets in comparison with state-of-the-art binning tools such as MetaProb, AbundanceBin, Bimeta, and MetaCluster. On real datasets, it is the only one of these tools capable of producing promising results while being parsimonious with computational resources.
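As an illustration of the minimizer idea mentioned above, the following is a minimal Python sketch and not the MetaProb 2 implementation. It assumes the common definition of a minimizer as the lexicographically smallest length-m substring of a k-mer, and it ignores reverse complements and sequencing errors. The function names `kmer_minimizer` and `bucket_by_minimizer` and the parameters `k` and `m` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def kmer_minimizer(kmer: str, m: int) -> str:
    """Return the lexicographically smallest length-m substring of kmer.

    Assumption: plain lexicographic order on A/C/G/T; real tools often
    use a hashed or randomized ordering instead.
    """
    if not 0 < m <= len(kmer):
        raise ValueError("m must satisfy 0 < m <= len(kmer)")
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def bucket_by_minimizer(reads, k: int, m: int):
    """Group the distinct k-mers of all reads by their minimizer.

    K-mers that overlap heavily tend to share a minimizer, so each bucket
    can be processed independently, which is one reason minimizers are
    useful when building unitigs.
    """
    buckets = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            buckets[kmer_minimizer(kmer, m)].add(kmer)
    return dict(buckets)


if __name__ == "__main__":
    # Toy input; real reads are far longer and k is typically much larger.
    for minimizer, kmers in sorted(bucket_by_minimizer(["ACGTAC"], k=4, m=2).items()):
        print(minimizer, sorted(kmers))
```

In this toy run the overlapping k-mers of a single read fall into only two buckets, which shows how minimizers partition the k-mer set into smaller groups that can be handled separately.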

Keywords: k-mers statistics; metagenomic binning; minimizers.

MeSH terms

  • Algorithms
  • Computational Biology / methods*
  • Data Mining
  • Databases, Genetic
  • Metagenomics / methods*
  • Unsupervised Machine Learning