K2Mem: Discovering Discriminative K-mers From Sequencing Data for Metagenomic Reads Classification

IEEE/ACM Trans Comput Biol Bioinform. 2022 Jan-Feb;19(1):220-229. doi: 10.1109/TCBB.2021.3117406. Epub 2022 Feb 3.

Abstract

The major problem when analyzing a metagenomic sample is taxonomically annotating its reads to identify the species it contains. Most currently available methods classify reads using a set of reference genomes and their k-mers. While the precision of these methods is close to perfect, their recall (the fraction of reads that are actually classified) falls to around 50%. One reason is that the sequences in a sample can differ substantially from the corresponding reference genome; viral genomes, for example, are highly mutated. To address this issue, in this paper we study the problem of metagenomic read classification by improving the reference k-mer library with novel discriminative k-mers drawn from the input sequencing reads. We evaluated the performance under different conditions against several other tools, and the results showed an improved F-measure, especially when close reference genomes are not available. Availability: https://github.com.
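
The general idea of enriching a reference k-mer library from the reads themselves can be illustrated with a minimal Python sketch. This is not the K2Mem implementation: the function names, the majority-vote classifier, and the `min_count` threshold are all assumptions made for illustration. The sketch classifies reads against k-mers unique to one reference, then adopts k-mers that are absent from the library and appear only in reads assigned to a single species.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_library(references, k):
    """Map each k-mer to a species, keeping only k-mers found in one species."""
    owners = defaultdict(set)
    for species, genome in references.items():
        for km in kmers(genome, k):
            owners[km].add(species)
    return {km: next(iter(sp)) for km, sp in owners.items() if len(sp) == 1}


def classify(read, library, k):
    """Return the species with the most k-mer hits, or None if no hit or a tie."""
    votes = Counter(library[km] for km in kmers(read, k) if km in library)
    ranked = votes.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def augment_library(reads, library, k, min_count=2):
    """Add k-mers missing from the library that occur only in reads of one species.

    min_count is a hypothetical support threshold: a candidate k-mer must be
    seen in at least that many classified reads before it is adopted.
    """
    seen = defaultdict(Counter)
    for read in reads:
        species = classify(read, library, k)
        if species is None:
            continue
        for km in set(kmers(read, k)):
            if km not in library:
                seen[km][species] += 1
    augmented = dict(library)
    for km, counts in seen.items():
        if len(counts) == 1:
            species, n = next(iter(counts.items()))
            if n >= min_count:
                augmented[km] = species
    return augmented
```

In this toy setting, a read that shares no k-mer with any reference stays unclassified under the original library but can be assigned once the library has absorbed k-mers from related reads, which is how recall can rise when the sample diverges from its reference. A real tool would also need to handle reverse complements, sequencing errors, and taxonomic levels above species, none of which are modeled here.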

MeSH terms

  • Algorithms
  • Gene Library
  • Genome, Viral
  • High-Throughput Nucleotide Sequencing*
  • Metagenome / genetics
  • Metagenomics*
  • Sequence Analysis, DNA
  • Software