(m, n)-mer: a simple statistical feature for sequence classification

Amanda Araújo Serrão de Andrade; Marco Grivet; Otávio Brustolini; Ana Tereza Ribeiro Vasconcelos

doi:10.1093/bioadv/vbad088

Bioinform Adv. 2023 Jul 11;3(1):vbad088. doi: 10.1093/bioadv/vbad088. eCollection 2023.

Authors

Amanda Araújo Serrão de Andrade¹, Marco Grivet², Otávio Brustolini¹, Ana Tereza Ribeiro Vasconcelos¹

Affiliations

¹ Bioinformatics Laboratory (LABINFO), National Laboratory for Scientific Computing, Av. Getulio Vargas, 333-Quitandinha, 25651-076, Rio de Janeiro, Brazil.
² Pontifícia Universidade Católica do Rio de Janeiro, Rua Marquês de São Vicente 225, Gávea, 22451-900, Rio de Janeiro, Brazil.

Abstract

Summary: The (m, n)-mer is a simple alternative classification feature based on conditional probability distributions. In this application note, we compared k-mer and (m, n)-mer frequency features across 11 distinct datasets covering binary classification, multiclass classification and clustering tasks. Our findings show that the (m, n)-mer frequency features are associated with the highest performance metrics and often outperformed the k-mers with statistical significance. The (m, n)-mer frequencies improved the classification of shorter sequences (as short as 300 bp) and yielded higher metrics for small values of k (ranging from 2 to 4). We therefore present the (m, n)-mer frequencies to the scientific community as a feature that appears effective in identifying complex discriminatory patterns and classifying polyphyletic sequence groups.
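
As an illustration of the contrast between the two feature types, the following is a minimal Python sketch and not the mnmer package's implementation. It assumes one plausible reading of the summary: that an (m, n)-mer feature is the conditional relative frequency of an n-mer given the m-mer immediately preceding it, with k = m + n. The function names (`count_kmers`, `kmer_freqs`, `mn_mer_freqs`), the `prefix|suffix` key format, the DNA alphabet and the handling of ambiguous bases are all hypothetical choices made for this example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: plain DNA alphabet; windows with other symbols are skipped


def count_kmers(seq, k):
    """Count every overlapping k-length window made only of ALPHABET symbols."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(w for w in windows if set(w) <= set(ALPHABET))


def kmer_freqs(seq, k):
    """Unconditional relative frequency of each of the 4**k possible k-mers."""
    counts = count_kmers(seq, k)
    total = sum(counts.values())
    words = ("".join(p) for p in product(ALPHABET, repeat=k))
    return {w: (counts[w] / total if total else 0.0) for w in words}


def mn_mer_freqs(seq, m, n):
    """Conditional frequency of each n-mer given the preceding m-mer (assumed definition)."""
    counts = count_kmers(seq, m + n)
    # Total occurrences of each m-mer prefix, counted over the same (m + n)-windows,
    # so that the conditional frequencies for an observed prefix sum to 1.
    prefix_totals = Counter()
    for word, c in counts.items():
        prefix_totals[word[:m]] += c
    feats = {}
    for p in product(ALPHABET, repeat=m + n):
        word = "".join(p)
        denom = prefix_totals[word[:m]]
        feats[word[:m] + "|" + word[m:]] = counts[word] / denom if denom else 0.0
    return feats


if __name__ == "__main__":
    example = "ACGTACGA"  # toy sequence, for illustration only
    print(kmer_freqs(example, 2)["AC"])       # share of all 2-mers that are "AC"
    print(mn_mer_freqs(example, 1, 1)["G|T"])  # share of "G" occurrences followed by "T"
```

Both functions return 4**k values for k = m + n, so under this reading the feature vector has the same length as the k-mer vector; the only difference is the normalization, which is global for k-mers and per m-mer prefix for (m, n)-mers.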

Availability and implementation: The (m, n)-mer algorithm is released as an R package within the CRAN project (https://cran.r-project.org/web/packages/mnmer) and is also available at https://github.com/labinfo-lncc/mnmer.

Supplementary information: Supplementary data are available at Bioinformatics Advances online.