K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes

Simon Orozco-Arias; Mariana S Candamil-Cortés; Paula A Jaimes; Johan S Piña; Reinel Tabares-Soto; Romain Guyot; Gustavo Isaza

doi:10.7717/peerj.11456

PeerJ. 2021 May 19;9:e11456. doi: 10.7717/peerj.11456. eCollection 2021.

Authors

Simon Orozco-Arias¹ ², Mariana S Candamil-Cortés¹, Paula A Jaimes¹, Johan S Piña¹, Reinel Tabares-Soto³, Romain Guyot³ ⁴, Gustavo Isaza²

Affiliations

¹ Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia.
² Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia.
³ Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia.
⁴ Institut de Recherche pour le Développement, CIRAD, Univ. Montpellier, Montpellier, France.

Abstract

Every day, more plant genomes become available in public databases, and additional massive sequencing projects (i.e., projects that aim to sequence thousands of individuals) are formulated and released. Nevertheless, there are not enough automatic tools to analyze this large amount of genomic information. LTR retrotransposons are the most frequent repetitive sequences in plant genomes; however, their detection and classification are commonly performed using semi-automatic and time-consuming programs. Although several bioinformatic tools follow different approaches to detect and classify them, none of these tools can obtain accurate results on its own. Here, we used Machine Learning algorithms based on k-mer counts to distinguish LTR retrotransposons from other genomic sequences and to classify them into lineages/families with an F1-Score of 95%, contributing to the development of an alignment-free and automatic method to analyze these sequences.
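As an illustration of the two ideas named in the abstract, the following minimal Python sketch shows how a DNA sequence might be turned into a fixed-length vector of k-mer counts and how an F1-Score might be computed for one class. It is not the authors' implementation: the function names `kmer_counts` and `f1_score`, the choice of `k`, the handling of ambiguous bases, and the toy labels are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k=3, alphabet="ACGT"):
    """Return a fixed-length list of k-mer counts for one DNA sequence.

    Hypothetical helper: the abstract does not state which k values or
    preprocessing were used.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all len(alphabet)**k k-mers in a fixed order so that every
    # sequence maps to the same feature space. Windows containing other
    # symbols (e.g. N) are simply ignored (an assumption of this sketch).
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def f1_score(y_true, y_pred, positive):
    """F1 = harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy usage: one feature vector, and F1 on made-up labels.
features = kmer_counts("ACGTAC", k=2)
score = f1_score(["LTR", "LTR", "other", "other"],
                 ["LTR", "other", "other", "LTR"],
                 positive="LTR")
```

Vectors of this kind can be passed to any standard classifier. Because the features depend only on substring counts, no sequence alignment is needed, which is what makes the approach alignment-free.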

Keywords: Classification; Free-alignment approach; LTR retrotransposons; Machine learning; Plant genomes; Transposable elements; k-mer based method.

Grants and funding

Simon Orozco-Arias is supported by a Ph.D. grant from the Ministry of Science, Technology and Innovation (Minciencias) of Colombia, Grant Call 785/2017. The authors and publication fees were supported by Universidad Autónoma de Manizales, Manizales, Colombia under project 589-089. This work was supported by Ecos-Nord N°C21MA01 and STICAMSUD 21-STIC-13. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.