A systematic review of the application of machine learning in the detection and classification of transposable elements

PeerJ. 2019 Dec 18;7:e8311. doi: 10.7717/peerj.8311. eCollection 2019.

Abstract

Background: Transposable elements (TEs) constitute the most common repeated sequences in eukaryotic genomes. Recent studies have demonstrated their deep impact on species diversity, adaptation to the environment, and disease. Although there are many conventional bioinformatics algorithms for detecting and classifying TEs, none has achieved reliable results across different types of TEs. Machine learning (ML) techniques can automatically extract hidden patterns and novel information from labeled or non-labeled data and have been applied to several scientific problems.

Methodology: We followed the Systematic Literature Review (SLR) process, applying the six stages of its review protocol and adding a preliminary stage that assesses the need for a review. Search equations were then formulated and executed in several literature databases. Relevant publications were scanned and used to extract evidence to answer the research questions.

Results: Several ML approaches have already been tested on other bioinformatics problems with promising results, yet few algorithms and architectures in the literature focus specifically on TEs, even though TEs represent the majority of the nuclear DNA of many organisms. Only 35 articles were found and categorized as relevant in TE or related fields.

Conclusions: ML is a powerful tool that can be used to address many problems. Although ML techniques have been used widely in other biological tasks, their use in TE analyses is still limited. Following the SLR, we found that the use of ML for TE analyses (detection and classification) remains an open problem, and that this new field of research is attracting growing interest.

Keywords: Bioinformatics; Classification; Deep learning; Detection; Machine learning; Retrotransposons; Transposable elements.

Grants and funding

Simon Orozco-Arias is supported by a Ph.D. grant from Departamento Administrativo de Ciencia, Tecnología e Innovación de Colombia (Colciencias), Convocatoria 785/2017. The authors and publication fees were supported by the Universidad Autónoma de Manizales, Manizales, Colombia, under project 589-089. Romain Guyot was supported by the LMI BIO-INCA. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.