Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective

Katrin Sophie Bohnsack; Marika Kaden; Julia Abel; Thomas Villmann

doi:10.1109/TCBB.2022.3140873

IEEE/ACM Trans Comput Biol Bioinform. 2023 Jan-Feb;20(1):119-135. doi: 10.1109/TCBB.2022.3140873. Epub 2023 Feb 3.

PMID: 34990369

Abstract

The large amounts of biological sequence data generated over the last decades, together with algorithmic and hardware improvements, have made it possible to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the need to rigorously distinguish data transformation from data comparison and to adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. As the disadvantages of alignments for sequence comparison have become apparent, typical applications increasingly use so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of: 1) the sequence data transformation, comprising adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be a lossless data transformation that yields an appropriate representation, whereas feature generation is inevitably lossy, with the intention of extracting just the task-relevant information. This distinction sheds light on the plethora of methods available and assists in identifying suitable methods in machine learning and data analysis to compare the sequences under these premises.
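
The separation described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the survey: the choice of one-hot coding, normalized k-mer frequencies, and the Euclidean distance, as well as all function names, are assumptions made here purely for illustration. The sketch shows a coding step that can be inverted (lossless), a feature generation step that cannot (lossy), and a proximity measure applied only to the transformed data.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"  # assumption: DNA sequences over exactly these four letters


def one_hot_code(seq):
    """Coding (lossless): map each nucleotide to a unit vector."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


def decode_one_hot(coded):
    """Invert the coding, showing that no information was lost."""
    return "".join(ALPHABET[row.index(1)] for row in coded)


def kmer_features(seq, k=2):
    """Feature generation (lossy): normalized k-mer frequency vector."""
    n_windows = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    total = max(n_windows, 1)  # avoid division by zero for very short inputs
    return [counts["".join(kmer)] / total for kmer in product(ALPHABET, repeat=k)]


def euclidean(u, v):
    """Proximity measure evaluated on the transformed data."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dissimilarity(s, t, k=2):
    """Transformation followed by comparison, kept as two separate steps."""
    return euclidean(kmer_features(s, k), kmer_features(t, k))
```

In this sketch, decode_one_hot(one_hot_code("ACGT")) recovers the original sequence, whereas "AACA" and "ACAA" contain the same 2-mers (AA, AC, CA) and therefore receive a dissimilarity of zero although the sequences differ. Whether such a loss is acceptable depends on the task, which is why the transformation and the proximity measure are chosen together.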

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Machine Learning*
Mathematics
Sequence Alignment
Sequence Analysis