MetaTransformer: deep metagenomic sequencing read classification using self-attention models

Alexander Wichmann; Etienne Buschong; André Müller; Daniel Jünger; Andreas Hildebrandt; Thomas Hankeln; Bertil Schmidt

doi:10.1093/nargab/lqad082

NAR Genom Bioinform. 2023 Sep 11;5(3):lqad082. doi: 10.1093/nargab/lqad082. eCollection 2023 Sep.

Authors

Alexander Wichmann¹, Etienne Buschong¹, André Müller¹, Daniel Jünger¹, Andreas Hildebrandt¹, Thomas Hankeln², Bertil Schmidt¹

Affiliations

¹ Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany.
² Institute of Organic and Molecular Evolution (iomE), Johannes Gutenberg University, J.-J. Becher-Weg 30A, 55128 Mainz, Rhineland-Palatinate, Germany.

Abstract

Deep learning has emerged as a paradigm that is revolutionizing numerous domains of scientific research. Transformers have been applied to language modeling, where they outperform previous approaches. Deep learning is therefore a promising tool for analyzing genomic sequences and has yielded convincing results in fields such as motif identification and variant calling. DeepMicrobes, a machine learning-based classifier, has recently been introduced for taxonomic prediction at the species and genus level. However, it relies on complex models based on bidirectional long short-term memory cells, resulting in slow runtimes and excessive memory requirements that hamper its usability. We present MetaTransformer, a self-attention-based deep learning tool for metagenomic analysis. Our transformer-encoder-based models enable efficient parallelization while outperforming DeepMicrobes in species and genus classification. Furthermore, we investigate approaches to reduce memory consumption and boost performance using different embedding schemes. As a result, we achieve a 2× to 5× inference speedup compared to DeepMicrobes while keeping a significantly smaller memory footprint. MetaTransformer can be trained in 9 hours for genus and 16 hours for species prediction. Our results demonstrate the performance improvements due to self-attention models and the impact of embedding schemes in deep learning on metagenomic sequencing data.
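For readers unfamiliar with the terms used above, the following is a minimal sketch of how a sequencing read could be split into k-mers, embedded, and passed through a single scaled dot-product self-attention step. It is not MetaTransformer's actual architecture or embedding scheme: the function names `kmer_tokens` and `self_attention`, the k-mer length of 3, the embedding dimension of 8, and the random weights are all illustrative assumptions.

```python
import numpy as np

def kmer_tokens(read, k=3):
    """Map each overlapping k-mer of a DNA read to an integer id in [0, 4**k)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    ids = []
    for i in range(len(read) - k + 1):
        idx = 0
        for base in read[i:i + k]:
            idx = idx * 4 + code[base]  # base-4 encoding of the k-mer
        ids.append(idx)
    return np.array(ids)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, key, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ key.T / np.sqrt(key.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

# Illustrative sizes and random parameters (assumptions, not trained values).
rng = np.random.default_rng(0)
k, d = 3, 8
embedding = rng.normal(size=(4 ** k, d))  # one vector per possible k-mer
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

tokens = kmer_tokens("ACGTACGTAC", k)
out, weights = self_attention(embedding[tokens], w_q, w_k, w_v)
```

Every position attends to every other position in one matrix product, which is why self-attention parallelizes more readily than a recurrent model that must process a read base by base. Note also that a full lookup table has 4**k rows, so its memory cost grows exponentially with k; this is one reason the choice of embedding scheme matters for memory consumption.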