MetaMLP: A Fast Word Embedding Based Classifier to Profile Target Gene Databases in Metagenomic Samples

J Comput Biol. 2021 Nov;28(11):1063-1074. doi: 10.1089/cmb.2021.0273. Epub 2021 Oct 19.

Abstract

The functional profile of metagenomic samples enables improved understanding of microbial populations in the environment. Such analysis consists of assigning short sequencing reads to a particular functional category. Functional assignment normally relies on manually curated databases in which genes are arranged into classes. Sequence alignment has been widely used to profile metagenomic samples against these curated databases, but it is time-consuming and computationally demanding. Several alignment-free methods based on k-mer composition have been developed in recent years, yet they still require large amounts of main memory. This article presents MetaMLP (Metagenomics Machine Learning Profiler), a machine learning method that represents sequences as numerical vectors (embeddings) and uses a simple neural network with a single hidden layer to profile functional categories. Unlike other methods, MetaMLP enables partial matching by using a reduced alphabet to build sequence embeddings from full and partial k-mers. MetaMLP identifies a slightly larger number of reads than DIAMOND (one of the fastest sequence alignment methods) and makes accurate predictions, with 0.99 precision and 0.99 recall. MetaMLP can process 100M reads in ∼10 minutes on a laptop computer, which is 50 times faster than DIAMOND.
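The idea of building embeddings from full and partial k-mers over a reduced alphabet can be illustrated with a minimal sketch. It is not MetaMLP's implementation: the amino acid grouping, the k-mer lengths, the hashing scheme, the dimensions, and all function names below are assumptions chosen for illustration, the weights are random rather than trained, and the input is assumed to be a read already translated to amino acids. The averaged k-mer embedding plays the role of the single hidden layer, followed by a softmax over functional classes.

```python
import math
import random
import zlib

# Hypothetical reduced alphabet: residues grouped by rough physicochemical
# similarity. The grouping MetaMLP actually uses is not given in the abstract.
GROUPS = {
    "A": "AGST",  # small
    "C": "C",     # cysteine
    "D": "DENQ",  # acidic / amide
    "F": "FWY",   # aromatic
    "I": "ILMV",  # aliphatic
    "K": "HKR",   # basic
    "P": "P",     # proline
}
REDUCE = {aa: rep for rep, members in GROUPS.items() for aa in members}


def reduce_alphabet(protein):
    """Map each residue to its group representative; unknowns become 'X'."""
    return "".join(REDUCE.get(aa, "X") for aa in protein.upper())


def kmers(seq, k=5, min_k=3):
    """Return full k-mers plus shorter 'partial' k-mers of length min_k..k-1."""
    out = []
    for size in range(min_k, k + 1):
        out.extend(seq[i:i + size] for i in range(len(seq) - size + 1))
    return out


def embed(seq, table, k=5, min_k=3):
    """Average the embedding rows of all hashed k-mers of the reduced sequence."""
    dim = len(table[0])
    words = kmers(reduce_alphabet(seq), k, min_k)
    vec = [0.0] * dim
    for word in words:
        # Hash each k-mer into a fixed number of buckets (an assumption).
        row = table[zlib.crc32(word.encode()) % len(table)]
        for j in range(dim):
            vec[j] += row[j]
    n = max(len(words), 1)
    return [v / n for v in vec]


def predict(vec, weights):
    """Linear output layer followed by a softmax over functional classes."""
    scores = [sum(w * v for w, v in zip(row, vec)) for row in weights]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Untrained, randomly initialised parameters, for illustration only.
rng = random.Random(0)
DIM, BUCKETS, CLASSES = 8, 1024, 3
table = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(BUCKETS)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(CLASSES)]
probs = predict(embed("MKTLLVAGA", table), weights)
```

Under this hypothetical grouping, `MKTLLVAGA` and `LKSIVMGAT` reduce to the same string, so they yield identical embeddings even though they differ at most positions. This is one way a reduced alphabet can tolerate substitutions that an exact k-mer lookup would miss.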

Keywords: antibiotic resistance; metagenomic; short reads; word embedding.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Computational Biology / methods*
  • Data Curation
  • Databases, Genetic
  • Machine Learning
  • Metagenomics / methods*
  • Sequence Alignment / methods*
  • Sequence Analysis, DNA