An Improved Machine Learning-Based Approach to Assess the Microbial Diversity in Major North Indian River Ecosystems

Genes (Basel). 2023 May 14;14(5):1082. doi: 10.3390/genes14051082.

Abstract

Rapidly evolving high-throughput sequencing (HTS) technologies generate voluminous genomic and metagenomic sequences, which can help classify the microbial communities of many ecosystems with high accuracy. Conventionally, rule-based binning techniques are used to classify contigs or scaffolds based on either sequence composition or sequence similarity. However, accurate classification of microbial communities remains a major challenge because of the massive data volumes involved and the need for efficient binning methods and classification algorithms. We therefore implemented iterative K-Means clustering for the initial binning of metagenomic sequences and applied various machine learning algorithms (MLAs) to classify the newly identified unknown microbes. Clusters were annotated with the NCBI BLAST program, which grouped the assembled scaffolds into five classes: bacteria, archaea, eukaryota, viruses and others. The annotated cluster sequences were then used to train the MLAs and develop prediction models for classifying unknown metagenomic sequences. Metagenomic datasets from samples collected from the Ganga (Kanpur and Farakka) and Yamuna (Delhi) rivers in India were used for clustering and for training the MLA models. The performance of the MLAs was evaluated by 10-fold cross-validation. The Random Forest model outperformed the other learning algorithms considered. The proposed method can be used to annotate metagenomic scaffolds/contigs and is complementary to existing methods of metagenomic data analysis. The source code of an offline predictor with the best prediction model is available at https://github.com/Nalinikanta7/metagenomics.
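The binning step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state which features were clustered, so normalized dinucleotide frequencies are assumed here as a simple composition-based representation. The sequences are synthetic, the function names (kmer_profile, kmeans) are hypothetical, and plain K-Means with a deterministic farthest-first initialization stands in for the iterative variant used in the study. BLAST annotation and classifier training are not shown.

```python
import random
from itertools import product

K = 2  # assumption: dinucleotides keep the toy example small
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_profile(seq, k=K):
    """Return the normalized k-mer frequency vector of one sequence."""
    counts = [0.0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = INDEX.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing N or other ambiguous bases
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, max_iter=100):
    """Plain K-Means (Lloyd's algorithm) with farthest-first initialization."""
    centroids = [points[0]]
    while len(centroids) < n_clusters:
        # Next seed: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(n_clusters), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Synthetic "scaffolds": five AT-rich and five GC-rich sequences (toy data).
rng = random.Random(0)
at_rich = ["".join(rng.choices("ACGT", weights=[4, 1, 1, 4], k=500)) for _ in range(5)]
gc_rich = ["".join(rng.choices("ACGT", weights=[1, 4, 4, 1], k=500)) for _ in range(5)]
seqs = at_rich + gc_rich

profiles = [kmer_profile(s) for s in seqs]
labels, centroids = kmeans(profiles, n_clusters=2)
print(labels)
```

In a workflow like the one described, each resulting bin would then be annotated (for example by a BLAST search) and the annotated sequences used as labelled training data for a classifier such as Random Forest, evaluated by 10-fold cross-validation.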

Keywords: K-Means clustering; binning; metagenomics; river sediment; support vector machine.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Machine Learning
  • Metagenome / genetics
  • Microbiota* / genetics
  • Rivers*
  • Software

Grants and funding

N.C. is grateful to the Post Graduate School, ICAR-Indian Agricultural Research Institute, New Delhi, for providing financial assistance. All authors acknowledge the grant for the CABin Scheme Network Project on Agricultural Bioinformatics and Computational Biology (F.No. Agril.Edn.14/2/2017-A&P dated 2 August 2017), received from the Indian Council of Agricultural Research, New Delhi.