Decoding the link of microbiome niches with homologous sequences enables accurately targeted protein structure prediction

Proc Natl Acad Sci U S A. 2021 Dec 7;118(49):e2110828118. doi: 10.1073/pnas.2110828118.

Abstract

Information derived from metagenome sequences through deep-learning techniques has significantly improved the accuracy of template-free protein structure modeling. However, most deep learning-based modeling studies rely on blind sequence database searches and suffer from low efficiency in computational resource utilization and model construction, especially when the sequence library becomes prohibitively large. We proposed a MetaSource model built on 4.25 billion microbiome sequences from four major biomes (Gut, Lake, Soil, and Fermentor) to decode the inherent linkage of microbial niches with protein homologous families. Large-scale protein family folding experiments on 8,700 unknown Pfam families showed that a microbiome-targeted approach, with multiple sequence alignments constructed from individual MetaSource biomes, requires more than threefold less computer memory and CPU (central processing unit) time yet generates contact-map and three-dimensional structure models with significantly higher accuracy than an approach using combined metagenome datasets. These results demonstrate an avenue to bridge the gap between the rapidly increasing metagenome databases and the limited computing resources for efficient genome-wide database mining, and they provide a useful bluebook to guide future microbiome sequence database and modeling development for high-accuracy protein structure and function prediction.

Keywords: deep learning; microbiome; multiple sequence alignments; protein homologous families; protein structure prediction.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computational Biology / methods
  • Databases, Protein
  • Deep Learning
  • Ecosystem
  • Evolution, Molecular
  • Humans
  • Metagenome / genetics
  • Microbiota / genetics*
  • Neural Networks, Computer
  • Protein Conformation
  • Protein Folding
  • Proteins / chemistry
  • Sequence Alignment / methods*
  • Sequence Analysis, Protein / methods*
  • Sequence Homology

Substances

  • Proteins