Ranking microbial metabolomic and genomic links in the NPLinker framework using complementary scoring functions

Grímur Hjörleifsson Eldjárn; Andrew Ramsay; Justin J J van der Hooft; Katherine R Duncan; Sylvia Soldatou; Juho Rousu; Rónán Daly; Joe Wandy; Simon Rogers

doi:10.1371/journal.pcbi.1008920

PLoS Comput Biol. 2021 May 4;17(5):e1008920. doi: 10.1371/journal.pcbi.1008920. eCollection 2021 May.

Authors

Grímur Hjörleifsson Eldjárn¹, Andrew Ramsay¹, Justin J J van der Hooft², Katherine R Duncan³, Sylvia Soldatou⁴, Juho Rousu⁵, Rónán Daly⁶, Joe Wandy⁶, Simon Rogers¹

Affiliations

¹ School of Computing Science, University of Glasgow, Glasgow, United Kingdom.
² Bioinformatics Group, Wageningen University, Wageningen, The Netherlands.
³ Strathclyde Institute of Pharmacy and Biomedical Sciences, University of Strathclyde, Glasgow, United Kingdom.
⁴ School of Pharmacy and Life Sciences, Robert Gordon University, Aberdeen, United Kingdom.
⁵ Department of Computer Science, Aalto University, Espoo, Finland.
⁶ Glasgow Polyomics, University of Glasgow, Glasgow, United Kingdom.

Abstract

Specialised metabolites from microbial sources are well known for their wide range of biomedical applications, particularly as antibiotics. When mining paired genomic and metabolomic data sets for novel specialised metabolites, establishing links between Biosynthetic Gene Clusters (BGCs) and metabolites is a promising way of finding such novel chemistry. However, this is not a simple task, because detailed biosynthetic knowledge is lacking for the majority of predicted BGCs and the number of possible combinations is large. The problem is becoming ever more pressing with the increased availability of paired omics data sets. Current tools are not effective at identifying valid links automatically, and manual verification is a considerable bottleneck in natural product research. We demonstrate that using multiple link-scoring functions together makes it easier to prioritise true links over others. We introduce two scores: a new, more effective score obtained by standardising a commonly used score, and a novel score based on an Input-Output Kernel Regression approach. Finally, we present NPLinker, a software framework to link genomic and metabolomic data. Results are verified using publicly available data sets that include validated links.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Biosynthetic Pathways / genetics
Computational Biology
Data Mining
Databases, Factual
Databases, Genetic
Genetics, Microbial / statistics & numerical data*
Genome, Microbial
Genomics / statistics & numerical data*
Metabolomics / statistics & numerical data*
Microbiological Phenomena
Multigene Family
Regression Analysis
Software*

Grants and funding

BB/R022054/1/BB_/Biotechnology and Biological Sciences Research Council/United Kingdom