Using machine learning tools for protein database biocuration assistance

Caroline König; Ilmira Shaim; Alfredo Vellido; Enrique Romero; René Alquézar; Jesús Giraldo

doi:10.1038/s41598-018-28330-z

Sci Rep. 2018 Jul 5;8(1):10148. doi: 10.1038/s41598-018-28330-z.

Authors

Caroline König¹, Ilmira Shaim¹, Alfredo Vellido²,³, Enrique Romero¹, René Alquézar¹, Jesús Giraldo⁴,⁵

Affiliations

¹ IDEAI Research Center, Universitat Politècnica de Catalunya, UPC BarcelonaTech, 08034, Barcelona, Spain.
² IDEAI Research Center, Universitat Politècnica de Catalunya, UPC BarcelonaTech, 08034, Barcelona, Spain. avellido@cs.upc.edu.
³ Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), 08193, Cerdanyola del Vallès, Spain. avellido@cs.upc.edu.
⁴ Institut de Neurociències - Unitat de Bioestadística, Universitat Autònoma de Barcelona, 08193, Cerdanyola del Vallès, Spain. jesus.giraldo@uab.es.
⁵ Network Biomedical Research Center on Mental Health (CIBERSAM), Madrid, 28029, Spain. jesus.giraldo@uab.es.

Abstract

Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible, publicly available databases has become a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools, using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes, play a relevant role in cell communication, and are major drug targets in pharmacology. These receptors are characterized according to subtype labels. A previous analysis of this database provided evidence that some of the receptor sequences could be affected by label noise, as they were consistently misclassified by machine learning methods. Here, we extend our analysis to recent, substantially modified versions of the database and show, using several machine learning models and different transformations of the unaligned sequences, that their labeling is now extremely accurate. These findings support the adequacy of our proposed method for identifying problematic labeling cases as a tool for database biocuration.
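The core idea in the abstract, flagging sequences that classifiers consistently misclassify as candidates for label review, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the study: the amino acid composition transform, the leave-one-out nearest-centroid classifier (a simple stand-in for the machine learning models the abstract mentions), the function names, the toy sequences, and the labels `fam1` and `fam2`.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Alignment-free transform: relative frequency of each of the 20 amino acids."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]


def loo_nearest_centroid(features, labels):
    """Leave-one-out predictions from a nearest-centroid classifier (illustrative stand-in)."""
    preds = []
    for i, x in enumerate(features):
        centroids = {}
        for lab in sorted(set(labels)):
            # Class members, excluding the held-out item i.
            members = [f for j, f in enumerate(features) if labels[j] == lab and j != i]
            if members:
                centroids[lab] = [sum(col) / len(members) for col in zip(*members)]
        # Predict the class whose centroid is closest in squared Euclidean distance.
        preds.append(min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        ))
    return preds


def flag_suspects(predictions_by_model, labels, min_fraction=1.0):
    """Return indices misclassified by at least min_fraction of the models."""
    n_models = len(predictions_by_model)
    suspects = []
    for i, true_label in enumerate(labels):
        misses = sum(1 for preds in predictions_by_model.values() if preds[i] != true_label)
        if n_models and misses / n_models >= min_fraction:
            suspects.append(i)
    return suspects


# Toy data (hypothetical): the last sequence resembles fam1 but is labeled fam2.
sequences = ["AAAA", "AAAC", "AAAA", "WWWW", "WWWY", "AAAA"]
labels = ["fam1", "fam1", "fam1", "fam2", "fam2", "fam2"]

features = [aa_composition(s) for s in sequences]
predictions = {"composition+centroid": loo_nearest_centroid(features, labels)}
suspects = flag_suspects(predictions, labels)
print(suspects)
```

In practice one would add further entries to `predictions`, one per model and sequence transformation, and lower `min_fraction` if agreement among most, rather than all, models is enough to send a sequence to a curator for review.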

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Data Curation / methods*
Databases, Protein*
Machine Learning*
Receptors, G-Protein-Coupled / metabolism
Support Vector Machine

Substances

Receptors, G-Protein-Coupled
