Correct machine learning on protein sequences: a peer-reviewing perspective

Brief Bioinform. 2016 Sep;17(5):831-40. doi: 10.1093/bib/bbv082. Epub 2015 Sep 26.

Abstract

Machine learning methods are increasingly popular for predicting protein features from sequences. Machine learning in bioinformatics can be powerful but also carries the risk of introducing unexpected biases, which may lead to overestimated performance. This article presents a set of guidelines to help both peer reviewers and authors avoid common machine learning pitfalls. Understanding the biology is necessary to produce useful data sets, which have to be large and diverse. Separating the training and testing processes is imperative to avoid overselling method performance, which also depends on several hidden parameters. A novel predictor must always be compared with several existing methods, including simple baseline strategies. Using the presented guidelines will help nonspecialists appreciate the critical issues in machine learning.

Keywords: evaluation; machine learning; posttranslational modification; predictor; protein sequence; training.
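
Two of the guidelines in the abstract, keeping training and test data separate and comparing against a simple baseline, can be illustrated with a minimal sketch. It is not the article's own procedure. It assumes each sequence already carries a cluster identifier (for example, from a sequence-identity clustering step done beforehand), and the records, labels, cluster names, and function names below are made up for illustration.

```python
import random
from collections import Counter


def split_by_cluster(records, test_fraction=0.3, seed=0):
    """Split (sequence, label, cluster_id) records so that no cluster
    contributes sequences to both the training and the test set."""
    clusters = sorted({cluster for _, _, cluster in records})
    rng = random.Random(seed)
    rng.shuffle(clusters)
    # Hold out whole clusters, never individual sequences.
    n_test = max(1, round(len(clusters) * test_fraction))
    test_clusters = set(clusters[:n_test])
    train = [r for r in records if r[2] not in test_clusters]
    test = [r for r in records if r[2] in test_clusters]
    return train, test


def majority_baseline_accuracy(train, test):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(label for _, label, _ in train).most_common(1)[0][0]
    hits = sum(1 for _, label, _ in test if label == majority_label)
    return hits / len(test)


# Hypothetical toy data: (sequence, label, cluster_id).
records = [
    ("MKTAYIAKQR", 1, "c1"),
    ("MKTAYIAKQK", 1, "c1"),
    ("GSHMLEDPVA", 0, "c2"),
    ("GSHMLEDPVG", 0, "c2"),
    ("AAPLLQWERT", 0, "c3"),
    ("TTPLKQWERT", 1, "c4"),
]

train, test = split_by_cluster(records)
baseline = majority_baseline_accuracy(train, test)
print(f"train={len(train)} test={len(test)} baseline accuracy={baseline:.2f}")
```

Splitting by cluster rather than by individual sequence means near-identical sequences cannot appear on both sides of the split, which is one common source of inflated performance. The majority-class accuracy gives a floor that any new predictor should clearly exceed on the same test set.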

Publication types

  • Review

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Computational Biology
  • Humans
  • Machine Learning*
  • Proteins

Substances

  • Proteins