Prediction of high anti-angiogenic activity peptides in silico using a generalized linear model and feature selection

Sci Rep. 2018 Oct 24;8(1):15688. doi: 10.1038/s41598-018-33911-z.

Abstract

Screening and in silico modeling are critical activities for the reduction of experimental costs. They also speed up research notably and strengthen the theoretical framework, thus allowing researchers to numerically quantify the importance of a particular subset of information. For example, in fields such as cancer and other highly prevalent diseases, having a reliable prediction method is crucial. The objective of this paper is to classify peptide sequences according to their anti-angiogenic activity to understand the underlying principles via machine learning. First, the peptide sequences were converted into three types of numerical molecular descriptors based on the amino acid composition. We performed different experiments with the descriptors and merged them to obtain baseline results for the performance of the models, particularly of each molecular descriptor subset. A feature selection process was applied to reduce the dimensionality of the problem and remove noisy features - which are highly present in biological problems. After a robust machine learning experimental design under equal conditions (nested resampling, cross-validation, hyperparameter tuning and different runs), we statistically and significantly outperformed the best previously published anti-angiogenic model with a generalized linear model via coordinate descent (glmnet), achieving a mean AUC value greater than 0.96 and with an accuracy of 0.86 with 200 molecular descriptors, mixed from the three groups. A final analysis with the top-40 discriminative anti-angiogenic activity peptides is presented along with a discussion of the feature selection process and the individual importance of each molecular descriptors According to our findings, anti-angiogenic activity peptides are strongly associated with amino acid sequences SP, LSL, PF, DIT, PC, GH, RQ, QD, TC, SC, AS, CLD, ST, MF, GRE, IQ, CQ and HG.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence*
  • Angiogenesis Inhibitors / chemistry*
  • Computer Simulation*
  • Databases, Nucleic Acid
  • Humans
  • Linear Models*
  • Logistic Models
  • Machine Learning
  • Neoplasms / physiopathology
  • Neovascularization, Pathologic
  • Normal Distribution
  • Peptides / chemistry*

Substances

  • Angiogenesis Inhibitors
  • Peptides

Associated data

  • figshare/10.6084/m9.figshare.6016994