Predicting gene phenotype by multi-label multi-class model based on essential functional features

Mol Genet Genomics. 2021 Jul;296(4):905-918. doi: 10.1007/s00438-021-01789-8. Epub 2021 Apr 29.

Abstract

Phenotype is one of the most significant concepts in genetics; it describes all observable characteristics of a research object. Because phenotype reflects the combined effects of genotype and environmental factors, phenotypic characteristics are hard to define, and unknown phenotypes are even harder to predict. With current biological techniques, obtaining sufficient structural information on large numbers of phenotype-associated genes/proteins remains expensive and time-consuming. Various bioinformatics methods have been proposed to address this problem, and researchers have confirmed the efficacy and prediction accuracy of functional network-based prediction. However, general functional descriptions have highly complicated internal structures, which makes them difficult to use directly for phenotype prediction. To address this issue and improve prediction across more than ten kinds of phenotypes, we first extract functional enrichment features from GO and KEGG and then use node2vec to learn functional embedding features of genes from a gene-gene network. These features are analyzed with feature selection methods (Boruta, minimum redundancy maximum relevance) to generate a feature list. This list is fed into incremental feature selection, which incorporates multi-label classifiers built with RAkEL and several classic base classifiers, to build an optimal multi-label multi-class classification model for phenotype prediction. According to recent publications, our method identified many literature-supported genes/proteins and their associated phenotypes, as well as some candidate genes with newly assigned phenotypes, providing a new computational tool for accurate and effective phenotype prediction.
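
The incremental feature selection step described in the abstract can be illustrated with a minimal sketch. It assumes that the feature list is ordered from most to least relevant and that a scoring function returns a model's performance for a given feature subset. The function names, the toy scoring rule, and the feature names (`GO_a`, `KEGG_b`, `emb_3`, and so on) are hypothetical and do not come from the paper; in the actual study the score would presumably be the cross-validated performance of a RAkEL-based multi-label classifier.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[List[str]], float],
    step: int = 1,
) -> Tuple[List[str], float]:
    """Score growing prefixes of a ranked feature list and keep the best one."""
    if step < 1:
        raise ValueError("step must be >= 1")
    best_subset: List[str] = []
    best_score = float("-inf")
    # Try the top `step` features, then the top `2 * step`, and so on.
    for size in range(step, len(ranked_features) + 1, step):
        subset = list(ranked_features[:size])
        score = evaluate(subset)
        # Strict comparison: on a tie, the smaller subset is kept.
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


def toy_score(subset: List[str]) -> float:
    """Hypothetical stand-in for classifier performance (an assumption).

    Rewards three 'informative' features and penalizes subset size, so
    the score peaks when exactly those three features are included.
    """
    informative = {"GO_a", "KEGG_b", "emb_3"}
    hits = len(informative & set(subset))
    return hits - 0.1 * len(subset)


# Hypothetical ranked list mixing enrichment and embedding features.
ranked = ["GO_a", "KEGG_b", "emb_3", "GO_x", "emb_9"]
best, score = incremental_feature_selection(ranked, toy_score)
print(best)
```

In this toy setting the search stops improving once the uninformative features start to be added, so the selected subset is the first three entries of the list. Replacing `toy_score` with a real evaluation routine is the only change needed to reuse the loop.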

Keywords: Feature selection; Functional enrichment; Multi-label classification; Network embedding; Phenotype; RAkEL.

MeSH terms

  • Algorithms*
  • Computational Biology / methods*
  • Datasets as Topic
  • Gene Regulatory Networks / physiology
  • Genetic Association Studies / methods*
  • Metabolic Networks and Pathways / genetics
  • Phenotype
  • Proteins / chemistry
  • Proteins / genetics
  • Proteins / physiology
  • Saccharomyces cerevisiae / genetics
  • Saccharomyces cerevisiae / metabolism
  • Saccharomyces cerevisiae Proteins / chemistry
  • Saccharomyces cerevisiae Proteins / genetics
  • Saccharomyces cerevisiae Proteins / physiology
  • Structure-Activity Relationship

Substances

  • Proteins
  • Saccharomyces cerevisiae Proteins