Essential gene prediction using limited gene essentiality information-An integrative semi-supervised machine learning strategy

PLoS One. 2020 Nov 30;15(11):e0242943. doi: 10.1371/journal.pone.0242943. eCollection 2020.

Abstract

Essential gene prediction helps to find minimal genes indispensable for the survival of any organism. Machine learning (ML) algorithms have been useful for the prediction of gene essentiality. However, currently available ML pipelines perform poorly for organisms with limited experimental data. The objective is the development of a new ML pipeline to help in the annotation of essential genes of less explored disease-causing organisms for which minimal experimental data is available. The proposed strategy combines unsupervised feature selection technique, dimension reduction using the Kamada-Kawai algorithm, and semi-supervised ML algorithm employing Laplacian Support Vector Machine (LapSVM) for prediction of essential and non-essential genes from genome-scale metabolic networks using very limited labeled dataset. A novel scoring technique, Semi-Supervised Model Selection Score, equivalent to area under the ROC curve (auROC), has been proposed for the selection of the best model when supervised performance metrics calculation is difficult due to lack of data. The unsupervised feature selection followed by dimension reduction helped to observe a distinct circular pattern in the clustering of essential and non-essential genes. LapSVM then created a curve that dissected this circle for the classification and prediction of essential genes with high accuracy (auROC > 0.85) even with 1% labeled data for model training. After successful validation of this ML pipeline on both Eukaryotes and Prokaryotes that show high accuracy even when the labeled dataset is very limited, this strategy is used for the prediction of essential genes of organisms with inadequate experimentally known data, such as Leishmania sp. Using a graph-based semi-supervised machine learning scheme, a novel integrative approach has been proposed for essential gene prediction that shows universality in application to both Prokaryotes and Eukaryotes with limited labeled data. The essential genes predicted using the pipeline provide an important lead for the prediction of gene essentiality and identification of novel therapeutic targets for antibiotic and vaccine development against disease-causing parasites.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Area Under Curve
  • Cluster Analysis
  • Genes, Essential / genetics*
  • Genetic Diseases, Inborn / diagnosis
  • Genetic Diseases, Inborn / genetics*
  • Humans
  • Machine Learning*
  • Metabolic Networks and Pathways / genetics*
  • Support Vector Machine

Grants and funding

We thank SERB, Department of Science and Technology, Govt. of India (DST/ICPS/EDA/2018) and DBT, Department of Biotechnology, Govt. of India (File No. BT/PR14958/BID/7/537/2015), for providing financial support to Ram Rup Sarkar. Sutanu Nandi acknowledges DST-INSPIRE for Senior Research Fellowship. Piyali Ganguli acknowledges the Council of Scientific & Industrial Research (CSIR) for the Senior Research Fellowship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.