Evaluating Plant Gene Models Using Machine Learning

Shriprabha R Upadhyaya; Philipp E Bayer; Cassandria G Tay Fernandez; Jakob Petereit; Jacqueline Batley; Mohammed Bennamoun; Farid Boussaid; David Edwards

doi:10.3390/plants11121619

Plants (Basel). 2022 Jun 20;11(12):1619. doi: 10.3390/plants11121619.

Authors

Shriprabha R Upadhyaya¹, Philipp E Bayer¹, Cassandria G Tay Fernandez¹, Jakob Petereit¹, Jacqueline Batley¹, Mohammed Bennamoun², Farid Boussaid³, David Edwards¹

Affiliations

¹ School of Biological Sciences, University of Western Australia, Perth, WA 6000, Australia.
² Department of Computer Science and Software Engineering, University of Western Australia, Perth, WA 6000, Australia.
³ Department of Electrical, Electronic and Computer Engineering, University of Western Australia, Perth, WA 6000, Australia.

Abstract

Gene models are regions of the genome that can be transcribed into RNA and translated into proteins, or that belong to a class of non-coding RNA genes. Predicting gene models is a complex process that can be unreliable, leading to false-positive annotations. To support the calling of confident, conserved gene models and to minimize false positives arising during gene model prediction, we have developed Truegene, a machine learning approach that classifies potential low-confidence gene models using 14 gene-based and 41 protein-based characteristics. Amino acid and nucleotide sequence-based features were calculated for conserved (high-confidence) and non-conserved (low-confidence) annotated genes from the published Pisum sativum Cameor genome. These features were used to train eXtreme Gradient Boosting (XGBoost) classifier models to predict whether a gene model is likely to be real. The optimized models achieved a prediction accuracy of 87% to 90% and an F1 score of 0.91 to 0.94. We used SHapley Additive exPlanations (SHAP) and feature importance plots to identify the features that contribute to the model predictions. We show that protein-based and gene-based features can be used to build accurate models for gene prediction, with applications in supporting future gene annotation processes.
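
To make the feature and evaluation steps described in the abstract more concrete, the following is a minimal sketch in Python using only the standard library. It is not the Truegene implementation: the abstract does not list the 14 gene-based and 41 protein-based characteristics, so the features shown here (coding sequence length, GC content, protein length, amino acid fractions) and all function names are illustrative assumptions. The last function shows how accuracy and the F1 score are computed from binary predictions, where label 1 is assumed to mean a high-confidence gene model.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def nucleotide_features(cds: str) -> dict:
    """Hypothetical gene-based features computed from a coding sequence."""
    seq = cds.upper()
    length = len(seq)
    gc = sum(1 for base in seq if base in "GC")
    return {
        "cds_length": length,
        "gc_content": gc / length if length else 0.0,
    }


def protein_features(protein: str) -> dict:
    """Hypothetical protein-based features: length and amino acid fractions."""
    seq = protein.upper().rstrip("*")  # drop a trailing stop symbol if present
    counts = Counter(seq)
    length = len(seq)
    features = {"protein_length": length}
    for aa in AMINO_ACIDS:
        features[f"frac_{aa}"] = counts[aa] / length if length else 0.0
    return features


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1
```

In a workflow like the one the abstract describes, the two feature dictionaries for each gene model would be merged into one row of a feature table, and a gradient-boosted classifier would be trained on those rows with the conserved or non-conserved status as the label.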

Keywords: SHAP; XGBoost; gene models; machine learning; pea.

Grants and funding

Australian Research Council: DP200100762 and DE210100398