A deep learning approach for orphan gene identification in moso bamboo (Phyllostachys edulis) based on the CNN + Transformer model

BMC Bioinformatics. 2022 May 5;23(1):162. doi: 10.1186/s12859-022-04702-1.

Abstract

Background: Orphan gene play an important role in the environmental stresses of many species and their identification is a critical step to understand biological functions. Moso bamboo has high ecological, economic and cultural value. Studies have shown that the growth of moso bamboo is influenced by various stresses. Several traditional methods are time-consuming and inefficient. Hence, the development of efficient and high-accuracy computational methods for predicting orphan genes is of great significance.

Results: In this paper, we propose a novel deep learning model (CNN + Transformer) for identifying orphan genes in moso bamboo. It uses a convolutional neural network in combination with a transformer neural network to capture k-mer amino acids and features between k-mer amino acids in protein sequences. The experimental results show that the average balance accuracy value of CNN + Transformer on moso bamboo dataset can reach 0.875, and the average Matthews Correlation Coefficient (MCC) value can reach 0.471. For the same testing set, the Balance Accuracy (BA), Geometric Mean (GM), Bookmaker Informedness (BM), and MCC values of the recurrent neural network, long short-term memory, gated recurrent unit, and transformer models are all lower than those of CNN + Transformer, which indicated that the model has the extensive ability for OG identification in moso bamboo.

Conclusions: CNN + Transformer model is feasible and obtains the credible predictive results. It may also provide valuable references for other related research. As our knowledge, this is the first model to adopt the deep learning techniques for identifying orphan genes in plants.

Keywords: Convolutional neural network; Deep learning; Moso bamboo; Orphan genes; Transformer neural network.

MeSH terms

  • Amino Acids / metabolism
  • Deep Learning*
  • Gene Expression Regulation, Plant*
  • Poaceae / genetics

Substances

  • Amino Acids