Identification of efflux proteins based on contextual representations with deep bidirectional transformer encoders

Anal Biochem. 2021 Nov 15:633:114416. doi: 10.1016/j.ab.2021.114416. Epub 2021 Oct 14.

Abstract

Efflux proteins are transport proteins expressed in the plasma membrane that move unwanted toxic substances out of the cell through specific efflux pumps. Several computational approaches have been proposed to predict transport proteins and thereby to understand the mechanism by which ions move across cell membranes. However, few methods have been developed to identify efflux proteins. This paper presents an approach that combines contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) with a Support Vector Machine (SVM) classifier. BERT is a highly effective pre-trained language model that performs exceptionally well on several Natural Language Processing (NLP) tasks. The contextualized representations from BERT were therefore used to capture the multiple context-dependent interpretations of identical amino acids in a sequence. A dataset of efflux proteins with annotations was first established. Feature vectors were then extracted by passing the protein sequences through the hidden layers of the pre-trained model. The proposed method was trained on the complete training datasets to identify efflux proteins and achieved accuracies of 94.15% and 87.13% in independent tests on the membrane and transport datasets, respectively. This study opens a research avenue for the use of contextualized word embeddings in Bioinformatics and Computational Biology.
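The idea that identical amino acids receive different representations depending on context, and that per-residue vectors are turned into one feature vector per protein, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the 3-dimensional `STATIC` vectors, the neighbour-averaging function `contextual_vectors` (a toy stand-in for the hidden layers of a pre-trained model), and the use of mean pooling in `mean_pool`. The abstract does not state how the hidden-layer outputs were pooled, so this should not be read as the authors' pipeline.

```python
from typing import Dict, List

# Toy static vectors for three residues (hypothetical values; a real
# pre-trained model would supply much larger, learned vectors).
STATIC: Dict[str, List[float]] = {
    "A": [1.0, 0.0, 0.0],
    "L": [0.0, 1.0, 0.0],
    "K": [0.0, 0.0, 1.0],
}


def contextual_vectors(seq: str) -> List[List[float]]:
    """Give each residue a vector that depends on its neighbours.

    Toy stand-in for a transformer hidden layer: each position becomes the
    average of the static vectors in a window of radius 1 around it.
    """
    out = []
    for i in range(len(seq)):
        window = seq[max(0, i - 1): i + 2]
        dim = len(STATIC[window[0]])
        out.append(
            [sum(STATIC[r][d] for r in window) / len(window) for d in range(dim)]
        )
    return out


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Collapse per-residue vectors into one fixed-length feature vector."""
    if not vectors:
        raise ValueError("cannot pool an empty sequence")
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


vecs = contextual_vectors("ALAK")
features = mean_pool(vecs)

# The two 'A' residues (positions 0 and 2) get different vectors because
# their neighbours differ.
print(vecs[0], vecs[2])
print(features)
```

In a full pipeline, fixed-length vectors of this kind (one per protein) would be the inputs to an SVM classifier; the pooling makes proteins of different lengths comparable.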

Keywords: Contextualized word embeddings; Efflux proteins; Feature extraction; Support vector machine.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Carrier Proteins / analysis*
  • Computational Biology*
  • Natural Language Processing*
  • Support Vector Machine*

Substances

  • Carrier Proteins