GT-Finder: Classify the family of glucose transporters with pre-trained BERT language models

Comput Biol Med. 2021 Apr;131:104259. doi: 10.1016/j.compbiomed.2021.104259. Epub 2021 Feb 7.

Abstract

Recently, language representation models have drawn considerable attention in the field of natural language processing (NLP) due to their remarkable results. Among them, BERT (Bidirectional Encoder Representations from Transformers) has proven to be a simple yet powerful language model that has achieved new state-of-the-art performance. BERT adopts the concept of contextualized word embeddings to capture the semantics and context in which words appear. We utilized pre-trained BERT models to extract features from protein sequences to discriminate among three families of glucose transporters: the major facilitator superfamily of glucose transporters (GLUTs), the sodium-glucose linked transporters (SGLTs), and the sugars will eventually be exported transporters (SWEETs). We treated protein sequences as sentences and transformed them into meaningful fixed-length vectors, in which each amino acid is represented by a 768- or 1024-dimensional contextualized vector. We observed that the BERT-Base and BERT-Large models improved performance by more than 4% in terms of average sensitivity and Matthews correlation coefficient (MCC), indicating the effectiveness of this approach. We also developed a bidirectional transformer-based protein model (TransportersBERT) for comparison with the existing pre-trained BERT models.
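
The abstract does not give the authors' exact pipeline; the following is a minimal sketch of the general idea of treating a protein sequence as a "sentence" and extracting contextualized BERT features, written with the Hugging Face transformers library. The checkpoint name ("bert-base-uncased"), the toy sequence, and the mean-pooling step are illustrative assumptions, not the paper's method (the paper uses BERT-Base, BERT-Large, and its own TransportersBERT, and a separate downstream classifier).

    # Illustrative sketch only: per-residue BERT embeddings pooled into a
    # fixed-length sequence vector. Not the authors' exact code or model.
    import torch
    from transformers import BertTokenizer, BertModel

    # Placeholder general-domain checkpoint; a protein-specific BERT would
    # normally be used in practice.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    # Treat the protein sequence as a sentence of space-separated residues.
    sequence = "M K T V L F A"  # toy fragment for illustration
    inputs = tokenizer(sequence, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # One 768-dimensional vector per token for BERT-Base (1024 for BERT-Large).
    per_residue = outputs.last_hidden_state.squeeze(0)   # (num_tokens, 768)

    # Mean pooling gives a fixed-length feature vector for the whole sequence,
    # which could then be fed to a classifier for the three transporter families.
    sequence_vector = per_residue.mean(dim=0)             # shape: (768,)
    print(sequence_vector.shape)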

Keywords: BERT; Bidirectional encoder representations from transformers; Contextualized word embedding; Feature importance; Glucose transporter.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Glucose
  • Glucose Transport Proteins, Facilitative*
  • Language
  • Natural Language Processing*
  • Semantics

Substances

  • Glucose Transport Proteins, Facilitative
  • Glucose