GT-Finder: Classify the family of glucose transporters with pre-trained BERT language models

Syed Muazzam Ali Shah; Semmy Wellem Taju; Quang-Thai Ho; Trinh-Trung-Duong Nguyen; Yu-Yen Ou

doi:10.1016/j.compbiomed.2021.104259

Comput Biol Med. 2021 Apr:131:104259. doi: 10.1016/j.compbiomed.2021.104259. Epub 2021 Feb 7.

Authors

Syed Muazzam Ali Shah¹, Semmy Wellem Taju¹, Quang-Thai Ho¹, Trinh-Trung-Duong Nguyen¹, Yu-Yen Ou²

Affiliations

¹ Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan.
² Department of Computer Science & Engineering, Yuan Ze University, Chungli, 32003, Taiwan. Electronic address: yienou@gmail.com.

PMID: 33581474

Abstract

Recently, language representation models have drawn a lot of attention in the field of natural language processing (NLP) due to their remarkable results. Among them, BERT (Bidirectional Encoder Representations from Transformers) has proven to be a simple yet powerful language model that achieves state-of-the-art performance. BERT adopts the concept of contextualized word embeddings to capture the semantics and context in which words appear. We utilized pre-trained BERT models to extract features from protein sequences for discriminating three families of glucose transporters: the major facilitator superfamily of glucose transporters (GLUTs), the sodium-glucose linked transporters (SGLTs), and the sugars will eventually be exported transporters (SWEETs). We treated protein sequences as sentences and transformed them into fixed-length, meaningful vectors in which each amino acid is represented by a 768- or 1024-dimensional vector. We observed that the BERT-Base and BERT-Large models improved performance by more than 4% in terms of average sensitivity and Matthews correlation coefficient (MCC), indicating the effectiveness of this approach. We also developed a bidirectional transformer-based protein model (TransportersBERT) for comparison with existing pre-trained BERT models.
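The two metrics named in the abstract, sensitivity and MCC, have standard definitions, and the following minimal Python sketch shows how they could be computed per family. It assumes a one-vs-rest treatment of the three families, which the abstract does not specify; the function names and the family labels used as strings are illustrative choices, not taken from the paper.

```python
import math

# Family labels are illustrative strings; the abstract names the three
# families but does not say how they were encoded.
FAMILIES = ("GLUT", "SGLT", "SWEET")


def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN when `positive` is the positive class
    and every other family is treated as negative (an assumption)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Sensitivity (recall): TP / (TP + FN); 0.0 if there are no positives."""
    total = tp + fn
    return tp / total if total else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient for a 2x2 confusion matrix.
    Returns 0.0 when the denominator is zero (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def per_family_report(y_true, y_pred, families=FAMILIES):
    """Return {family: {"sensitivity": ..., "mcc": ...}} using one-vs-rest."""
    report = {}
    for family in families:
        tp, fp, tn, fn = one_vs_rest_counts(y_true, y_pred, family)
        report[family] = {
            "sensitivity": sensitivity(tp, fn),
            "mcc": mcc(tp, fp, tn, fn),
        }
    return report


def average_metric(report, name):
    """Unweighted mean of one metric across the families in `report`."""
    return sum(entry[name] for entry in report.values()) / len(report)
```

Calling `per_family_report(y_true, y_pred)` with two equal-length lists of family labels yields one sensitivity and one MCC value per family, and `average_metric(report, "sensitivity")` gives their unweighted mean. Whether the paper averaged in this way is not stated in the abstract.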

Keywords: BERT; Bidirectional encoder representations from transformers; Contextualized word embedding; Feature importance; Glucose transporter.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Glucose
Glucose Transport Proteins, Facilitative*
Language
Natural Language Processing*
Semantics

Substances

Glucose Transport Proteins, Facilitative
Glucose