Multimodal representation learning for tourism recommendation with two-tower architecture

PLoS One. 2024 Feb 23;19(2):e0299370. doi: 10.1371/journal.pone.0299370. eCollection 2024.

Abstract

Personalized recommendation plays an important role in many online services. In tourism recommendation, attractions carry rich context and content information; these implicit features include not only text but also images and videos. To exploit such features, researchers typically introduce richer feature information or more efficient feature representation methods, but introducing large amounts of feature information without restraint inevitably degrades the performance of the recommendation system. We propose a novel heterogeneous multimodal representation learning method for tourism recommendation. The proposed model is based on a two-tower architecture in which the item tower handles multimodal latent features: a Bidirectional Long Short-Term Memory network (Bi-LSTM) extracts items' text features, an External Attention Transformer (EANet) extracts their image features, and these feature vectors are concatenated with item ID embeddings to enrich the item representation. To increase the model's expressiveness, we introduce a deep fully connected stack layer that fuses the multimodal feature vectors and captures the hidden relationships among them. Tested on three different datasets, our model outperforms the baseline models in NDCG and precision.
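The item tower described in the abstract can be sketched as follows. This is a minimal illustrative PyTorch implementation, not the authors' code: a Bi-LSTM encodes text tokens, a plain linear projection stands in for the EANet image encoder (EANet itself is a full attention network), the resulting vectors are concatenated with an item ID embedding, and a deep fully connected stack fuses them. All dimensions, vocabulary sizes, and layer widths are assumed for illustration.

```python
import torch
import torch.nn as nn

class ItemTower(nn.Module):
    """Illustrative item tower: Bi-LSTM text encoder + image-feature
    projection (stand-in for EANet) + item ID embedding, fused by a
    deep fully connected stack. All sizes are assumptions."""
    def __init__(self, vocab_size=10000, n_items=5000,
                 txt_dim=64, img_dim=512, id_dim=32, out_dim=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, txt_dim)
        # Bi-LSTM text encoder; we use the final hidden states
        # of both directions as the text representation.
        self.bilstm = nn.LSTM(txt_dim, txt_dim, batch_first=True,
                              bidirectional=True)
        # Hypothetical placeholder for EANet: project precomputed
        # image features into the shared space.
        self.img_proj = nn.Linear(img_dim, txt_dim)
        self.id_emb = nn.Embedding(n_items, id_dim)
        fused = 2 * txt_dim + txt_dim + id_dim
        # Deep fully connected stack fusing the modalities.
        self.mlp = nn.Sequential(
            nn.Linear(fused, 128), nn.ReLU(),
            nn.Linear(128, out_dim))

    def forward(self, text_ids, img_feat, item_ids):
        _, (h, _) = self.bilstm(self.tok_emb(text_ids))
        txt = torch.cat([h[0], h[1]], dim=-1)   # (B, 2*txt_dim)
        img = self.img_proj(img_feat)           # (B, txt_dim)
        iid = self.id_emb(item_ids)             # (B, id_dim)
        return self.mlp(torch.cat([txt, img, iid], dim=-1))

class UserTower(nn.Module):
    """Simple user tower over user ID embeddings."""
    def __init__(self, n_users=1000, id_dim=32, out_dim=64):
        super().__init__()
        self.id_emb = nn.Embedding(n_users, id_dim)
        self.mlp = nn.Sequential(nn.Linear(id_dim, 128), nn.ReLU(),
                                 nn.Linear(128, out_dim))
    def forward(self, user_ids):
        return self.mlp(self.id_emb(user_ids))

# Two-tower scoring: dot product of item and user embeddings.
item_t, user_t = ItemTower(), UserTower()
item_vec = item_t(torch.randint(0, 10000, (4, 20)),  # text token IDs
                  torch.randn(4, 512),               # image features
                  torch.randint(0, 5000, (4,)))      # item IDs
user_vec = user_t(torch.randint(0, 1000, (4,)))      # user IDs
scores = (item_vec * user_vec).sum(-1)               # shape (4,)
```

In a real two-tower retrieval setup, the two towers share no parameters, so item embeddings can be precomputed offline and matched against user embeddings with an approximate nearest-neighbor index at serving time.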

MeSH terms

  • Electric Power Supplies
  • Humans
  • Learning*
  • Memory, Long-Term
  • Research Personnel
  • Tourism*

Grants and funding

The work was supported by the FDCT Funding Scheme for Postdoctoral Researchers of Higher Education Institutions, grant number 0003/2021/APD, under the supervision of Prof. Shengbin Liang.