MLNet: a multi-level multimodal named entity recognition architecture

Hanming Zhai; Xiaojun Lv; Zhiwen Hou; Xin Tong; Fanliang Bu

doi:10.3389/fnbot.2023.1181143

Front Neurorobot. 2023 Jun 20;17:1181143. doi: 10.3389/fnbot.2023.1181143. eCollection 2023.

Authors

Hanming Zhai¹, Xiaojun Lv², Zhiwen Hou¹, Xin Tong¹, Fanliang Bu¹

Affiliations

¹ School of Information Network Security, People's Public Security University of China, Beijing, China.
² Institute of Computing Technology, China Academy of Railway Sciences Corporation Limited, Beijing, China.

Abstract

In human-computer interaction, accurately identifying the objects under discussion helps robots accomplish subsequent tasks such as decision-making or recommendation; object determination is therefore of great interest as a prerequisite task. Whether the task is named entity recognition (NER) in natural language processing (NLP) or object detection (OD) in computer vision (CV), the essence is object recognition. Multimodal approaches are now widely used in basic image recognition and natural language processing tasks. Such multimodal architectures can perform entity recognition more accurately, but for short texts and images that contain more noise, we find that image-text-based multimodal named entity recognition (MNER) architectures still leave room for optimization. In this study, we propose a new multi-level multimodal named entity recognition architecture, a network that extracts useful visual information to boost semantic understanding and thereby improve entity identification. Specifically, we first encoded the image and the text separately and then built a symmetric Transformer-based neural network architecture for multimodal feature fusion. We used a gating mechanism to filter visual information that is significantly related to the textual content, in order to enhance text understanding and achieve semantic disambiguation. Furthermore, we incorporated character-level vector encoding to reduce text noise. Finally, we employed Conditional Random Fields for the label classification task. Experiments on the Twitter dataset show that our model improves accuracy on the MNER task.
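
The gating step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulation, so everything here is an assumption made for illustration: the function name `gated_fusion`, the scalar per-token gate, the residual-style combination, and the parameter names `w_text`, `w_visual`, and `bias` are all hypothetical and may differ from what MLNet actually uses. The idea shown is only that a sigmoid gate computed from the text and visual features scales how much visual information is added to a token's representation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_vec, visual_vec, w_text, w_visual, bias):
    """Fuse one token's text feature with a visual feature (illustrative only).

    Assumed form, not taken from the paper:
        gate  = sigmoid(w_text . text_vec + w_visual . visual_vec + bias)
        fused = text_vec + gate * visual_vec
    A gate near 0 suppresses the visual feature; a gate near 1 passes it through.
    """
    if not (len(text_vec) == len(visual_vec) == len(w_text) == len(w_visual)):
        raise ValueError("all vectors must have the same length")

    # Scalar relevance score from both modalities.
    score = sum(w * t for w, t in zip(w_text, text_vec))
    score += sum(w * v for w, v in zip(w_visual, visual_vec))
    score += bias

    gate = sigmoid(score)
    # Add the gated visual feature to the text feature, element by element.
    fused = [t + gate * v for t, v in zip(text_vec, visual_vec)]
    return gate, fused


if __name__ == "__main__":
    text = [1.0, 2.0]
    visual = [4.0, -2.0]
    zeros = [0.0, 0.0]
    # With zero weights the gate depends only on the bias.
    print(gated_fusion(text, visual, zeros, zeros, 0.0))
    print(gated_fusion(text, visual, zeros, zeros, -50.0))
```

In this sketch a strongly negative score drives the gate toward zero, so an image unrelated to the text leaves the token representation essentially unchanged. In a trained model the weights and bias would be learned rather than set by hand.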

Keywords: cross task; multi-head attention; multimodal named entity recognition; pre-training; short text.

Grants and funding

This study was supported by the National Natural Science Foundation of China-China State Railway Group Co., Ltd. Railway Basic Research Joint Fund (Grant No. U2268217) and the Scientific Funding for China Academy of Railway Sciences Corporation Limited (No. 2021YJ183). The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.