Dual Position Relationship Transformer for Image Captioning

Big Data. 2022 Dec;10(6):515-527. doi: 10.1089/big.2021.0262. Epub 2022 Jan 4.

Abstract

Employing feature vectors extracted from an object detector has been shown to be effective in improving the performance of image captioning. However, existing frameworks extract insufficient information, such as positional relationships, which are crucial for judging the relationships between objects. To fill this gap, we present a dual position relationship transformer (DPR) for image captioning. The architecture improves both the image information extraction and the description encoding steps: it first calculates the relative position (RP) and absolute position (AP) between objects, then integrates the dual position relationship information into self-attention. Specifically, a convolutional neural network (CNN) and Faster R-CNN are applied to extract image features and detect objects, and the RP and AP of the generated object boxes are then calculated. The former is expressed in coordinate form, and the latter is computed by sinusoidal encoding. In addition, to better model the sequential and temporal relationships in the description, DPR adopts long short-term memory (LSTM) to encode the text vectors. We conduct extensive experiments on the Microsoft COCO: Common Objects in Context (MSCOCO) image captioning data set, which show that our method achieves superior performance: the Consensus-based Image Description Evaluation (CIDEr) score increases to 114.6 after training for 30 epochs, and the model runs 2 times faster than other competitive methods. An ablation study verifies the effectiveness of our proposed modules.
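The abstract names two geometric signals but not their exact parameterization. The sketch below illustrates one plausible reading, assuming the RP follows the common pairwise coordinate form (normalized center offsets and log size ratios) and the AP applies the standard transformer sinusoid to box centers; the function names and the embedding size `d_model` are illustrative, not taken from the paper.

```python
import numpy as np

def absolute_position_encoding(boxes, d_model=64):
    """Sinusoidal encoding of box centers (assumed form of the AP step).

    boxes: (N, 4) array of [x1, y1, x2, y2] coordinates.
    Returns an (N, d_model) encoding, half for x-centers, half for y-centers.
    """
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)   # (N, 2)
    half = d_model // 2
    # Standard transformer frequency schedule, applied per coordinate.
    freqs = 1.0 / (10000 ** (np.arange(0, half, 2) / half))         # (half/2,)
    parts = []
    for coord in (centers[:, 0], centers[:, 1]):
        angles = coord[:, None] * freqs[None, :]                    # (N, half/2)
        parts.append(np.concatenate([np.sin(angles), np.cos(angles)], axis=1))
    return np.concatenate(parts, axis=1)                            # (N, d_model)

def relative_position(boxes):
    """Pairwise relative geometry in coordinate form (assumed RP parameterization).

    Returns an (N, N, 4) tensor: normalized center offsets and log size ratios,
    which could be projected to a scalar bias added inside self-attention.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    dx = (cx[None, :] - cx[:, None]) / w[:, None]   # x-offset, scaled by box width
    dy = (cy[None, :] - cy[:, None]) / h[:, None]   # y-offset, scaled by box height
    dw = np.log(w[None, :] / w[:, None])            # log width ratio
    dh = np.log(h[None, :] / h[:, None])            # log height ratio
    return np.stack([dx, dy, dw, dh], axis=-1)      # (N, N, 4)
```

In a DPR-style model, the AP encoding would be added to (or concatenated with) each region's visual feature, while the RP tensor would be mapped to per-pair attention biases, so that both absolute and relative geometry reach the self-attention layers.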

Keywords: attention mechanism; faster R-CNN; image captioning; position relationship; transformer.

MeSH terms

  • Information Storage and Retrieval*
  • Neural Networks, Computer*