Visual Cluster Grounding for Image Captioning

IEEE Trans Image Process. 2022;31:3920-3934. doi: 10.1109/TIP.2022.3177318. Epub 2022 Jun 9.

Abstract

Attention mechanisms have been extensively adopted in vision and language tasks such as image captioning. They encourage a captioning model to dynamically ground appropriate image regions when generating words or phrases, which is critical for alleviating object hallucination and language bias. However, current studies show that the grounding accuracy of existing captioners is still far from satisfactory. Recently, much effort has been devoted to improving grounding accuracy by linking words to the full content of objects in images. However, due to noisy grounding annotations and large variations in object appearance, such strict word-object alignment regularization may not be optimal for improving captioning performance. In this paper, to improve both grounding and captioning performance, we propose a novel grounding model that implicitly links words to the supporting evidence in the image. The proposed model encourages the captioner to dynamically focus on informative regions of objects, which can be either discriminative parts or the full object content. With these relaxed constraints, the proposed captioning model captures correct linguistic characteristics and visual relevance, and thus generates more grounded image captions. In addition, we propose a novel quantitative metric for evaluating the correctness of soft attention by considering the overall contribution of all object proposals when generating a given word. The proposed grounding model can be seamlessly plugged into most attention-based architectures without adding inference complexity. We conduct extensive experiments on the Flickr30k (Young et al., 2014) and MS COCO (Lin et al., 2014) datasets, demonstrating that the proposed method consistently improves both grounding and captioning. Moreover, the proposed attention evaluation metric shows better consistency with captioning performance.
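The abstract does not give the exact formulation of the proposed soft-attention evaluation metric, only that it considers the overall contribution of all object proposals when generating a word. As a minimal sketch of that general idea, assuming softmax attention weights over region proposals, one annotated box per grounded word, and a hypothetical IoU threshold of 0.5 (all names below are illustrative, not the paper's notation), one could measure the total attention mass placed on proposals that overlap the annotated region:

import numpy as np

def iou(box_a, box_b):
    # Intersection-over-union of two boxes in (x1, y1, x2, y2) format.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def soft_attention_grounding_score(attention, proposals, gt_boxes, iou_thresh=0.5):
    # attention: (num_grounded_words, num_proposals) softmax attention weights
    # proposals: (num_proposals, 4) candidate region boxes
    # gt_boxes:  list of length num_grounded_words, each an annotated (4,) box
    # Returns the mean, over grounded words, of the attention mass assigned to
    # proposals overlapping the annotated box by at least iou_thresh.
    scores = []
    for word_attn, gt in zip(attention, gt_boxes):
        mass = sum(
            w for w, prop in zip(word_attn, proposals) if iou(prop, gt) >= iou_thresh
        )
        scores.append(mass)
    return float(np.mean(scores)) if scores else 0.0

Unlike an argmax-based check that credits only the single most-attended proposal, a score of this form reflects the aggregate behavior of the soft attention distribution, which is the property the abstract emphasizes.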

MeSH terms

  • Data Collection
  • Language*