X²-VLM: All-in-One Pre-Trained Model for Vision-Language Tasks

Yan Zeng; Xinsong Zhang; Hang Li; Jiawei Wang; Jipeng Zhang; Wangchunshu Zhou

doi:10.1109/TPAMI.2023.3339661

IEEE Trans Pattern Anal Mach Intell. 2024 May;46(5):3156-3168. doi: 10.1109/TPAMI.2023.3339661. Epub 2024 Apr 3.

PMID: 38090826

Abstract

Vision-language pre-training aims to learn alignments between vision and language from a large amount of data. Most existing methods only learn image-text alignments. Some others utilize pre-trained object detectors to leverage vision-language alignments at the object level. In this paper, we propose to learn multi-grained vision-language alignments with a unified pre-training framework that learns multi-grained aligning and multi-grained localization simultaneously. Based on it, we present X²-VLM, an all-in-one model with a flexible modular architecture, in which we further unify image-text pre-training and video-text pre-training in one model. X²-VLM is able to learn unlimited visual concepts associated with diverse text descriptions. Experimental results show that X²-VLM performs the best at base and large scale for both image-text and video-text tasks, making a good trade-off between performance and model scale. Moreover, we show that the modular design of X²-VLM results in high transferability, allowing it to be utilized in any language or domain. For example, by simply replacing the text encoder with XLM-R, X²-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training.