Unified Video-Language Pre-training with Synchronized Audio

Mo, Shentong; Wang, Haofan; Li, Huaxia; Tang, Xu

(Submitted on 12 May 2024)

Abstract: Video-language pre-training is a typical and challenging problem that aims to learn visual and textual representations from large-scale data in a self-supervised way. Existing pre-training approaches either capture the correspondence of image-text pairs or exploit the temporal ordering of frames. However, they do not explicitly explore the natural synchronization between audio and the other two modalities. In this work, we propose an enhanced framework for Video-Language pre-training with Synchronized Audio, termed VLSA, that learns tri-modal representations in a unified self-supervised transformer. Specifically, VLSA jointly aggregates embeddings of local patches and global tokens for video, text, and audio. Furthermore, we use local-patch masked modeling to learn modality-aware features, and global audio matching to capture audio-guided features for video and text. We conduct extensive experiments on retrieval across text, video, and audio. Our simple model, pre-trained on only 0.9M data, achieves improved results over state-of-the-art baselines. In addition, qualitative visualizations showcase the advantage of VLSA in learning discriminative visual-textual representations.
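
The abstract names two objectives, local-patch masked modeling and global audio matching, without specifying their exact form. The following is a minimal NumPy sketch of what such objectives could look like, not the authors' implementation. It assumes a symmetric InfoNCE-style contrastive loss for the matching term and uniform random patch selection for the masking term. The function names (`matching_loss`, `random_patch_mask`), the temperature of 0.07, and the mask ratio are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def matching_loss(audio_global, other_global, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed form, not taken from the paper).

    audio_global, other_global: arrays of shape (batch, dim), where row i of
    each array is assumed to come from the same clip. `other_global` stands
    for the global embedding of video or text.
    """
    a = l2_normalize(audio_global)
    b = l2_normalize(other_global)
    logits = a @ b.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])          # matched pairs lie on the diagonal
    loss_a2b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2b + loss_b2a)


def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask marking which local patches would be hidden (assumed uniform sampling)."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))
video = audio + 0.01 * rng.normal(size=(8, 16))   # nearly synchronized pairs
loss_aligned = matching_loss(audio, video)
loss_shuffled = matching_loss(audio, video[::-1])  # deliberately mismatched pairs
mask = random_patch_mask(num_patches=16, mask_ratio=0.5, rng=rng)
```

In this toy setup the loss is lower when each audio embedding is paired with its own clip than when the pairing is reversed, which is the behavior a synchronization-based matching objective is meant to reward.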

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2405.07202 [cs.CV]
	(or arXiv:2405.07202v1 [cs.CV] for this version)

Submission history

From: Shentong Mo
[v1] Sun, 12 May 2024 07:59:46 GMT (752kb,D)
