A foundational visual encoder for video understanding – Google Research Blog

Long Zhao, Senior Research Scientist, and Ting Liu, Senior Staff Software Engineer at Google Research, discuss the importance of analyzing the vast number of videos available on the web. Videos offer a unique perspective on the world, capturing movement and dynamic relationships between entities that static images cannot. Traditional image understanding models fall short when it comes to analyzing the complexity of videos, leading to the development of specialized models like VideoCLIP, InternVideo, VideoCoCa, and UMT.

To address the need for a single model for general-purpose video understanding, the authors introduce “VideoPrism: A Foundational Visual Encoder for Video Understanding.” VideoPrism is designed to handle various video understanding tasks such as classification, localization, retrieval, captioning, and question answering. The model is pre-trained on a massive dataset of 36 million high-quality video-text pairs and 582 million video clips with noisy or machine-generated parallel text. This diverse pre-training data allows VideoPrism to excel in tasks that require an understanding of both appearance and motion.

The authors describe the two-stage training approach used for VideoPrism, leveraging both text descriptions and visual content within a video. By combining these pre-training signals, VideoPrism achieves state-of-the-art performance across a wide range of video understanding tasks. The model outperforms existing foundation models on 30 out of 33 benchmarks, demonstrating its versatility and effectiveness.

Furthermore, the authors explore combining VideoPrism with large language models (LLMs) for video-language tasks, such as video-text retrieval, captioning, and question answering. The combined models set new benchmarks on vision-language tasks, highlighting VideoPrism’s compatibility with language models.

In scientific applications, VideoPrism surpasses domain-specific models on datasets from fields like ethology, behavioral neuroscience, and ecology. The model shows promise in transforming how scientists analyze video data across different domains.

In conclusion, VideoPrism represents a significant advancement in general-purpose video understanding, with implications for scientific discovery, education, and healthcare. The authors emphasize their commitment to responsible research guided by AI Principles and hope that VideoPrism will lead to future breakthroughs in AI and video analysis.

Source link