Title: Breakthrough in Video Understanding: Memory-Based AI Agent Matches Top-Performing Models
Subheading: Researchers from Beijing Academy of General Artificial Intelligence and Peking University Develop VideoAgent, Surpassing Baselines by 30%
By [Your Name], [Your Affiliation]
Date: [Publication Date]
Beijing, China – In a significant advance for computer vision and artificial intelligence, researchers from the Beijing Academy of General Artificial Intelligence and Peking University have introduced VideoAgent, a memory-based agent for video understanding whose performance rivals that of the closed-source Gemini 1.5 Pro model.
Video understanding remains a major challenge in computer vision and AI. While recent progress has been made through end-to-end training of multimodal large language models, these models often struggle with long videos: memory consumption grows with video length, and self-attention mechanisms have difficulty capturing long-range dependencies.
The new study, accepted at the European Conference on Computer Vision (ECCV) 2024, presents VideoAgent, which combines structured memory with the reasoning and tool-use capabilities of large language models to extract key information from videos and answer questions about them.
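To illustrate this agent pattern, here is a minimal Python sketch of a large language model answering a question by calling tools over video memory. The tool names, the JSON reply format, and the llm callable are assumptions made for this example, not code taken from the VideoAgent implementation; the actual tool set and prompts are described in the paper and repository linked below.

```python
# Hypothetical sketch of an LLM-as-agent loop over video memory.
# Tool names and the dispatch scheme are illustrative assumptions.
import json
from typing import Callable


def caption_retrieval(query: str) -> str:
    """Stub: return captions from temporal memory relevant to the query."""
    return "stub: matching captions for " + query


def object_memory_query(category: str) -> str:
    """Stub: return tracked objects of a given category from object memory."""
    return "stub: tracked objects of category " + category


TOOLS: dict[str, Callable[[str], str]] = {
    "caption_retrieval": caption_retrieval,
    "object_memory_query": object_memory_query,
}


def answer_question(question: str, llm: Callable[[str], str]) -> str:
    """Let the LLM iteratively call tools until it emits a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(5):  # cap the number of tool calls
        # The LLM is prompted to reply with JSON: either a tool call
        # {"tool": ..., "argument": ...} or a final {"answer": ...}.
        reply = json.loads(llm(transcript))
        if "answer" in reply:
            return reply["answer"]
        result = TOOLS[reply["tool"]](reply["argument"])
        transcript += f"Tool {reply['tool']} returned: {result}\n"
    return "No answer within the tool-call budget."
```

Capping the number of tool calls keeps inference cost bounded regardless of video length, which is the central appeal of querying a memory instead of feeding every frame to the model.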
The researchers designed two memory components for VideoAgent: a temporal memory that records a description of the video every two seconds, and an object memory that tracks and re-identifies the objects and people appearing in the footage. These memories are built with state-of-the-art models: LaViLa for video captioning, RT-DETR and ByteTrack for object detection and tracking, and CLIP and DINOv2 features for object re-identification.
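To make this two-part design concrete, the following Python sketch shows one way such a memory could be organized. All class and method names here (VideoMemory, TemporalEntry, ObjectRecord, and so on) are illustrative assumptions rather than the authors' actual code, which is available in the GitHub repository linked below.

```python
# Illustrative sketch of a two-part video memory (names are hypothetical,
# not taken from the VideoAgent codebase).
from dataclasses import dataclass, field

import numpy as np


@dataclass
class TemporalEntry:
    """One caption describing a short video segment (e.g., 2 seconds)."""
    start_sec: float
    end_sec: float
    caption: str           # generated by a video-text model such as LaViLa
    embedding: np.ndarray  # text embedding used for retrieval


@dataclass
class ObjectRecord:
    """One tracked object or person, re-identified across the video."""
    object_id: int
    category: str                       # from a detector such as RT-DETR
    appearances: list[tuple[float, float]] = field(default_factory=list)
    features: list[np.ndarray] = field(default_factory=list)  # CLIP/DINOv2


class VideoMemory:
    """Container the agent queries when answering questions."""

    def __init__(self) -> None:
        self.temporal: list[TemporalEntry] = []
        self.objects: dict[int, ObjectRecord] = {}

    def add_segment(self, entry: TemporalEntry) -> None:
        self.temporal.append(entry)

    def register_observation(self, object_id: int, category: str,
                             t0: float, t1: float,
                             feature: np.ndarray) -> None:
        record = self.objects.setdefault(
            object_id, ObjectRecord(object_id, category))
        record.appearances.append((t0, t1))
        record.features.append(feature)

    def retrieve_captions(self, query_emb: np.ndarray,
                          k: int = 5) -> list[TemporalEntry]:
        """Return the k captions whose embeddings best match the query."""
        scored = sorted(
            self.temporal,
            key=lambda e: -float(np.dot(e.embedding, query_emb)),
        )
        return scored[:k]
```

Keeping per-segment captions separate from per-object tracks lets the agent answer both "what happened when" and "who or what appeared" questions without reprocessing the raw video.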
VideoAgent’s performance was evaluated on three long-video understanding benchmarks: EgoSchema, WorldQA, and NExT-QA. The results show that VideoAgent outperforms existing open-source multimodal large language models by 30% and is comparable to the strongest closed-source model, Gemini 1.5 Pro.
For example, on the EgoSchema dataset, VideoAgent achieved an accuracy of 60.2%, close to Gemini 1.5 Pro’s 63.2%. The system also performed strongly on WorldQA across both multiple-choice and open-ended questions, thanks to its ability to combine commonsense knowledge, reasoning, and its video memory.
The study’s findings are detailed in a paper available on arXiv (https://arxiv.org/abs/2403.11481) and further information can be found on the project’s homepage (https://videoagent.github.io/). The code for VideoAgent is also available on GitHub (https://github.com/YueFan1014/VideoAgent).
This research represents a significant step forward in video understanding and has the potential to revolutionize applications in areas such as surveillance, education, and entertainment.