

Title: Breakthrough in Video Understanding: Memory-Based AI Agent Matches Top-Performing Models

Subheading: Researchers from Beijing Academy of General Artificial Intelligence and Peking University Develop VideoAgent, Surpassing Baselines by 30%

Beijing, China – Researchers from the Beijing Academy of General Artificial Intelligence and Peking University have introduced VideoAgent, a memory-based video understanding agent whose performance is comparable to that of Gemini 1.5 Pro.

Video understanding remains a major challenge in computer vision and AI. While recent progress has been made through the end-to-end training of multimodal large language models, these models often struggle with long videos due to increased memory consumption and the inability of self-attention mechanisms to capture long-range relationships.

The new study, accepted at the European Conference on Computer Vision (ECCV) 2024, presents VideoAgent, which combines structured memory with the reasoning and tool-use capabilities of large language models to extract key information from videos and answer questions about them.

The researchers designed two memory components for VideoAgent: a temporal memory that stores events every two seconds, and an object memory that tracks and re-identifies objects and people in the video. These memories are built with existing models: LaViLa for video-text generation, RT-DETR and ByteTrack for object detection and tracking, and CLIP and DINOv2 features for object re-identification.
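
The two memories described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the class names, the `observe` and `segment_at` methods, and the 0.9 similarity threshold are all hypothetical, and plain lists of floats stand in for the CLIP and DINOv2 features. Only the two-second segment length comes from the article.

```python
import math
from dataclasses import dataclass, field

SEGMENT_SECONDS = 2.0  # segment length stated in the article


@dataclass
class TemporalMemory:
    """Hypothetical store: one text description per two-second segment."""
    captions: list = field(default_factory=list)

    def add_segment(self, caption: str) -> None:
        self.captions.append(caption)

    def segment_at(self, t: float) -> str:
        # Map a timestamp in seconds to the segment that contains it.
        return self.captions[int(t // SEGMENT_SECONDS)]


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class ObjectMemory:
    """Hypothetical store: one feature vector and appearance list per object."""
    threshold: float = 0.9  # assumed value, chosen only for illustration
    features: list = field(default_factory=list)
    appearances: list = field(default_factory=list)

    def observe(self, feature, segment_index: int) -> int:
        # Re-identify: pick the most similar known object, if similar enough.
        best_id, best_sim = -1, self.threshold
        for obj_id, known in enumerate(self.features):
            sim = cosine(known, feature)
            if sim >= best_sim:
                best_id, best_sim = obj_id, sim
        if best_id >= 0:
            if segment_index not in self.appearances[best_id]:
                self.appearances[best_id].append(segment_index)
            return best_id
        # Otherwise register a new object.
        self.features.append(list(feature))
        self.appearances.append([segment_index])
        return len(self.features) - 1
```

In this sketch, a question such as "when does the cup appear?" would be answered by looking up the object's appearance list and reading the matching segment descriptions from the temporal memory. How VideoAgent itself queries its memories is described in the paper.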

VideoAgent was evaluated on three long-video understanding datasets: EgoSchema, WorldQA, and NExT-QA. The results show that VideoAgent outperforms existing open-source multimodal large language models by 30% and is comparable to the top closed-source model, Gemini 1.5 Pro.

For example, on the EgoSchema dataset, VideoAgent achieved an accuracy of 60.2%, close to Gemini 1.5 Pro’s 63.2%. The model also demonstrated strong performance on the WorldQA dataset for both multiple-choice and open questions, thanks to its ability to integrate common knowledge, reasoning, and video memory.

The study’s findings are detailed in a paper available on arXiv (https://arxiv.org/abs/2403.11481) and further information can be found on the project’s homepage (https://videoagent.github.io/). The code for VideoAgent is also available on GitHub (https://github.com/YueFan1014/VideoAgent).

This research is a step forward in long-video understanding, with potential applications in areas such as surveillance, education, and entertainment.

