
Title: “Apple’s New Breakthrough: Video Models Triumph Over State-of-the-Art Without Training”

Keywords: Video Model, Apple Innovation, Surpassing SOTA

News Content:
Title: Apple’s New Method Grants Video Models “Slow Vision” and “Fast Vision” to Outperform SOTA

With the rapid development of artificial intelligence, the field of video generation has recently seen a new wave of activity. Since the release of Sora, the sector has become even more competitive, with companies such as Jimeng, Runway (Gen-3), Luma AI, and Kuaishou's Kling all releasing their own video generation models. These models far outperform previous generations, producing videos that are nearly impossible to identify as AI-generated with the naked eye. However, they depend on vast, meticulously annotated video datasets, which inevitably drives up development costs.

To address this issue, Apple's research team proposed the SlowFast-LLaVA (SF-LLaVA) model. Built on the LLaVA-NeXT architecture, it can be used directly, without any additional fine-tuning. Inspired by the success of two-stream networks in action recognition, the research team designed a novel SlowFast input mechanism for video language models.

The SF-LLaVA model understands both the details and the motion in a video by observing it at two different speeds, dubbed "slow vision" and "fast vision." The slow pathway extracts features at a low frame rate while retaining as much spatial detail as possible; the fast pathway runs at a high frame rate but downsamples each frame with a large spatial pooling stride, trading resolution for a longer temporal context and focusing on the continuity of actions.
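The two-pathway input design described above can be sketched in a few lines. This is a minimal illustration only, not the paper's implementation: the function name, frame stride, pooling factor, and tensor shapes are all assumptions chosen to make the idea concrete, operating on precomputed per-frame patch features.

```python
import numpy as np

def slowfast_tokens(video, slow_stride=8, fast_pool=4):
    """Illustrative sketch of a SlowFast-style input.

    video: per-frame patch features of shape (T, H, W, C), e.g. from
    a frozen vision encoder (shapes here are hypothetical).
    """
    T, H, W, C = video.shape

    # Slow pathway: low frame rate, full spatial grid
    # (keeps spatial detail for a few frames).
    slow = video[::slow_stride]            # (T // slow_stride, H, W, C)
    slow_tokens = slow.reshape(-1, C)

    # Fast pathway: every frame, but coarse spatial average pooling
    # (cheap tokens that cover the whole temporal extent).
    Hp, Wp = H // fast_pool, W // fast_pool
    fast = video.reshape(T, Hp, fast_pool, Wp, fast_pool, C).mean(axis=(2, 4))
    fast_tokens = fast.reshape(-1, C)      # (T * Hp * Wp, C)

    # Both token streams are concatenated and fed to the LLM.
    return np.concatenate([slow_tokens, fast_tokens], axis=0)

# 32 frames of 24x24 patch features: 4 detailed frames + 32 pooled ones.
tokens = slowfast_tokens(np.zeros((32, 24, 24, 768)))
```

The point of the design is visible in the token counts: the slow stream contributes a few high-resolution frames, while the fast stream contributes many low-resolution ones, so total token count stays bounded while both spatial detail and temporal coverage are preserved.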

Experimental results show that SF-LLaVA outperforms existing training-free methods by a significant margin across multiple benchmarks. Compared with carefully fine-tuned (SFT) models, SF-LLaVA matches or exceeds their performance. The model captures both detailed spatial semantics and a longer temporal context, addressing a key pain point of existing video language models.

The success of SF-LLaVA marks a new step forward for video language models, opening up new possibilities for future video processing and analysis. As the technology continues to advance, there is reason to believe that AI applications in video will become more widespread, bringing more convenience and delight to people's lives.

【来源】https://www.jiqizhixin.com/articles/2024-08-11-6
