正文:
随着人工智能技术的飞速发展,视频生成领域近期迎来了新的热潮。自Sora发布以来,AI视频生成领域愈发活跃,即梦、Runway Gen-3、Luma AI以及快手可灵等产品和模型相继推出。这些模型生成视频的表现远超以往,生成结果几乎无法用肉眼识别出是AI生成的。然而,这类模型背后需要庞大且经过精细标注的视频数据集,这无疑增加了研发成本。
为了解决这一问题,苹果的研究团队提出了SlowFast-LLaVA(简称SF-LLaVA)模型。该模型基于字节团队开发的LLaVA-NeXT架构,无需额外微调,即可直接使用。研究团队受到动作识别领域双流网络的成功启发,为视频语言模型设计了一套新颖的SlowFast输入机制。
SF-LLaVA模型通过“慢速眼”和“快速眼”两种不同的观察速度来理解视频中的细节和运动。慢速眼以低帧率提取特征,尽可能多地保留空间细节;快速眼则以高帧率运行,并通过较大的空间池化步长降低分辨率,从而覆盖更长的时间上下文,更专注于理解动作的连贯性。
实验结果显示,SF-LLaVA在多个基准测试中均以显著的优势超越了现有免训练方法。与精心微调的SFT模型相比,SF-LLaVA能达到相同甚至更好的性能。该模型不仅捕捉到详细的空间语义,还能捕捉到更长的时间上下文,解决了现有视频语言模型的痛点。
SF-LLaVA模型的成功标志着AI视频理解领域的新突破,为未来的视频处理和分析提供了新的可能性。随着技术的不断进步,我们有理由相信,AI在视频领域的应用将更加广泛,为人们的生活带来更多便利和惊喜。
英语如下:
Title: “Apple’s New Breakthrough: Video Models Outperform the State of the Art Without Training”
Keywords: Video Model, Apple Innovation, Surpassing SOTA
News Content:
Title: Apple’s New Method Grants Video Models “Slow Vision” and “Fast Vision” to Outperform SOTA
With the rapid development of artificial intelligence, the field of video generation has recently seen a new wave of activity. Since the release of Sora, the AI video generation sector has become even more active, with products and models such as Jimeng, Runway Gen-3, Luma AI, and Kuaishou's Kling being released in quick succession. These models far outperform their predecessors, producing videos that are almost impossible to identify as AI-generated with the naked eye. However, they rely on vast, meticulously annotated video datasets, which inevitably drives up development costs.
To address this issue, Apple's research team proposed SlowFast-LLaVA (SF-LLaVA for short). The model is built on the LLaVA-NeXT architecture developed by the ByteDance team and can be used directly, without any additional fine-tuning. Inspired by the success of two-stream networks in action recognition, the researchers designed a novel SlowFast input mechanism for video language models.
The SF-LLaVA model understands both the details and the motion in a video by observing it at two different speeds, referred to as "slow vision" and "fast vision." The slow pathway extracts features at a low frame rate while preserving as much spatial detail as possible; the fast pathway runs at a high frame rate and reduces spatial resolution with a large pooling stride, covering a longer temporal context and focusing on the continuity of motion.
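To make the two-speed idea concrete, below is a minimal sketch of how slow and fast visual-token streams could be assembled from per-frame features before being fed to the language model. This is only an illustration under stated assumptions: the function name `slowfast_tokens`, the use of average pooling, and the stride/pooling values are hypothetical, not the actual SF-LLaVA implementation.

```python
# Illustrative sketch of a SlowFast-style input design for a video LLM.
# Names and hyperparameters are assumptions, not the paper's exact code.
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_features: torch.Tensor,
                    slow_stride: int = 8,
                    fast_pool: int = 4) -> torch.Tensor:
    """frame_features: (T, H, W, C) visual features for T sampled frames.

    Slow pathway: few frames (every `slow_stride`-th), full spatial detail.
    Fast pathway: all frames, aggressively pooled in space so a long
    temporal context stays affordable.
    """
    T, H, W, C = frame_features.shape

    # Slow pathway: low frame rate, keep every spatial token.
    slow = frame_features[::slow_stride]              # (T/slow_stride, H, W, C)
    slow = slow.reshape(-1, C)                        # flatten to tokens

    # Fast pathway: high frame rate, large spatial pooling stride.
    fast = frame_features.permute(0, 3, 1, 2)         # (T, C, H, W)
    fast = F.avg_pool2d(fast, kernel_size=fast_pool)  # (T, C, H/p, W/p)
    fast = fast.permute(0, 2, 3, 1).reshape(-1, C)    # flatten to tokens

    # Concatenate both token streams as the LLM's visual input.
    return torch.cat([slow, fast], dim=0)

# Example: 32 frames of 24x24 patch features with 1024 channels.
feats = torch.randn(32, 24, 24, 1024)
tokens = slowfast_tokens(feats)
print(tokens.shape)  # far fewer tokens than 32*24*24 at full resolution
```

In this sketch the slow stream keeps every spatial token from a handful of frames, while the fast stream keeps only a coarse spatial summary of every frame, so the combined token budget remains far below feeding all frames at full resolution.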
Experimental results show that SF-LLaVA outperforms existing training-free methods by a clear margin across multiple benchmarks. Compared with carefully fine-tuned (SFT) models, SF-LLaVA achieves comparable or even better performance. The model captures not only detailed spatial semantics but also a longer temporal context, addressing a key pain point of existing video language models.
The success of SF-LLaVA marks a new breakthrough in AI video understanding, opening up new possibilities for future video processing and analysis. As the technology continues to advance, there is good reason to believe that AI applications in the video domain will become more widespread, bringing more convenience and surprises to people's lives.
【来源】https://www.jiqizhixin.com/articles/2024-08-11-6