## NVIDIA Unveils "LongVILA": A New Breakthrough in Long Video Understanding with Near-100% Accuracy

**Machine Intelligence Report**

**Keywords:** LongVILA, long video, VLM

In recent years, long-context vision-language models (VLMs) have shown great potential for handling complex inputs such as long documents and long videos. Existing approaches, however, tend to rely on simplified techniques and lack a comprehensive solution. To address this, NVIDIA, together with researchers from MIT, UC Berkeley, and the University of Texas at Austin, has introduced **LongVILA**, a full-stack solution for training and deploying long-context VLMs that covers system design, model training strategy, and dataset construction.

At the core of LongVILA is its **Multi-Modal Sequence Parallelism (MM-SP)** system. The framework is built for the memory-intensive training of long-context VLMs and provides an efficient, user-friendly training environment. Through a five-stage training pipeline, LongVILA combines the model's multimodal understanding with long-context capability and handles long video data effectively.
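
The article does not describe how MM-SP is implemented, but the basic idea behind sequence parallelism is easy to illustrate: one very long multimodal token sequence is sharded across GPUs so that no single device has to hold the activations and KV cache for the full context. The sketch below shows only that sharding step, assuming a PyTorch setting; the function name, the even contiguous split, and the 196-tokens-per-frame figure are assumptions for illustration, and the distributed attention/communication that real sequence parallelism requires is deliberately left out.

```python
# Minimal sketch of sequence-parallel sharding for one long multimodal
# token sequence. Assumptions (not from the article): an even contiguous
# split, ~196 visual tokens per frame, and a toy embedding tensor. Real
# sequence parallelism also needs distributed attention (e.g. ring-style
# exchange of KV blocks between ranks), which is omitted here.
import torch


def shard_sequence(tokens: torch.Tensor, world_size: int) -> list[torch.Tensor]:
    """Split a (seq_len, hidden) tensor into `world_size` contiguous shards.

    Each shard would live on a different GPU, so per-device activation and
    KV-cache memory scales with seq_len / world_size instead of seq_len.
    """
    seq_len, hidden = tokens.shape
    pad = (-seq_len) % world_size          # pad so the split is even
    if pad:
        tokens = torch.cat([tokens, tokens.new_zeros(pad, hidden)], dim=0)
    return list(tokens.chunk(world_size, dim=0))


if __name__ == "__main__":
    frames, tokens_per_frame, hidden = 1024, 196, 4096   # 196 is an assumption
    video_tokens = torch.randn(frames * tokens_per_frame, hidden)
    shards = shard_sequence(video_tokens, world_size=8)
    print([s.shape[0] for s in shards])   # ~25k tokens per rank vs. ~200k total
```

Under this kind of 8-way split, each rank holds roughly 25k of the roughly 200k video tokens in the toy example, which is what makes training on 1,024-frame inputs memory-feasible; the trade-off is the extra communication needed so attention can still span shard boundaries.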

**Key features of LongVILA include:**

* **1,024-frame videos with near-100% accuracy:** LongVILA reaches 99.5% accuracy on a 1,400-frame "needle in a haystack" test, which corresponds to a context length of about 274k tokens (a quick sanity check of these numbers follows this list).
* **An efficient MM-SP framework:** The framework tackles the memory cost of the KV cache and, through its optimization strategies, delivers a 2.1x to 5.7x speedup, substantially improving training efficiency.
* **A full-stack solution:** LongVILA integrates system design, model training, and dataset development into a single, complete solution for long-context VLMs.
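
As a quick sanity check on the figures in the first bullet, context length scales linearly with the number of frames once each frame is encoded into a fixed number of visual tokens. The per-frame figure below (~196 tokens) is inferred from the reported 1,400 frames and 274k tokens, not stated explicitly in the article:

```python
# Back-of-envelope check of the context length reported for the
# needle-in-a-haystack experiment: 274k tokens over 1,400 frames.
frames = 1400
context_tokens = 274_000

# Implied visual tokens per frame (an inference from the reported figures,
# not a number stated in this article).
tokens_per_frame = context_tokens / frames
print(f"~{tokens_per_frame:.0f} visual tokens per frame")        # ~196

# At the same rate, the 1,024-frame setting advertised above needs about 200k:
print(f"~{round(1024 * tokens_per_frame):,} tokens of context")
```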

**LongVILA has broad application prospects:**

* **Long video understanding:** LongVILA can understand long video content more accurately and generate more detailed captions and summaries.
* **Multimodal search:** LongVILA can power multimodal search, for example retrieving relevant information based on video content.
* **Virtual assistants:** LongVILA can give virtual assistants stronger multimodal understanding, helping them better grasp user intent.

**LongVILA marks a major step forward for long-context VLM technology and opens up new opportunities for the AI field.** The researchers say they will continue to optimize LongVILA so that it can handle longer and more complex video data and support a wider range of applications.

**Paper:** [https://arxiv.org/pdf/2408.10188](https://arxiv.org/pdf/2408.10188)

**Code:** [https://github.com/NVlabs/VILA/blob/main/LongVILA.md](https://github.com/NVlabs/VILA/blob/main/LongVILA.md)

Source: https://www.jiqizhixin.com/articles/2024-08-21-5
