
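Introduction: A research team from The Chinese University of Hong Kong, Shenzhen and the Shenzhen Research Institute of Big Data recently proposed LongLLaVA, a multimodal large language model built on a hybrid architecture. While maintaining high throughput and low GPU memory consumption, the model can run inference over a thousand images on a single GPU, which points to broad application prospects in video understanding, high-resolution image understanding, and multimodal agents.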

Main text:

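I. Research Background

With the rapid progress of multimodal large language models (MLLMs), their remarkable capabilities across application domains have been widely recognized. Multi-image understanding, however, remains an important but underexplored scenario. To improve the user experience and broaden the range of MLLM applications, the research team set out to address the challenge of extending MLLMs to longer videos, higher-resolution images, and decisions based on more historical information.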

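II. The LongLLaVA Model

LongLLaVA is a multimodal large model built on a hybrid of the Mamba and Transformer architectures. Its core strengths are:

  1. Hybrid architecture: LongLLaVA combines the strengths of Mamba and Transformer to achieve efficient image representation and long-context processing.

  2. Data construction: The data construction accounts for temporal and spatial dependencies among multiple images, which improves the model's adaptability across tasks.

  3. Training strategy: LongLLaVA uses a progressive training strategy that gradually strengthens the model's ability to handle long multimodal contexts.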
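
To make the memory argument behind the hybrid design concrete, here is a minimal back-of-the-envelope sketch in Python. It is not the authors' implementation. Every number in it (layer count, one attention layer per eight layers, KV-head count, head size, tokens per image) is a hypothetical value chosen for illustration, and the names `HybridConfig`, `layer_schedule`, and `kv_cache_bytes` are made up for this example. The only fact it relies on is that attention layers keep a key/value cache that grows with the number of tokens, whereas Mamba layers keep a fixed-size state, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HybridConfig:
    # All values below are hypothetical, not taken from the article.
    num_layers: int = 32        # total number of layers
    attention_every: int = 8    # one attention layer per this many layers
    num_kv_heads: int = 8       # key/value heads per attention layer
    head_dim: int = 128         # size of each head
    bytes_per_value: int = 2    # 16-bit floats


def layer_schedule(cfg: HybridConfig) -> List[str]:
    """Interleave layers: every `attention_every`-th layer is attention, the rest Mamba."""
    return [
        "attention" if (i + 1) % cfg.attention_every == 0 else "mamba"
        for i in range(cfg.num_layers)
    ]


def kv_cache_bytes(cfg: HybridConfig, num_tokens: int, schedule: List[str]) -> int:
    """KV-cache size: only attention layers store keys and values per token."""
    attention_layers = schedule.count("attention")
    # Factor 2 accounts for storing both keys and values.
    return (2 * attention_layers * num_tokens
            * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_value)


if __name__ == "__main__":
    cfg = HybridConfig()
    num_tokens = 1000 * 144  # 1000 images at a hypothetical 144 tokens each

    hybrid = layer_schedule(cfg)
    full_attention = ["attention"] * cfg.num_layers

    ratio = (kv_cache_bytes(cfg, num_tokens, full_attention)
             / kv_cache_bytes(cfg, num_tokens, hybrid))
    print(f"attention layers in hybrid: {hybrid.count('attention')}")
    print(f"KV-cache reduction vs. full attention: {ratio}")  # 8.0 under these assumed numbers
```

Under these assumed numbers, the hybrid stack needs one eighth of the KV cache of an all-attention stack of the same depth. That is the kind of saving that makes very long image sequences plausible on a single GPU, although the real model's proportions may differ.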

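III. Experimental Results

LongLLaVA achieved competitive results across a range of benchmarks. The specific results reported are:

  1. On the retrieval, counting, and ordering tasks of VNBench, LongLLaVA delivers leading performance.

  2. In a needle-in-a-haystack evaluation over 1,000 images on a single 80GB GPU, LongLLaVA reached nearly 100% accuracy.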
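
For readers unfamiliar with the needle-in-a-haystack protocol, the following toy Python sketch shows its general shape: hide one target item at varying depths in a long sequence of distractors, ask the model to locate it, and report the fraction of correct answers. It is an illustration only, not the team's evaluation code. String placeholders stand in for real images, and `build_haystack`, `niah_accuracy`, and `oracle_model` are hypothetical names introduced here.

```python
from typing import Callable, List, Sequence

# A "model" here is any callable that receives the frame sequence and the
# needle description and returns the index where it believes the needle is.
Model = Callable[[List[str], str], int]


def build_haystack(num_images: int, needle_position: int, needle: str = "needle") -> List[str]:
    """Create a sequence of filler placeholders with the needle at one position."""
    frames = [f"filler_{i}" for i in range(num_images)]
    frames[needle_position] = needle
    return frames


def niah_accuracy(model: Model, num_images: int, depths: Sequence[float],
                  needle: str = "needle") -> float:
    """Fraction of depths (0.0 = start, 1.0 = end) at which the model finds the needle."""
    correct = 0
    for depth in depths:
        # Clamp so that depth 1.0 maps to the last valid index.
        position = min(int(depth * num_images), num_images - 1)
        frames = build_haystack(num_images, position, needle)
        if model(frames, needle) == position:
            correct += 1
    return correct / len(depths)


def oracle_model(frames: List[str], needle: str) -> int:
    """Stand-in for a real multimodal model: it simply looks the needle up."""
    return frames.index(needle)


if __name__ == "__main__":
    depths = [0.0, 0.25, 0.5, 0.75, 1.0]
    print(niah_accuracy(oracle_model, num_images=1000, depths=depths))  # 1.0
```

In a real run, `oracle_model` would be replaced by a call to the multimodal model under test, and the answer would usually be checked against a question about the needle image rather than against an index.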

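IV. Open Source and Community Development

To promote reproducible research and community development, the team will open-source all models, code, and datasets related to LongLLaVA. Project page: https://github.com/FreedomIntelligence/LongLLaVA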

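Conclusion:

As the first hybrid-architecture multimodal large model, LongLLaVA runs inference over a thousand images on a single GPU while maintaining high throughput and low GPU memory consumption, and it shows broad application prospects in video understanding, high-resolution image understanding, and multimodal agents. As the model continues to be optimized and adopted, it is expected to play an increasingly important role in practical applications.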

