## Zhejiang University's Li Xi Team Proposes ScanFormer: A New Referring Expression Comprehension Method that Eliminates Visual Redundancy via Coarse-to-Fine Iteration
**Report from 机器之心 (Machine Heart)**
In recent years, Referring Expression Comprehension (REC) has attracted wide attention as a fundamental vision-language task. The goal of REC is to localize, in an image, the object referred to by a natural-language description. Existing REC models typically consist of three parts: a visual encoder, a text encoder, and a cross-modal interaction module. However, most current research focuses on designing efficient cross-modal interaction modules to improve accuracy, while the visual encoder remains relatively underexplored.
To address this, Professor Li Xi's team at Zhejiang University proposes ScanFormer, a method that eliminates visual redundancy through a coarse-to-fine iterative perception framework, improving both the efficiency and the accuracy of referring expression comprehension.
The core idea of ScanFormer is to exploit an image pyramid: starting from a low-resolution, coarse-scale image, it progressively filters out background regions that are irrelevant to the referring expression, so that the model focuses on the foreground and task-relevant regions. Concretely, the method adopts ViLT, which unifies the text and visual modalities, and splits it along the depth dimension into two parts, Encoder1 and Encoder2.
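As a purely illustrative sketch of the pyramid input, the snippet below builds a coarse-to-fine image pyramid and splits each level into flattened patch tokens. The number of scales, the patch size, and the helper names are assumptions made for illustration; they do not come from the paper or its released code.

```python
import torch
import torch.nn.functional as F

def build_pyramid(image, num_scales=3):
    """Return images from coarsest to finest; `image` is a (B, C, H, W) tensor."""
    return [F.interpolate(image, scale_factor=0.5 ** s, mode="bilinear",
                          align_corners=False)
            for s in reversed(range(num_scales))]  # [0] = coarsest, [-1] = original

def patchify(image, patch_size=16):
    """Split a (B, C, H, W) image into (B, N, C * P * P) flattened patch tokens."""
    patches = F.unfold(image, kernel_size=patch_size, stride=patch_size)  # (B, C*P*P, N)
    return patches.transpose(1, 2)

# Usage (hypothetical): tokens_per_scale = [patchify(level) for level in build_pyramid(img)]
```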
Encoder1 predicts, for each patch, which of its fine-grained patches at the next scale should be selected, while Encoder2 extracts features and predicts the bounding box at the current scale. As the scale increases, fine-grained features are brought in and the location prediction becomes more accurate, while most irrelevant patches are discarded, saving a large amount of computation.
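To make the Encoder1/Encoder2 division more tangible, here is a minimal PyTorch sketch of a single scale of scanning. The layer choices, the `select_head` and `box_head` heads, the 0.5 selection threshold, and the hypothetical `expand_selected` helper are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ScanStep(nn.Module):
    """Illustrative single-scale step: Encoder1 scores which next-scale patches to
    select, Encoder2 refines the features and regresses a box at the current scale."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.encoder1 = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder2 = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.select_head = nn.Linear(dim, 1)  # per-patch score: expand at next scale?
        self.box_head = nn.Linear(dim, 4)     # (cx, cy, w, h) for the current scale

    def forward(self, text_tokens, patch_tokens):
        # Encoder1 fuses text and current-scale patch tokens, then scores each patch.
        x = self.encoder1(torch.cat([text_tokens, patch_tokens], dim=1))
        patch_feats = x[:, text_tokens.size(1):]
        select_prob = torch.sigmoid(self.select_head(patch_feats)).squeeze(-1)

        # Encoder2 refines the fused sequence; the box is read from the first
        # text token, used here as a [CLS]-like summary (an assumption).
        y = self.encoder2(x)
        box = self.box_head(y[:, 0])
        return box, select_prob

# Hypothetical coarse-to-fine loop over pyramid levels:
#   box, prob = step(text_tokens, patch_tokens)
#   keep = prob > 0.5                      # only selected patches are refined
#   patch_tokens = expand_selected(keep)   # hypothetical helper for the next scale
```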
In addition, ScanFormer introduces a causal attention mechanism between scales, which further reduces the computational cost. Patches within each scale attend to one another bidirectionally, while also attending to all patches of the preceding scales and to the text features. This mechanism effectively exploits information across scales and improves the model's comprehension ability.
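This inter-scale causal attention can be pictured as a block-structured attention mask. The sketch below shows one plausible layout (text tokens followed by patch tokens ordered from coarse to fine); the exact sequence layout and how the text tokens themselves attend are assumptions, not details taken from the paper.

```python
import torch

def interscale_causal_mask(num_text_tokens, patches_per_scale):
    """Boolean attention mask over the sequence [text | scale_0 | scale_1 | ...].

    True means "query may attend to key". Patches attend bidirectionally within
    their own scale and causally to the text and to all earlier (coarser) scales;
    text-to-text attention stays within the text block (a simplifying assumption).
    """
    sizes = [num_text_tokens] + list(patches_per_scale)
    starts = [0]
    for s in sizes:
        starts.append(starts[-1] + s)

    mask = torch.zeros(starts[-1], starts[-1], dtype=torch.bool)
    for i in range(len(sizes)):
        # block i sees every block up to and including itself
        mask[starts[i]:starts[i + 1], :starts[i + 1]] = True
    return mask

# Example: 4 text tokens, then 4 coarse and 16 fine patches.
# interscale_causal_mask(4, [4, 16]).shape  # torch.Size([24, 24])
```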
All authors of the paper are from Professor Li Xi's team at Zhejiang University. The first author is PhD student Wei Su, and the corresponding author is Professor Li Xi (IET Fellow, National Distinguished Young Scholar). In recent years, the team has published more than 180 CV/AIGC papers in leading international journals (e.g., TPAMI, IJCV) and top conferences (e.g., ICCV, CVPR, ECCV), and collaborates extensively with well-known universities and research institutes in China and abroad.
ScanFormer offers a new perspective for research on referring expression comprehension; its efficient visual-encoder design and coarse-to-fine iterative perception framework provide a valuable reference for future work.
**Paper Title:** ScanFormer: Referring Expression Comprehension by Iteratively Scanning
**Paper Link:** https://arxiv.org/pdf/2406.18048
The English version follows:
## ZJU Li Xi Team Breaks Through Bottleneck in Visual Language Understanding with ScanFormer
**Keywords:** Zhejiang University, ScanFormer, Referring Expression Comprehension
**News Content:**
**Machine Intelligence Report**
In recent years, Referring Expression Comprehension (REC) has attracted significant attention as a fundamental visual language task. REC aims to locate the referred object in an image based on a natural language description. Existing REC models typically consist of three parts: a visual encoder, a text encoder, and a cross-modal interaction module. However, current research mainly focuses on designing efficient cross-modal interaction modules to improve task accuracy, while exploration of visual encoders remains relatively limited.
Addressing this issue, Professor Li Xi's team at Zhejiang University has proposed a novel method called ScanFormer. This method effectively eliminates visual redundancy through a coarse-to-fine iterative perception framework, enhancing the efficiency and accuracy of referring expression understanding.
The core idea of ScanFormer is to leverage image pyramids, starting from a low-resolution coarse-scale image and gradually filtering out background regions irrelevant to the referring expression, allowing the model to focus more on the foreground and task-related regions. Specifically, the method employs the ViLT model, which unifies text and visual modalities, and divides it into two parts along the depth dimension: Encoder1 and Encoder2.
Encoder1 is responsible for predicting the selection of fine-grained patches at the next scale for each patch, while Encoder2 extracts features and predicts the bounding box at the current scale. As the scale increases, fine-grained features are introduced, making location prediction more accurate. Simultaneously, most irrelevant patches are discarded, saving significant computational resources.
Furthermore, ScanFormer introduces a causal attention mechanism between scales, further reducing computational demands. Patches within each scale have bidirectional attention, while also attending to all patches from the preceding scales and to the text features. This mechanism effectively leverages information from different scales, enhancing the model's comprehension ability.
The authors of this paper are all from Professor Li Xi's team at Zhejiang University. The first author is PhD student Wei Su, and the corresponding author is Professor Li Xi (IET Fellow, National Distinguished Young Scholar). In recent years, Professor Li Xi's team has published over 180 research papers related to CV/AIGC in prestigious international journals (e.g., TPAMI, IJCV) and top academic conferences (e.g., ICCV, CVPR, ECCV), and has engaged in extensive collaborations with renowned universities and research institutions worldwide.
The introduction of ScanFormer provides a new perspective for research in the field of referring expression understanding. Its efficient visual encoder design and coarse-to-fine iterative perception framework offer valuable reference for future research.
**Paper Title:** ScanFormer: Referring Expression Comprehension by Iteratively Scanning
**Paper Link:** https://arxiv.org/pdf/2406.18048
**Source:** https://www.jiqizhixin.com/articles/2024-08-20-2