## Zhejiang University's Li Xi Team Proposes ScanFormer: Iteratively Eliminating Visual Redundancy for More Efficient Referring Expression Comprehension
**Machine Intelligence Report**, August 20, 2024
A research team led by Professor Li Xi at Zhejiang University has recently made a breakthrough in Referring Expression Comprehension (REC), proposing a new method called ScanFormer. Through a coarse-to-fine iterative perception framework, ScanFormer effectively eliminates visual redundancy and improves both the efficiency and the accuracy of REC. The work has been published on the preprint platform arXiv.
Referring expression comprehension is a fundamental vision-language task: given a natural-language description, the model must localize the referred object in an image. Traditional REC models typically rely on pre-trained feature extractors such as ResNet, Swin Transformer, or ViT, which compute features at every spatial location of the image, so the computational cost grows rapidly with image resolution.
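The scale of that cost is easy to see with a back-of-the-envelope calculation. The snippet below is purely illustrative and not from the paper; the 16-pixel patch size is an assumed, typical ViT setting.

```python
# Back-of-the-envelope illustration (not from the paper): for a plain ViT,
# the patch count grows quadratically with image side length, and the
# self-attention cost grows quadratically with the patch count.
def patch_count(side_px: int, patch_px: int = 16) -> int:
    """Number of non-overlapping patches for a square image."""
    return (side_px // patch_px) ** 2

for side in (224, 448, 896):
    n = patch_count(side)
    print(f"{side}px -> {n} patches, ~{n * n:,} attention pairs")
```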
Professor Li Xi's team observed that images contain large low-information background regions as well as regions irrelevant to the referring expression; extracting features from them adds computation without contributing useful information. The team therefore proposed a more efficient solution: ScanFormer.
ScanFormer adopts a coarse-to-fine iterative perception framework that scans an image pyramid level by level. Starting from a low-resolution, coarse-scale image, it progressively filters out background regions irrelevant to the referring expression, allowing the model to focus on the foreground and task-relevant regions.
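The following is a minimal, self-contained sketch of what such a coarse-to-fine scan can look like. The encoder, selection head, threshold, and the 2×2 parent-child layout are illustrative placeholders, not ScanFormer's actual implementation.

```python
# Hedged sketch of a coarse-to-fine scan over an image pyramid, assuming
# each patch owns a 2x2 block of children at the next scale. Placeholder
# linear layers stand in for the real encoder; only the control flow
# (encode kept patches, expand only promising ones) is illustrated.
import torch
import torch.nn as nn

dim = 64
encoder = nn.Linear(dim, dim)        # stand-in for the transformer encoder
select_head = nn.Linear(dim, 1)      # scores whether a patch is worth refining

def scan(pyramid, threshold=0.5):
    """pyramid: list of (num_patches, dim) tensors ordered coarse -> fine,
    where patch i at level k owns patches 4*i .. 4*i+3 at level k+1."""
    keep = torch.ones(pyramid[0].shape[0], dtype=torch.bool)  # scan every coarse patch
    kept_features = []
    for level, patches in enumerate(pyramid):
        idx = keep.nonzero(as_tuple=True)[0]
        feats = encoder(patches[idx])            # encode only the kept patches
        kept_features.append(feats)
        if level + 1 == len(pyramid):
            break
        scores = torch.sigmoid(select_head(feats)).squeeze(-1)
        expand = idx[scores > threshold]         # coarse patches worth refining
        keep = torch.zeros(pyramid[level + 1].shape[0], dtype=torch.bool)
        for i in expand:                         # mark the 2x2 children of each kept patch
            keep[4 * i : 4 * i + 4] = True
    return kept_features

# Toy pyramid: 4 coarse patches, 16 mid-scale, 64 fine-scale.
pyramid = [torch.randn(n, dim) for n in (4, 16, 64)]
feats = scan(pyramid)
print([f.shape[0] for f in feats])   # patches actually encoded at each scale
```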
Specifically, ScanFormer splits the ViLT model along the depth dimension into two parts, Encoder1 and Encoder2. Encoder1 predicts, for each patch, which of its finer-grained patches at the next scale should be selected, while Encoder2 extracts features and predicts the bounding box.
As the scale increases, finer-grained features are introduced and localization becomes more accurate, while most irrelevant patches are discarded, saving a large amount of computation. In addition, patches within each scale attend to one another with bidirectional attention, which further reduces the computational cost.
The key innovation of ScanFormer is its dynamic patch selection mechanism, which decides whether each patch is selected according to selection factors produced at the previous scale, effectively concentrating the model's attention on the critical regions.
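The sketch below shows one plausible way the two-encoder split and the per-patch selection factors described above could fit together. The names Encoder1 and Encoder2 follow the article, but the module internals and heads are generic placeholders rather than the paper's ViLT-based implementation.

```python
# Hedged sketch of the described split: a shallow Encoder1 produces a
# selection factor per patch (used to pick next-scale patches), and a
# deeper Encoder2 fuses text with the retained patches and regresses the
# bounding box. All layers are placeholders, not ScanFormer's actual ones.
import torch
import torch.nn as nn

class TwoStageREC(nn.Module):
    def __init__(self, dim=64, depth1=2, depth2=2, nhead=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.encoder1 = nn.TransformerEncoder(layer(), depth1)  # shallow half: patch selection
        self.encoder2 = nn.TransformerEncoder(layer(), depth2)  # deep half: fusion + localization
        self.select_head = nn.Linear(dim, 1)   # selection factor per patch
        self.bbox_head = nn.Linear(dim, 4)     # (cx, cy, w, h), normalized

    def forward(self, text_tokens, patch_tokens):
        # Encoder1: joint attention over text and current-scale patches,
        # yielding a selection factor for each patch's next-scale children.
        x = torch.cat([text_tokens, patch_tokens], dim=1)
        h1 = self.encoder1(x)
        patch_h = h1[:, text_tokens.shape[1]:]
        select_factor = torch.sigmoid(self.select_head(patch_h)).squeeze(-1)

        # Encoder2: deeper fusion, then bbox regression from the first text token.
        h2 = self.encoder2(h1)
        bbox = self.bbox_head(h2[:, 0]).sigmoid()
        return select_factor, bbox

model = TwoStageREC()
text = torch.randn(1, 8, 64)      # 8 language tokens (toy)
patches = torch.randn(1, 16, 64)  # 16 patches at the current scale (toy)
factors, box = model(text, patches)
print(factors.shape, box.shape)   # torch.Size([1, 16]) torch.Size([1, 4])
```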
The work delivers notable gains not only in efficiency but also in accuracy: experiments show that ScanFormer achieves state-of-the-art performance on multiple public datasets.
This study offers a new perspective on referring expression comprehension and provides a useful reference for other vision-language tasks. Going forward, the team plans to continue exploring more efficient and accurate vision-language models to advance artificial intelligence.
**Paper title:** ScanFormer: Referring Expression Comprehension by Iteratively Scanning
**Paper link:** https://arxiv.org/pdf/2406.18048
[Source] https://www.jiqizhixin.com/articles/2024-08-20-2