
Beijing, March 25, 2025 – In a significant breakthrough in the field of multimodal large language models (MLLMs), researchers from Peking University and Alibaba’s Tongyi Wanxiang Lab have unveiled UFO, a novel approach to fine-grained visual perception. This innovative method allows MLLMs to perform precise image segmentation and object detection using a mere 16 tokens, eliminating the need for Segment Anything Model (SAM) or Grounding DINO.

The research, led by first author Hao Tang, a Ph.D. student at Peking University focusing on unified multimodal task modeling algorithms, and supervised by Professor Liwei Wang, a renowned professor at Peking University’s School of Intelligence Science and Technology, promises to revolutionize how MLLMs interact with and understand visual information. Professor Wang’s accolades include a Best Paper Award at NeurIPS 2024, a Distinguished Paper Award at ICLR 2023, and a Distinguished Paper Nomination at ICLR 2024, highlighting the team’s expertise and the potential impact of their work.

The core of UFO lies in its feature retrieval-based segmentation method. Instead of relying on complex models like SAM, UFO reframes the segmentation task as a similarity calculation between token features and image features. This elegant approach significantly reduces computational overhead and allows for highly efficient and accurate segmentation with minimal output.
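
The idea can be illustrated with a minimal sketch. This is not the authors' released code; the function name, feature shapes, and threshold below are assumptions for demonstration. The sketch scores each spatial location of an image feature map by its cosine similarity to a predicted mask-token feature, then thresholds the similarity map into a binary segmentation mask:

```python
import numpy as np

def retrieval_segment(token_feat, image_feats, threshold=0.5):
    """Illustrative feature-retrieval segmentation (hypothetical API).

    token_feat:  (C,)       feature vector of one predicted mask token
    image_feats: (H, W, C)  per-location image features
    Returns a boolean (H, W) mask.
    """
    # L2-normalize both sides so the dot product is a cosine similarity.
    t = token_feat / np.linalg.norm(token_feat)
    f = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    sim = f @ t                 # (H, W) similarity map
    return sim > threshold      # threshold into a binary mask

# Toy example: a 4x4 feature map whose top-left quadrant matches the token.
token = np.array([1.0, 0.0])
feats = np.tile(np.array([0.0, 1.0]), (4, 4, 1))
feats[:2, :2] = token           # the region that should be segmented
mask = retrieval_segment(token, feats)
```

Because the mask is recovered by a similarity lookup rather than by running a dedicated segmentation network such as SAM, the per-object output can stay very small, which is consistent with the paper's claim of using only a handful of tokens per prediction.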

“UFO represents a paradigm shift in how MLLMs can perceive and understand visual information,” explains Tang. “By simplifying the segmentation process and reducing the reliance on external models, we’ve created a more efficient and accessible solution for a wide range of applications.”

The implications of this research are far-reaching. From autonomous driving and robotics to medical image analysis and e-commerce, UFO’s ability to perform fine-grained visual perception with minimal resources opens up new possibilities for MLLM deployment in resource-constrained environments and real-time applications.

The team has made their work publicly available, including the research paper, open-source code, and pre-trained models (see References).

The release of UFO marks a significant step forward in the development of more efficient and versatile MLLMs. The research community is eagerly anticipating further advancements building upon this groundbreaking work. The future of multimodal AI looks brighter than ever, thanks to innovations like UFO.

References:

  • Tang, H., et al. (2025). UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface. arXiv preprint arXiv:2503.01342.

