Beijing, March 25, 2025 – In a significant breakthrough in the field of multimodal large language models (MLLMs), researchers from Peking University and Alibaba’s Tongyi Wanxiang Lab have unveiled UFO, a novel approach to fine-grained visual perception. This innovative method allows MLLMs to perform precise image segmentation and object detection using a mere 16 tokens, eliminating the need for Segment Anything Model (SAM) or Grounding DINO.
The research was led by first author Hao Tang, a Ph.D. student at Peking University focusing on unified multimodal task modeling algorithms, under the supervision of Liwei Wang, a professor at Peking University’s School of Intelligence Science and Technology, and promises to change how MLLMs interact with and understand visual information. Professor Wang’s accolades include a Best Paper Award at NeurIPS 2024, a Distinguished Paper Award at ICLR 2023, and a Distinguished Paper Nomination at ICLR 2024, highlighting the team’s expertise and the potential impact of this work.
The core of UFO lies in its feature retrieval-based segmentation method. Instead of relying on complex models like SAM, UFO reframes the segmentation task as a similarity calculation between token features and image features. This elegant approach significantly reduces computational overhead and allows for highly efficient and accurate segmentation with minimal output.
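The idea can be illustrated with a minimal sketch. Assuming the model emits an embedding for a predicted mask token and the vision encoder provides a dense grid of image features, segmentation reduces to scoring each spatial location by its similarity to the token and thresholding the result (function names, shapes, and the threshold here are illustrative, not the paper's exact implementation):

```python
import numpy as np

def mask_from_token(token_feat, image_feats, threshold=0.0):
    """Illustrative similarity-based segmentation.

    token_feat:  (C,)      embedding of one predicted mask token (assumed)
    image_feats: (H, W, C) dense image features from the vision encoder (assumed)
    Returns a (H, W) binary mask.
    """
    sims = image_feats @ token_feat             # (H, W) dot-product similarity map
    return (sims > threshold).astype(np.uint8)  # threshold into a binary mask

# Toy example: a 4x4 feature map with 8-dim features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 4, 8))
token = feats[1, 2]                 # pretend the token matches this location
mask = mask_from_token(token, feats)
```

The appeal of this formulation is that it replaces a heavyweight external decoder such as SAM with a single matrix product over features the model already computes, which is where the efficiency gain comes from.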
“UFO represents a paradigm shift in how MLLMs can perceive and understand visual information,” explains Tang. “By simplifying the segmentation process and reducing the reliance on external models, we’ve created a more efficient and accessible solution for a wide range of applications.”
The implications of this research are far-reaching. From autonomous driving and robotics to medical image analysis and e-commerce, UFO’s ability to perform fine-grained visual perception with minimal resources opens up new possibilities for MLLM deployment in resource-constrained environments and real-time applications.
The team has made their work publicly available, including the research paper, open-source code, and pre-trained models:
- Paper: https://arxiv.org/abs/2503.01342
- Code: https://github.com/nnnth/UFO
- Model: https://huggingface.co/kanashi6/UFO
The release of UFO marks a significant step forward in the development of more efficient and versatile MLLMs. The research community is eagerly anticipating further advancements building upon this groundbreaking work. The future of multimodal AI looks brighter than ever, thanks to innovations like UFO.
References:
- Tang, H., et al. (2025). UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface. arXiv preprint arXiv:2503.01342.