Shanghai, China – The rise of foundation models such as CLIP, DINO, and SAM has transformed many fields by enabling task unification and fostering the development of multimodal large language models (LLMs). However, because these models are trained with image-level or otherwise weak semantic supervision, they often fall short on fine-grained, dense prediction tasks, particularly when interpreting document images that contain dense text.

Addressing this limitation, a collaborative effort between Shanghai Jiao Tong University (SJTU) and Meituan has achieved a groundbreaking advancement in image-text alignment granularity. The result is TokenFD, a novel foundation model boasting three key advantages:

  • TokenIT: The Industry’s First Token-Level Image-Text Dataset: TokenIT comprises 20 million publicly available images and 1.8 billion high-quality token-mask pairs, where each Byte Pair Encoding (BPE) subword in an image is paired with a pixel-level mask (see the data sketch after this list). The dataset is five times larger than CLIP’s and contains 700 million more data pairs than SAM’s.

  • TokenFD: The First Fine-Grained Unified Image-Text Foundation Model: TokenFD leverages the massive BPE-mask pairs to build a fine-grained foundation model with a simple language encoding layer. Image tokens and language tokens genuinely share the same feature space, enabling token-level image-text interaction and supporting a wide range of downstream tasks (a minimal sketch of this shared-space interaction appears below).

  • Bridging the Modality Gap: TokenFD treats image regions as carriers of textual semantics, enabling token-level modality alignment within large language models for the first time and empowering dense multimodal document understanding tasks.
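To make the token-mask structure concrete, here is a minimal, hypothetical sketch of what a TokenIT-style record could look like. The class and field names are illustrative assumptions for exposition, not the schema of the dataset to be released.

```python
# Hypothetical sketch of a TokenIT-style record: each BPE subword in a document
# image is paired with a pixel-level binary mask. Names and shapes are assumed.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class TokenMaskPair:
    subword: str       # BPE subword string, e.g. "tok"
    token_id: int      # id of the subword in the BPE vocabulary
    mask: np.ndarray   # H x W binary mask marking the subword's pixels


@dataclass
class TokenITSample:
    image: np.ndarray            # H x W x 3 document image
    pairs: List[TokenMaskPair]   # all subword/mask pairs for this image


# Toy example: a 4x4 "image" with one subword occupying the top-left corner.
image = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[:2, :2] = 1
sample = TokenITSample(image=image, pairs=[TokenMaskPair("tok", 1234, mask)])
print(len(sample.pairs), sample.pairs[0].subword)
```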

This new approach significantly narrows the gap between image and text modalities, opening up new possibilities for applications requiring a deep understanding of visual and textual information.
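As a rough illustration of what token-level interaction in a shared feature space could look like (not the authors' actual implementation), the following PyTorch sketch treats a plain embedding table as the language encoding layer and scores dense image features against BPE token embeddings. All shapes, variable names, and the use of cosine similarity here are assumptions.

```python
# Sketch of token-level image-text interaction in a shared feature space,
# assuming a TokenFD-style design where a simple embedding table acts as the
# language encoder. Everything below is illustrative, not the released model.
import torch
import torch.nn.functional as F

vocab_size, dim = 50000, 256
token_embedding = torch.nn.Embedding(vocab_size, dim)  # "language encoding layer"

# Dense image features from a vision backbone: (H*W) spatial positions x dim.
H, W = 32, 32
image_feats = torch.randn(H * W, dim)

# Query a few BPE token ids and score every spatial position against each token.
token_ids = torch.tensor([1234, 567, 89])
text_feats = token_embedding(token_ids)                # 3 x dim

image_feats = F.normalize(image_feats, dim=-1)
text_feats = F.normalize(text_feats, dim=-1)
similarity = image_feats @ text_feats.T                # (H*W) x 3

# Reshape into one heatmap per queried token.
heatmaps = similarity.T.reshape(len(token_ids), H, W)
print(heatmaps.shape)  # torch.Size([3, 32, 32])
```

Thresholding such heatmaps would yield per-subword masks, which is the kind of dense, token-level grounding the TokenIT annotations are designed to supervise.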

The research paper and demo are now available, with associated data, models, and code to be released. This development promises to accelerate progress in areas such as document understanding, visual question answering, and other multimodal tasks.


