
The landscape of artificial intelligence has seen another groundbreaking development with the launch of the MiniCPM-V 2.6 model, a Chinese-developed edge-side AI model that sets new standards in image, multi-image, and video understanding. The new model, developed by ModelBest (面壁智能) and the OpenBMB team, outperforms the renowned GPT-4V in comprehensive multimodal performance, marking a major leap forward in edge computing capabilities.

A Leap in Edge-Side AI Performance

The MiniCPM-V 2.6 model, introduced on August 6, has 8 billion parameters and achieves state-of-the-art (SOTA) performance among models below 20B parameters. Notably, it is the first edge-side model to surpass GPT-4V in core multimodal capabilities, matching Gemini 1.5 Pro and GPT-4o mini in single-image understanding.

According to the developers, the model, after int4 quantization, can run within 6 GB of memory on edge devices, with an inference speed of up to 18 tokens/s, a 33% increase over its predecessor. The model supports multiple languages and works with the llama.cpp, ollama, and vLLM inference frameworks right out of the box.
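A back-of-the-envelope calculation shows why an 8B-parameter model can fit a 6 GB budget after int4 quantization. This is a sketch with assumed figures only; it counts weights alone and ignores runtime overhead such as activations and the KV cache:

```python
# Rough memory estimate for int4-quantized weights.
# Assumptions: 8B parameters, 4 bits (0.5 bytes) per weight;
# activations and KV cache are excluded from the estimate.
PARAMS = 8_000_000_000
BYTES_PER_PARAM = 0.5  # int4

weight_gib = PARAMS * BYTES_PER_PARAM / 2**30
print(f"int4 weights: ~{weight_gib:.2f} GiB")
```

The weights alone come in under 4 GiB, which leaves headroom within the reported 6 GB envelope for the KV cache and other runtime state.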

Enhanced Multimodal Capabilities

The MiniCPM-V 2.6 model introduces several new features to edge-side computing, including real-time video understanding, multi-image joint understanding, visual analogy learning through multi-image in-context learning (ICL), and multi-image OCR capabilities. These features enable the model to leverage the rich sensor data available on edge devices, bringing enhanced functionality closer to the user.

For instance, the model can understand text captured by the camera while recording video, quickly identify amounts from multiple receipt photos, and even comprehend single or multiple meme images. This level of functionality is a significant step forward in making AI more accessible and user-friendly.

Token Density and Efficiency

One of the standout features of MiniCPM-V 2.6 is its high token density, which is double that of GPT-4o. Token density is a measure of the efficiency of a model, calculated as the number of encoded pixels per visual token. This high token density translates to higher operational efficiency, making the MiniCPM-V 2.6 one of the most efficient multimodal models available.
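The definition above can be computed directly. The image size and token count below are assumed example values for illustration, not published figures for any specific model:

```python
def token_density(num_pixels: int, num_visual_tokens: int) -> float:
    """Encoded pixels per visual token; higher means a more compact visual encoding."""
    return num_pixels / num_visual_tokens

# Hypothetical example: a 1344x1344 image encoded into 640 visual tokens
print(round(token_density(1344 * 1344, 640)), "pixels per visual token")
```

Because inference cost scales with the number of visual tokens, doubling the token density roughly halves the visual-token budget needed for the same image resolution.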

Benchmark Performance

The MiniCPM-V 2.6 has achieved remarkable results on various benchmark platforms. On OpenCompass, it has outperformed Gemini 1.5 Pro and GPT-4o mini in single-image understanding. In the Mantis-Eval leaderboard for multi-image joint understanding, it has achieved SOTA status among open-source models, surpassing GPT-4V. On the Video-MME leaderboard, it has achieved SOTA in video understanding on the edge side.

Additionally, the model has achieved SOTA performance in OCR on the OCRBench, continuing the tradition of the MiniCPM series as the most powerful edge-side OCR model. Its low hallucination rate on the Object HalBench also positions it ahead of many commercial models, including GPT-4o, GPT-4V, and Claude 3.5 Sonnet.

Real-Time Video Understanding on the Edge

One of the most exciting aspects of the MiniCPM-V 2.6 model is its real-time video understanding capability. This feature is particularly valuable for edge devices such as smartphones, PCs, AR devices, robots, and smart cars, which have built-in cameras capable of capturing rich multimodal input.

The model can accurately identify text in the scenes captured by the camera in real-time and can even summarize key information from long videos quickly. For example, its video OCR function can recognize dense text from a 48-second weather forecast video without any audio input.
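Long-video understanding on constrained hardware typically relies on subsampling frames before they reach the model. A minimal sketch of uniform frame selection follows; the frame cap and fps are assumed values, not MiniCPM-V 2.6's actual pipeline:

```python
def sample_frame_indices(duration_s: float, fps: float, max_frames: int) -> list[int]:
    """Select up to max_frames frame indices spread evenly across a clip."""
    total = int(duration_s * fps)
    if total <= max_frames:
        return list(range(total))
    step = total / max_frames
    return [int(i * step) for i in range(max_frames)]

# e.g. a 48-second clip at 30 fps, capped at 64 frames for the model
indices = sample_frame_indices(48, 30, 64)
print(len(indices), indices[:4])
```

Uniform sampling keeps coverage of the whole clip, which matters for tasks like summarizing a long video rather than reacting to a single moment.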

Conclusion

The MiniCPM-V 2.6 model represents a significant advancement in edge-side AI capabilities. Its ability to surpass GPT-4V in key multimodal tasks and its enhanced efficiency and functionality make it a standout in the field of AI. As the demand for edge computing continues to grow, models like MiniCPM-V 2.6 will play a crucial role in shaping the future of AI.

