The landscape of artificial intelligence has seen yet another notable development with the launch of MiniCPM-V 2.6, a Chinese-developed edge-side AI model that sets new standards in image, multi-image, and video understanding. The model, developed by ModelBest (面壁智能) together with the OpenBMB open-source community, has achieved significant milestones, outperforming the renowned GPT-4V in comprehensive performance and marking a major leap forward in edge computing capabilities.

A Leap in Edge-Side AI Performance

The MiniCPM-V 2.6 model, introduced on August 6, has 8 billion parameters and achieves state-of-the-art (SOTA) performance among models under 20B parameters. Notably, it is the first edge-side model to surpass GPT-4V in core multimodal capabilities, matching Gemini 1.5 Pro and GPT-4o mini in single-image understanding.

According to the developers, after int4 quantization the model can run within 6 GB of memory on edge devices, with an inference speed of up to 18 tokens/s, a 33% increase over its predecessor. The model supports multiple languages and is compatible with the llama.cpp, ollama, and vLLM inference frameworks out of the box.
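A back-of-the-envelope calculation shows why the 6 GB figure is plausible: int4 quantization stores each weight in roughly 4 bits, so the 8B parameters alone occupy about 3.7 GB, leaving headroom for the KV cache and activations. This is an illustrative sketch, not an official memory breakdown.

```python
# Rough memory estimate for an int4-quantized 8B-parameter model.
# Illustrative only: ignores quantization metadata, KV cache, and activations.
params = 8e9            # 8 billion parameters
bytes_per_param = 0.5   # 4-bit (int4) weights = half a byte each
weight_gb = params * bytes_per_param / 1024**3
print(f"quantized weights: {weight_gb:.2f} GB")  # ~3.73 GB

# The quantized weights fit comfortably inside a 6 GB edge-device budget.
assert weight_gb < 6.0
```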

Enhanced Multimodal Capabilities

The MiniCPM-V 2.6 model introduces several new features to edge-side computing, including real-time video understanding, multi-image joint understanding, visual analogy learning through multi-image ICL, and multi-image OCR capabilities. These features enable the model to leverage the rich AI sensor data available on edge devices, bringing enhanced functionality closer to the user.

For instance, the model can understand text captured by the camera while recording video, quickly identify amounts from multiple receipt photos, and even comprehend single or multiple meme images. This level of functionality is a significant step forward in making AI more accessible and user-friendly.

Token Density and Efficiency

One of the standout features of MiniCPM-V 2.6 is its high token density, reported to be double that of GPT-4o. Token density measures a multimodal model's encoding efficiency: the number of image pixels represented per visual token. A higher token density means fewer tokens are needed per image, which translates directly into lower compute cost and faster inference, making MiniCPM-V 2.6 one of the most efficient multimodal models available.

Benchmark Performance

The MiniCPM-V 2.6 has achieved remarkable results on various benchmark platforms. On OpenCompass, it has outperformed Gemini 1.5 Pro and GPT-4o mini in single-image understanding. In the Mantis-Eval leaderboard for multi-image joint understanding, it has achieved SOTA status among open-source models, surpassing GPT-4V. On the Video-MME leaderboard, it has achieved SOTA in video understanding on the edge side.

Additionally, the model has achieved SOTA performance in OCR on the OCRBench, continuing the tradition of the MiniCPM series as the most powerful edge-side OCR model. Its low hallucination rate on the Object HalBench also positions it ahead of many commercial models, including GPT-4o, GPT-4V, and Claude 3.5 Sonnet.

Real-Time Video Understanding on the Edge

One of the most exciting aspects of the MiniCPM-V 2.6 model is its real-time video understanding capability. This feature is particularly valuable for edge devices such as smartphones, PCs, AR devices, robots, and smart cars, which have built-in cameras capable of capturing rich multimodal input.

The model can accurately identify text in the scenes captured by the camera in real-time and can even summarize key information from long videos quickly. For example, its video OCR function can recognize dense text from a 48-second weather forecast video without any audio input.

Conclusion

The MiniCPM-V 2.6 model represents a significant advancement in edge-side AI capabilities. Its ability to surpass GPT-4V in key multimodal tasks and its enhanced efficiency and functionality make it a standout in the field of AI. As the demand for edge computing continues to grow, models like MiniCPM-V 2.6 will play a crucial role in shaping the future of AI.

