What You See Is What You Get: Multimodal RAG Is Coming to Us
By: Zhang Yingfeng
2024 can be considered a year of explosive growth for multimodal large language models. The release of GPT-4o in May brought these models further into public view. If in 2023 multimodal applications were still stuck at traditional, simple image search, then in 2024 they truly began to understand multimodal data.
The image below shows representative multimodal large language models that emerged in 2024, both commercial and open-source. As it shows, significant progress was made in 2024 in the ability to understand images.
[Image: A table or chart showing the various multimodal large language models released in 2024, with their key features and capabilities.]
With this progress, will multimodal RAG also begin to be implemented and generate value? Let’s first look at some of the use cases for multimodal RAG.
The concept of multimodal RAG is not new. Shortly after RAG became popular in 2023, multimodal RAG scenarios were already being described, for example searching personal photo albums or corporate promotional materials. However, these scenarios mostly folded existing vector-search use cases, such as image search and reverse image search, under the multimodal RAG label, without truly exploring the business value of multimodal RAG.
As RAG technology developed rapidly in 2024, more and more companies came to see RAG as a standard component of enterprise-facing large language model applications. Question answering over internal company documents has unlocked a large number of use cases and scenarios. A considerable portion of these documents contains complex charts and tables, which are essentially multimodal data. Answering questions over this data effectively has become one of the hard requirements for mining the gold mine of internal company data.
For this type of data, one solution is to use visual models: generalized OCR first recognizes the layout of the multimodal document, and the corresponding models are then called to process each semantic block, as shown in the figure below.
[Image: A diagram showing the process of using OCR technology to extract text from multimodal documents, including images and tables.]
In this process, the extracted images and tables are typical multimodal data. By calling the corresponding model for each block and converting it into text, the problem of understanding multimodal data is solved.
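To make the dispatch step concrete, here is a minimal sketch of such a pipeline in Python. The layout-analysis and conversion calls (detect_layout, ocr_text, table_to_text, chart_to_text) are hypothetical placeholders for whatever visual models a real system would plug in; only the routing logic reflects the process described above.

```python
from dataclasses import dataclass
from typing import Callable

# A semantic block produced by layout analysis of a document page.
@dataclass
class Block:
    kind: str          # "text", "table", "chart", ...
    image: bytes       # cropped region of the page
    text: str = ""     # filled in after conversion

# Hypothetical model calls; in a real system these would wrap an OCR engine,
# a table-recognition model, and a chart-to-text model respectively.
def detect_layout(page_image: bytes) -> list[Block]:
    raise NotImplementedError("plug in a document layout-analysis model")

def ocr_text(image: bytes) -> str:
    raise NotImplementedError("plug in an OCR model")

def table_to_text(image: bytes) -> str:
    raise NotImplementedError("plug in a table-recognition model")

def chart_to_text(image: bytes) -> str:
    raise NotImplementedError("plug in a chart-understanding model")

# Route each semantic block to the model that handles its type,
# turning every multimodal region into plain text for downstream RAG.
CONVERTERS: dict[str, Callable[[bytes], str]] = {
    "text": ocr_text,
    "table": table_to_text,
    "chart": chart_to_text,
}

def page_to_text(page_image: bytes) -> list[Block]:
    blocks = detect_layout(page_image)
    for block in blocks:
        convert = CONVERTERS.get(block.kind, ocr_text)
        block.text = convert(block.image)
    return blocks
```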
In principle, this technology can be divided into two generations:
- First generation: This generation trains separate visual models for different types of chart data and uses them to convert the charts into text. For example, tables are handled by table recognition models, while flowcharts, pie charts, bar charts, and other business charts each require their own corresponding model. These visual models are essentially classification models.
- Second generation: This generation uses generative models. Unlike popular LLMs, which use a decoder-only architecture, Transformer-based multimodal generative models typically use an encoder-decoder architecture: the encoder takes the various charts as input, and the decoder outputs the corresponding text.
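As an illustration of this second-generation, encoder-decoder approach, the sketch below uses the openly available DePlot checkpoint (a Pix2Struct model) through Hugging Face Transformers to translate a chart image into a plain-text data table. The checkpoint name, prompt, and file path are assumptions chosen for illustration; other chart-to-text models follow the same encode-then-decode pattern.

```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Assumed checkpoint: an encoder-decoder (Pix2Struct) model trained to
# translate chart images into textual data tables.
MODEL_ID = "google/deplot"

processor = Pix2StructProcessor.from_pretrained(MODEL_ID)
model = Pix2StructForConditionalGeneration.from_pretrained(MODEL_ID)

def chart_image_to_text(path: str) -> str:
    """Encode a chart image and decode it into a plain-text data table."""
    image = Image.open(path)
    inputs = processor(
        images=image,
        text="Generate underlying data table of the figure below:",
        return_tensors="pt",
    )
    output_ids = model.generate(**inputs, max_new_tokens=512)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Example (hypothetical file): text = chart_image_to_text("quarterly_revenue_bar_chart.png")
```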
Relying on this generalized OCR technology, a multimodal RAG system can be transformed into a standard RAG system. In our open-source and commercial RAGFlow, we have…
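In general terms, once every chart and table has been converted to text, indexing and retrieval work exactly as in a text-only RAG system. The sketch below assumes a generic embed function and a toy in-memory index; a production system would use a real embedding model and a vector database instead.

```python
import numpy as np

# Hypothetical embedding call; in practice this would be an embedding model
# or an embedding API.
def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in an embedding model")

class TinyIndex:
    """A toy in-memory vector index standing in for a real vector database."""

    def __init__(self) -> None:
        self.chunks: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, chunk: str) -> None:
        # Every block produced by the layout/conversion step -- plain text,
        # table, or chart -- is now just a text chunk, indexed like any other.
        self.chunks.append(chunk)
        self.vectors.append(embed(chunk))

    def search(self, query: str, k: int = 5) -> list[str]:
        # Rank chunks by cosine similarity to the query embedding.
        q = embed(query)
        scores = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
                  for v in self.vectors]
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        return [self.chunks[i] for i in top]
```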
Conclusion:
Multimodal RAG is poised to revolutionize how we interact with information. By enabling us to ask questions and get answers from a wide range of data sources, including images, tables, and text, it has the potential to unlock new insights and drive innovation across various industries. As the technology continues to evolve, we can expect to see even more exciting applications of multimodal RAG in the future.