News Report

News Title: "Tsinghua Team Collaborates with ByteDance to Develop an Open-Source Video Q&A Model and Wins the CVPR'24 Long Video Question Answering Challenge"

Keywords: AIxiv Column, Long Video Q&A, Multimodal Large Models

News Content: In the field of artificial intelligence, a team of students and faculty from Tsinghua Shenzhen International Graduate School, working with technology companies including ByteDance, has developed an open-source "video version of GPT-4" built on large language models (LLMs) such as ChatGPT, LLaMA, and Vicuna. The tool demonstrates strong understanding, generation, and reasoning capabilities and took first place in the CVPR'24 long video question answering challenge.

The research team, led by undergraduate Zhang Haoji, master's student Wang Yiqin, and assistant professor Tang Yansong, drew on the success of LLMs in image understanding, exemplified by MiniGPT-4 and LLaVA, and extended these techniques to the video domain. The work was guided and supported by Dr. Feng Jiashi, head of vision research at ByteDance, and Dr. Dai Jifeng, associate professor in the Department of Electronic Engineering at Tsinghua University. Through the team's joint effort, the tool billed as an "open-source video version of GPT-4" came into being.

Built on large multimodal models (LMMs), the tool supports rapid memorization of long videos and real-time question answering. Its performance in the CVPR'24 long video question answering challenge earned it the championship and demonstrated its strength in video understanding and Q&A.
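The article does not explain how the system is implemented, but the "rapid memorization, real-time Q&A" description matches a pattern common to long-video question-answering tools: sample frames once, encode them into a cached feature memory, then answer each question against that memory with a language model. The sketch below illustrates only that generic pattern, not the team's actual method; every name in it (FrameMemory, encode_frame, encode_text, answer_question) is a hypothetical stand-in, with random vectors in place of a real vision encoder and LLM.

```python
# Illustrative sketch of a generic long-video Q&A pipeline: build a frame
# "memory" once, then answer questions against it. This is NOT the
# Tsinghua/ByteDance system; all components here are hypothetical stubs.

from dataclasses import dataclass, field

import numpy as np


@dataclass
class FrameMemory:
    """Caches per-frame features so questions can be answered without re-reading the video."""
    features: list = field(default_factory=list)    # one vector per sampled frame
    timestamps: list = field(default_factory=list)  # seconds into the video

    def add(self, timestamp: float, feature: np.ndarray) -> None:
        self.timestamps.append(timestamp)
        self.features.append(feature)

    def top_k(self, query: np.ndarray, k: int = 8) -> list:
        """Return timestamps of the k frames most similar to the query vector."""
        sims = [float(query @ f / (np.linalg.norm(query) * np.linalg.norm(f) + 1e-8))
                for f in self.features]
        order = np.argsort(sims)[::-1][:k]
        return [self.timestamps[i] for i in order]


def encode_frame(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a pretrained vision encoder (a real system would use a CLIP-style model)."""
    rng = np.random.default_rng(int(frame.sum()) % (2 ** 32))
    return rng.standard_normal(512)


def encode_text(text: str) -> np.ndarray:
    """Stand-in for a text encoder mapping a question into the same feature space."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal(512)


def answer_question(question: str, memory: FrameMemory) -> str:
    """Stand-in for an LLM call: here we only report which moments would be consulted."""
    relevant = memory.top_k(encode_text(question))
    return f"(LLM would answer '{question}' using frames at {relevant} seconds)"


if __name__ == "__main__":
    # Build the memory once ("rapid memorization"), sampling one frame per second.
    memory = FrameMemory()
    video = [np.random.rand(224, 224, 3) for _ in range(60)]  # fake 60-second video
    for second, frame in enumerate(video):
        memory.add(float(second), encode_frame(frame))

    # Each later question is answered against the cached memory ("real-time Q&A").
    print(answer_question("What happens near the end of the video?", memory))
```

In a real system the stubs would be replaced by a pretrained vision-language encoder and an LLM call; the point of the sketch is only the build-memory-once, query-many-times structure that makes real-time answers over long videos feasible.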

AIxiv is a column run by Jiqizhixin (机器之心) for publishing academic and technical content and promoting scholarly exchange. It has received and reported on more than 2,000 submissions from leading laboratories at universities and companies around the world, providing important support for academic progress and technological innovation in artificial intelligence. Researchers who would like to share outstanding work are welcome to submit articles or contact the column for coverage.

The success of this project not only reflects the technical depth and innovative capability of Tsinghua University and ByteDance in artificial intelligence, but also opens up new possibilities for applying AI to video understanding and question answering, pointing toward broader and deeper development in this area.

Source: https://www.jiqizhixin.com/articles/2024-07-08-12
