Title: CogVideoX, an Open-Source Video Generation Model by Zhipu AI, Surges in Popularity
Keywords: Open Source, Video Generation, AI Innovation
Content: Zhipu AI, a company specializing in artificial intelligence technology, recently announced the open-sourcing of its proprietary large model CogVideoX, a move that has drawn a strong response across the industry. Within just a few hours, CogVideoX garnered over 4,000 stars on GitHub, a sign of its popularity in the open-source community.
CogVideoX builds on a video compression method based on a 3D Variational Autoencoder (3D VAE), which can efficiently handle video data that carries both spatial and temporal information. By using three-dimensional convolutions, the model achieves higher compression ratios together with better reconstruction quality. In addition, Zhipu AI employs context parallelism to cope with the demands of large-scale video processing.
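To make the compression step concrete, here is a minimal PyTorch sketch of a 3D-convolutional VAE encoder of the kind described above. It is not Zhipu AI's released code: the layer widths, strides, and the roughly 4x temporal / 8x spatial downsampling are illustrative assumptions.

```python
# Minimal sketch of a 3D-convolutional video encoder in PyTorch.
# NOT Zhipu AI's CogVideoX code; channel sizes, strides, and the
# ~4x (time) / 8x (height, width) compression are illustrative assumptions.
import torch
import torch.nn as nn

class Toy3DVAEEncoder(nn.Module):
    def __init__(self, in_channels: int = 3, latent_channels: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            # Each Conv3d halves one or more of (frames, height, width).
            nn.Conv3d(in_channels, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(128, 256, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
        )
        # Predict mean and log-variance of the latent distribution (VAE).
        self.to_moments = nn.Conv3d(256, 2 * latent_channels, kernel_size=3, padding=1)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, frames, height, width)
        h = self.net(video)
        mean, logvar = self.to_moments(h).chunk(2, dim=1)
        # Reparameterization trick: sample latents during training.
        return mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)

# A 16-frame 256x256 clip is compressed into a much smaller spatiotemporal latent.
latents = Toy3DVAEEncoder()(torch.randn(1, 3, 16, 256, 256))
print(latents.shape)  # torch.Size([1, 16, 4, 32, 32])
```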
In terms of model architecture, Zhipu AI uses the VAE encoder to compress the video into a latent space, then splits the latent into patches and unfolds them into a long sequence of embeddings. In parallel, the text prompt is encoded into text embeddings with T5, and the two sequences are concatenated along the sequence dimension. The concatenated embeddings are fed through a stack of expert Transformer blocks; the processed embeddings are then un-patchified to restore the original latent shape and decoded with the VAE to reconstruct the video.
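The data flow described in this paragraph can be summarized in a short, hedged sketch. The module names (`vae_encoder`, `t5_encoder`, `expert_transformer`, `vae_decoder`), the patch size, and the embedding width are hypothetical stand-ins rather than the actual CogVideoX configuration.

```python
# Hedged sketch of the encode -> patchify -> concatenate -> transform -> decode
# flow described above. Module names, shapes, and the patch size are assumptions.
import torch
import torch.nn as nn

def generate_video_latents(video, text_ids, vae_encoder, t5_encoder,
                           expert_transformer, vae_decoder, patch=2, dim=1024):
    # 1) Compress the video into a spatiotemporal latent with the 3D VAE encoder.
    z = vae_encoder(video)                                   # (B, C, T, H, W)
    B, C, T, H, W = z.shape

    # 2) Patchify the latent and unfold it into a long sequence of embeddings.
    vision_tokens = (
        z.unfold(3, patch, patch).unfold(4, patch, patch)    # split H and W into patches
         .permute(0, 2, 3, 4, 1, 5, 6)                       # (B, T, H/p, W/p, C, p, p)
         .reshape(B, -1, C * patch * patch)                  # (B, T*H/p*W/p, C*p*p)
    )
    # Linear projections are created inline only to keep the sketch short;
    # in a real model they are learned modules.
    vision_tokens = nn.Linear(C * patch * patch, dim)(vision_tokens)

    # 3) Encode the text prompt with T5 (assumed to return (B, L, dim) embeddings).
    text_tokens = t5_encoder(text_ids)

    # 4) Concatenate text and vision tokens along the sequence dimension and
    #    run the expert Transformer stack over the joint sequence.
    seq = torch.cat([text_tokens, vision_tokens], dim=1)
    seq = expert_transformer(seq)

    # 5) Drop the text positions, un-patchify back to the latent shape,
    #    and decode with the VAE to reconstruct the video.
    out = seq[:, text_tokens.shape[1]:]
    out = nn.Linear(dim, C * patch * patch)(out)
    out = (out.reshape(B, T, H // patch, W // patch, C, patch, patch)
              .permute(0, 4, 1, 2, 5, 3, 6)
              .reshape(B, C, T, H, W))
    return vae_decoder(out)
```

Note that the sketch only mirrors the tensor reshaping and module order described in the article; it omits the iterative sampling loop that an actual generation pass would perform.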
For data filtering, Zhipu AI developed a set of negative labels to identify and exclude low-quality videos, safeguarding the quality of the training data. Using filters trained with video-llama, the team annotated and screened 20,000 video data points, providing high-quality material for training the video generation model.
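As a rough illustration of how negative labels and a learned filter might be combined in such a screening step, consider the sketch below. The label names, the 0.5 threshold, and the `video_llama_score` callable are invented for illustration; the article does not describe the real filtering interface.

```python
# Illustrative screening step: reject clips carrying negative labels, then
# apply a learned quality filter. All names and thresholds are hypothetical.
NEGATIVE_LABELS = {"heavy_watermark", "static_frame", "screen_recording", "low_resolution"}

def keep_clip(clip_metadata: dict, video_llama_score) -> bool:
    """Return True if a clip passes the quality screen."""
    # Reject any clip annotated with a negative label.
    if NEGATIVE_LABELS & set(clip_metadata.get("labels", [])):
        return False
    # Otherwise score the clip with the learned filter and apply a threshold.
    return video_llama_score(clip_metadata["path"]) >= 0.5

# Example: filtering a small annotated batch with a dummy scoring function.
batch = [
    {"path": "clip_001.mp4", "labels": []},
    {"path": "clip_002.mp4", "labels": ["static_frame"]},
]
kept = [c for c in batch if keep_clip(c, video_llama_score=lambda path: 0.9)]
print(len(kept))  # 1
```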
The release of the open-source CogVideoX model not only gives researchers and developers a valuable tool, but also injects new vitality into the development of video generation in China. As more powerful models with larger parameter counts are released, video generation technology will continue to mature, bringing more innovative applications to a wide range of industries.
Source: https://www.jiqizhixin.com/articles/2024-08-06-10