上海AI实验室发布千亿字符高质量语料“万卷CC”

上海人工智能实验室（简称上海AI实验室）近日发布了一项名为“万卷CC”（WanJuan-CC）的高质量大模型预训练语料。这是“大模型语料数据联盟”今年首发的开源语料，其规模之大、质量之高，在业界和学界引起了广泛关注。

据悉，首批开源的“万卷CC”语料覆盖了过去十年互联网上的公开内容，总量达到1千亿字符（100B token），约400GB的高质量英文数据。如此庞大的数据集，不仅为AI模型的训练提供了丰富的原料，也为AI的研究和发展提供了强有力的支撑。

作为高质量的大模型预训练语料，万卷CC在内容的丰富性和多样性上具有显著优势。它不仅包含了各类文本信息，如新闻、博客、论坛讨论等，还覆盖了多种语言、多种领域的知识，使得AI模型在训练过程中能够更好地理解和应对复杂场景。

上海AI实验室此次发布的“万卷CC”语料，将为学界和业界提供大规模、高质量的数据支撑，助力构建更智能可靠的AI大模型。这对于推动我国AI技术的发展，提高AI应用的智能化水平，具有重要的意义。

同时，开源的万卷CC语料也体现了上海AI实验室致力于AI领域学术研究的决心。通过共享高质量的数据资源，上海AI实验室希望能够推动AI领域的创新，促进学术交流与合作，为全球AI技术的发展贡献力量。

总之，上海AI实验室发布的“万卷CC”高质量大模型预训练语料，将为我国乃至全球的AI研究和发展提供强大的数据支持。在未来的AI技术发展中，我们有理由相信，万卷CC将为构建更智能、更可靠的AI大模型发挥重要作用。

English version:

# Shanghai AI Lab Releases 100-Billion-Character High-Quality Corpus “WanJuan-CC”

**Keywords:** Shanghai AI Lab, WanJuan-CC Release, High-Quality Corpus

**News Content:**

The Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab) recently released “WanJuan-CC,” a high-quality pre-training corpus for large models. It is the first open-source corpus launched this year by the “Large Model Corpus Data Alliance,” and its scale and quality have attracted wide attention in both industry and academia.

Reportedly, the first open-source batch of “WanJuan-CC” covers publicly available internet content from the past decade and totals 100 billion characters (100B tokens), or roughly 400 GB of high-quality English data. A dataset of this size provides abundant raw material for training AI models and strong support for AI research and development.

As a high-quality pre-training corpus for large models, WanJuan-CC offers significant advantages in the richness and diversity of its content. It includes many types of text, such as news, blogs, and forum discussions, and covers knowledge in multiple languages and fields, which helps AI models trained on it better understand and handle complex scenarios.

The “WanJuan-CC” corpus released by the Shanghai AI Lab will provide large-scale, high-quality data support for academia and industry, helping to build smarter and more reliable large AI models. This matters for advancing AI technology in China and for making AI applications more capable.

At the same time, the open-source WanJuan-CC corpus also demonstrates the Shanghai AI Lab’s commitment to academic research in the field of AI. By sharing high-quality data resources, the Shanghai AI Lab hopes to promote innovation in the AI field, facilitate academic exchanges and cooperation, and contribute to the development of global AI technology.

In summary, the high-quality pre-training corpus “WanJuan-CC” released by the Shanghai AI Lab will provide strong data support for AI research and development in China and worldwide. There is good reason to believe that WanJuan-CC will play an important role in building smarter and more reliable large AI models as the technology develops.

Source: https://mp.weixin.qq.com/s/Pt02LXlh2Uu_hgM0ZL5GGg