Shanghai AI Laboratory Open-Sources a Massive Corpus: WanJuan-CC, Building an Intelligent Future!

By 智能小编

April 13, 2024 #WanJuan-CC, #Shanghai AI Laboratory, #Daily AI News

News Report

The Shanghai Artificial Intelligence Laboratory recently open-sourced “WanJuan-CC” (万卷CC), a high-quality pre-training corpus for large models drawn from the past ten years of publicly available internet content. The release comprises about 100 billion characters, which the announcement equates to 100B tokens, or roughly 400GB of English data, and is described as a major milestone for the industry.

WanJuan-CC is the first open-source project promoted this year by the Large Model Corpus Data Alliance (大模型语料数据联盟), and it is intended to give academia and industry a solid data foundation. The release is expected to speed up AI model development and help researchers and developers build more intelligent and more reliable AI systems. With large-scale, high-quality data of this kind, researchers may achieve new breakthroughs in areas such as natural language processing and machine learning.

The release reflects the lab's leading position in artificial intelligence as well as China's active support for open science and resource sharing. It is also expected to encourage global research collaboration, help advance AI technology, and bring new momentum to future scientific progress.

Source: https://mp.weixin.qq.com/s/Pt02LXlh2Uu_hgM0ZL5GGg