Shanghai AI Laboratory Releases “WanJuan-CC” to Advance Intelligent Large Models

Author: 智能小编

March 21, 2024 #Artificial Intelligence, #Daily AI News

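The Shanghai Artificial Intelligence Laboratory (Shanghai AI Laboratory) recently announced the release of “WanJuan-CC” (万卷CC), a new-generation, high-quality pre-training corpus for large models. The corpus covers publicly available internet content from the past ten years and contains 100 billion characters, about 400GB of high-quality English data. As this year's first open-source project from the “Large Model Corpus Data Alliance” (大模型语料数据联盟), “WanJuan-CC” will give academia and industry key data support and help drive the development of smarter, more reliable large AI models.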

English title: Shanghai AI Laboratory Releases High-Quality Dataset “WanJuan-CC” to Advance Smarter, More Reliable Large Models

English keywords: Artificial Intelligence, Large Model, Open-Source Data

English news content:
The Shanghai AI Laboratory has recently announced the release of “WanJuan-CC,” a new-generation, high-quality dataset for pre-training large models. Covering ten years of publicly available internet content, the dataset contains 100 billion characters, or approximately 400GB of high-quality English data. As the first open-source project of the “Large Model Corpus Data Alliance” this year, “WanJuan-CC” will provide critical data support to both academia and industry, driving the development of smarter and more reliable large AI models.

Source: https://mp.weixin.qq.com/s/Pt02LXlh2Uu_hgM0ZL5GGg