Shanghai AI Laboratory Releases New Corpus “WanJuan-CC”

By 智能小编

March 25, 2024 #Artificial Intelligence, #Daily AI News

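Shanghai AI Laboratory recently announced the official release of “WanJuan-CC” (万卷CC), a new-generation, high-quality pre-training corpus for large models. The corpus covers publicly available internet content from the past decade and totals 100 billion characters, equivalent to 400 GB of high-quality English data. It is the first open-source corpus released this year by the Large Model Corpus Data Alliance (大模型语料数据联盟), and it is expected to give academia and industry a valuable data resource and to support the development of more intelligent and reliable large AI models.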

English title: Shanghai AI Lab Releases New Corpus “WanJuan-CC” for Open-Source AI Model Training

English keywords: AI, Corpus, Large-scale Model Training

English news content: The Shanghai AI Lab has recently announced the launch of “WanJuan-CC,” a new-generation, high-quality corpus for AI model pre-training. Covering public internet content from the past decade, WanJuan-CC contains 100 billion characters, equivalent to 400 GB of high-quality English data. As the first open-source corpus released by the Large-scale Model Corpus Data Alliance this year, WanJuan-CC will provide a valuable resource for both the academic and industrial communities, facilitating the development of more intelligent and reliable large AI models.

Source: https://mp.weixin.qq.com/s/Pt02LXlh2Uu_hgM0ZL5GGg