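Shanghai AI Laboratory recently announced the official release of “WanJuan-CC” (万卷CC), a new-generation, high-quality pre-training corpus for large models. The corpus covers publicly available internet content from the past decade and totals 100 billion characters, equivalent to 400 GB of high-quality English data. It is the first open-source corpus released this year by the Large-scale Model Corpus Data Alliance (大模型语料数据联盟), and it is intended to give academia and industry a valuable data resource and to support the development of more intelligent and reliable large AI models.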

English title: Shanghai AI Lab Releases New Open-Source Corpus “WanJuan-CC” for AI Model Training

English keywords: AI, Corpus, Large-scale Model Training

English news content: Shanghai AI Laboratory has announced the launch of “WanJuan-CC,” a new-generation, high-quality corpus for pre-training large AI models. Covering public internet content from the past decade, WanJuan-CC contains 100 billion characters, equivalent to 400 GB of high-quality English data. As the first open-source corpus released this year by the Large-scale Model Corpus Data Alliance, WanJuan-CC gives both the academic and industrial communities a valuable data resource and supports the development of more intelligent and reliable large AI models.

Source: https://mp.weixin.qq.com/s/Pt02LXlh2Uu_hgM0ZL5GGg
