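Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab) has announced the release of “WanJuan-CC” (万卷CC), a new-generation, high-quality pre-training corpus for large models. The corpus covers publicly available internet content from the past ten years and totals 100 billion characters, amounting to roughly 400GB of high-quality English data. As this year's first open-source project from the “Large Model Corpus Data Alliance,” “WanJuan-CC” is intended to give academia and industry key data support and to help advance smarter, more reliable large AI models.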

English title: Shanghai AI Lab Releases High-Quality Dataset “WanJuan-CC” to Advance Smarter, More Reliable Large Models

English keywords: Artificial Intelligence, Large Model, Open-Source Data

English news content:
Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab) recently announced the release of “WanJuan-CC,” a new-generation, high-quality corpus for pre-training large models. Drawn from ten years of publicly available internet content, the corpus contains 100 billion characters, or approximately 400GB of high-quality English data. As the first open-source project this year from the “Large Model Corpus Data Alliance,” “WanJuan-CC” is expected to provide critical data support to both academia and industry and to drive the development of smarter, more reliable large AI models.

Source: https://mp.weixin.qq.com/s/Pt02LXlh2Uu_hgM0ZL5GGg
