Shanghai AI Laboratory Releases "WanJuan-CC", a New-Generation High-Quality Pre-training Corpus for Large Models

By 智能小编

March 15, 2024 #AI Model Open-Source Corpus, #Daily AI News

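Shanghai AI Laboratory (上海人工智能实验室) recently announced the release of "WanJuan-CC" (万卷CC), a new-generation high-quality pre-training corpus for large models. The corpus draws on publicly available internet content from the past decade and contains 100 billion characters, about 400 GB, of high-quality English data. As the first open-source corpus released this year by the Large Model Corpus Data Alliance (大模型语料数据联盟), "WanJuan-CC" is intended to give academia and industry large-scale, high-quality data support and to help build more intelligent and reliable large AI models.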

English title: Shanghai AI Lab Unveils High-Quality Pre-training Corpus ‘WanJuan-CC’ for Next-Gen AI Models

English keywords: AI Model, Open Source, Corpus

English news content: The Shanghai AI Lab has recently announced the release of a new high-quality corpus for AI model pre-training, named ‘WanJuan-CC’. The corpus covers publicly available internet content from the past decade and consists of 100 billion characters, approximately 400 GB, of high-quality English data. As the first open-source corpus released this year by the ‘Large Model Corpus Data Alliance’, ‘WanJuan-CC’ will provide large-scale, high-quality data support for both academia and industry, aiding the development of more intelligent and reliable AI models.

[Source] https://mp.weixin.qq.com/s/Pt02LXlh2Uu_hgM0ZL5GGg