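Shanghai AI Laboratory recently announced the release of "WanJuan-CC", a new-generation, high-quality pre-training corpus for large models. The corpus covers publicly available internet content from the past decade and contains 100 billion characters, about 400 GB of high-quality English data. As the first open-source corpus released this year by the Large Model Corpus Data Alliance, "WanJuan-CC" will give academia and industry large-scale, high-quality data support and help build more intelligent and reliable large AI models.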
English title: Shanghai AI Lab Unveils High-Quality Pre-training Corpus ‘WanJuan-CC’ for Next-Gen AI Models
English keywords: AI Model, Open Source, Corpus
English news content: The Shanghai AI Lab has announced the release of ‘WanJuan-CC’, a new high-quality corpus for pre-training AI models. The corpus covers publicly available internet content from the past decade and consists of 100 billion characters, approximately 400 GB of high-quality English data. As the first open-source corpus released this year by the ‘Large Model Corpus Data Alliance’, ‘WanJuan-CC’ will provide large-scale, high-quality data support for both academia and industry, aiding the development of more intelligent and reliable AI models.
Source: https://mp.weixin.qq.com/s/Pt02LXlh2Uu_hgM0ZL5GGg