**News Title:** Shanghai AI Lab Open-Sources “WanJuan-CC”: A 100-Billion-Character Corpus for Large AI Models
**Keywords:** Shanghai AI Lab, WanJuan-CC, Open-source Corpus
**News Content:**
The Shanghai Artificial Intelligence Laboratory recently open-sourced WanJuan-CC (“万卷CC”), a new high-quality pre-training corpus. Drawn from a decade of publicly available internet content, it contains about 100 billion characters, roughly 400 GB of English data, making it one of the larger data resources released to date.
WanJuan-CC is the first open-source project this year from the “Large Model Corpus Data Alliance,” and it reflects the Shanghai AI Lab’s commitment to open collaboration and resource sharing in AI. The release is intended to give academia and industry strong data support for developing and deploying more capable and reliable large AI models.
According to the lab, the data were carefully filtered and processed to ensure quality and accuracy and to meet the demands of AI model training. The lab expects the open-source release to speed up innovation in AI and to encourage collaboration among researchers and developers worldwide, helping push the boundaries of artificial intelligence.
The release is expected to energize global AI research and development and to help improve the performance and capability of AI models. As more researchers and companies train models on WanJuan-CC, further advances and applications in AI are likely to follow.
Source: https://mp.weixin.qq.com/s/Pt02LXlh2Uu_hgM0ZL5GGg