**News Title:** Shanghai AI Lab Open-Sources the WanJuan-CC Corpus: 100 Billion Characters to Support a New Generation of Large AI Models
**Keywords:** Shanghai AI Lab, WanJuan-CC, Open-source Corpus
**News Content:**
The Shanghai Artificial Intelligence Laboratory recently announced the open-source release of WanJuan-CC (万卷CC), its new high-quality pre-training corpus for large models, marking an important step in building data resources for artificial intelligence. The corpus covers publicly available internet content from the past decade and totals 100 billion characters (100B tokens), or roughly 400GB of English data, making it one of the larger high-quality corpus resources released to date.
As the first open-source project promoted this year by the Large Model Corpus Data Alliance, WanJuan-CC is intended to give academia and industry solid data support for developing more intelligent and more reliable large AI models. The release is expected to lower barriers to data access and encourage more researchers and developers to take part in AI innovation, helping to speed the development and adoption of AI technology worldwide.
The move demonstrates Shanghai AI Lab's technical strength and its commitment to openness in artificial intelligence, and it gives global AI research and applications a shared, collaborative platform. With the WanJuan-CC corpus now open source, more advanced AI results can be expected to emerge, supporting deeper intelligent transformation across industries.
Source: https://mp.weixin.qq.com/s/Pt02LXlh2Uu_hgM0ZL5GGg