News Title: “Shanghai AI Lab’s Open-Source Release: The 100-Billion-Character ‘WanJuan-CC’ Corpus Powers a New Era for Large AI Models”
Keywords: Shanghai AI Lab, WanJuan-CC, Open-source Corpus
News Content: The Shanghai Artificial Intelligence Laboratory recently open-sourced its high-quality corpus, WanJuan-CC (“万卷CC”), making it available worldwide. The release marks an important step in data support for artificial intelligence. According to the official announcement, WanJuan-CC draws on public internet content from the past decade and contains about 100 billion characters, roughly equivalent to 400 GB of English data, which makes it one of the large-scale pre-training corpora available to date.
As the first open-source project promoted this year by the “Large Model Corpus Data Alliance,” WanJuan-CC is intended to give academia and industry a solid data foundation for developing more intelligent and reliable large AI models. The release is expected to support deeper research in natural language processing, machine learning, and related fields, and to open new possibilities for innovative applications in related industries, accelerating the iteration of AI technology.
The initiative showcases China’s leading strength in building foundational AI resources and its spirit of openness and sharing. It is expected to prompt a new round of global attention to, and cooperation on, AI model training data. Through WanJuan-CC, researchers and developers worldwide can access the data free of charge, jointly pushing the boundaries of AI technology and laying a firmer foundation for a future intelligent society.
Source: https://mp.weixin.qq.com/s/Pt02LXlh2Uu_hgM0ZL5GGg