Hugging Face Open-Sources the Massive Cosmopedia Dataset: Building…

Author: 智能小编

April 9, 2024 #AI Datasets, #Cosmopedia, #Daily AI News

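[Hugging Face open-sources Cosmopedia, billed as the world's largest synthetic AI training dataset] Hugging Face, the well-known AI platform, recently announced the open-source release of its newest AI training dataset, Cosmopedia. Cosmopedia is said to be the largest synthetic dataset released to date, and it is intended to give AI model training richer and more varied material.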

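Cosmopedia was generated by Hugging Face using the Mixtral 8x7B model. It contains more than 30 million text files covering textbooks, blog posts, stories, and WikiHow tutorials, for a total of 25 billion tokens. A corpus of this size is expected to support further progress in natural language processing.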

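Hugging Face says it is opening Cosmopedia to promote fairness and transparency in AI research, so that researchers and developers worldwide can access and use the data for free, improve the performance of their AI models, and keep pushing the boundaries of AI technology. The move is expected to give new momentum to innovation in the field, and it reflects Hugging Face's continued contribution to and support for the open-source community.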

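With the release of Cosmopedia, we can expect AI's ability to understand and generate natural language to improve noticeably, opening up new possibilities for AI applications in education, media, technology, and other fields.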

English version follows:

**News Title:** Hugging Face Open-Sources Cosmopedia, Billed as the World's Largest Synthetic AI Training Dataset

**Keywords:** Hugging Face, Cosmopedia, AI dataset

**News Content:**

**Hugging Face Unveils Cosmopedia, Billed as the World's Largest Open Synthetic AI Training Dataset** Hugging Face, the well-known artificial intelligence platform, recently announced the open-source release of its latest AI training dataset, Cosmopedia. The company describes it as the largest synthetic dataset released to date and says it is intended to provide a richer and more diverse resource for training AI models.

Generated by Hugging Face using the Mixtral 8x7B model, Cosmopedia comprises more than 30 million text files, including textbooks, blog posts, stories, and WikiHow tutorials, for a total of 25 billion tokens. A corpus of this size is expected to support further progress in natural language processing.
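As a rough sanity check on the scale, the average file length can be worked out from the two headline numbers. The Python sketch below assumes the reported, rounded totals of about 30 million files and 25 billion tokens; the function and constant names are illustrative only and are not part of any Hugging Face tooling.

```python
def average_tokens_per_file(total_tokens: int, total_files: int) -> float:
    """Return the mean number of tokens per file."""
    if total_files <= 0:
        raise ValueError("total_files must be positive")
    return total_tokens / total_files


# Rounded figures as reported in the article (assumptions, not exact counts).
TOTAL_FILES = 30_000_000
TOTAL_TOKENS = 25_000_000_000

avg = average_tokens_per_file(TOTAL_TOKENS, TOTAL_FILES)
print(f"About {avg:.0f} tokens per file on average")
```

Under these assumptions the average comes to roughly 833 tokens per file, which suggests that a typical entry is a short article or story rather than a full-length book.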

Hugging Face states that the goal of open-sourcing Cosmopedia is to foster fairness and transparency in AI research by giving researchers and developers worldwide free access to the data, so they can improve the performance of their AI models and push the boundaries of artificial intelligence technology. The move is expected to give new momentum to innovation in the AI field, and it underscores Hugging Face's ongoing commitment to and support for the open-source community.

With the introduction of Cosmopedia, we can anticipate a substantial improvement in AI’s ability to understand and generate natural language, opening up new possibilities for AI applications in education, media, technology, and beyond.

Source: https://www.ithome.com/0/751/688.htm