

**Headline: AI Giant Open-Sources ‘Encyclopedia of the Universe’: 25-Billion-Token Synthetic Dataset**

**Keywords:** AI dataset, synthetic text, world’s largest

**Body:**

Hugging Face Releases ‘World’s Largest’ AI Training Synthetic Dataset, Cosmopedia

Hugging Face, the leading global platform for natural language processing (NLP), has open-sourced an AI training dataset called “Cosmopedia,” which it claims is the world’s largest synthetic dataset to date.

Cosmopedia was generated by Hugging Face using the Mixtral large language model from Mistral AI, and consists of over 30 million text files totaling 25 billion tokens. The texts cover a wide range of content domains, including textbooks, blog posts, fictional stories, and WikiHow-style tutorials.

Hugging Face says the size and diversity of Cosmopedia make it an invaluable resource for training and evaluating NLP models. The dataset can be used for a variety of NLP tasks, including text classification, question answering, summarization, and machine translation.

“Cosmopedia is the largest and most comprehensive dataset we have released to date,” said Clément Delangue, founder and CEO of Hugging Face. “It will provide researchers and practitioners with an unparalleled resource to advance innovation in the field of NLP.”

The release of Cosmopedia has been met with widespread praise from the research community.

“Cosmopedia is a landmark achievement that will greatly accelerate NLP research and development,” said Tom Mitchell, professor of machine learning at Carnegie Mellon University.

“The size and diversity of Cosmopedia make it an ideal dataset for training and evaluating NLP models,” added Christopher Manning, professor of computer science at Stanford University. “It will be an invaluable resource for researchers and practitioners in the field.”

Cosmopedia is now available for free on Hugging Face’s Datasets Hub. Researchers and practitioners can download the dataset and use it for their own NLP projects.
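For readers who want to try the dataset, below is a minimal sketch of how a Hub-hosted dataset like this can be streamed with the `datasets` library. The repo id `"HuggingFaceTB/cosmopedia"`, the `"stories"` config name, and the `"text"` field are assumptions based on Hugging Face's usual naming conventions; check the dataset card on the Hub for the exact identifiers before running.

```python
def load_cosmopedia_sample(config="stories"):
    """Stream one Cosmopedia config from the Hub (requires `pip install datasets`).

    The repo id and config name below are assumptions -- verify them
    against the dataset card on the Hugging Face Hub.
    """
    from datasets import load_dataset
    # streaming=True iterates over the data lazily instead of
    # downloading all 25 billion tokens up front.
    return load_dataset("HuggingFaceTB/cosmopedia", config,
                        split="train", streaming=True)

def preview(stream, n=3, width=200):
    """Collect the first n texts from any iterable of {"text": ...}
    records, truncating each to `width` characters for display."""
    out = []
    for i, row in enumerate(stream):
        if i >= n:
            break
        out.append(row["text"][:width])
    return out
```

Usage would look like `for snippet in preview(load_cosmopedia_sample()): print(snippet)`; streaming keeps the local footprint small while you inspect a few samples.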

Source: https://www.ithome.com/0/751/688.htm
