Hugging Face Releases Cosmopedia, the World's Largest Synthetic Dataset

Author: 智能小编

March 20, 2024 #AI training datasets, #synthetic data, #daily AI news

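Hugging Face recently announced that it has open-sourced an AI training dataset named "Cosmopedia", which is described as the largest synthetic dataset in the world to date. Cosmopedia was generated with the Mixtral 8x7B model and contains more than 30 million text files, spanning textbooks, blog posts, stories, and WikiHow tutorials, for a total of 25 billion tokens. The release marks a notable step forward for synthetic training data and gives researchers and developers a valuable resource for study and research.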
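For a rough sense of scale, the two figures quoted above (more than 30 million files, 25 billion tokens in total) imply an average document length. The sketch below is only back-of-the-envelope arithmetic on those reported numbers; the function name `average_tokens_per_file` is a hypothetical helper, and because the file count is reported as "more than" 30 million, the result should be read as an approximate upper bound rather than a measured statistic.

```python
def average_tokens_per_file(total_tokens: int, file_count: int) -> float:
    """Return the mean number of tokens per file (hypothetical helper)."""
    if file_count <= 0:
        raise ValueError("file_count must be positive")
    return total_tokens / file_count


# Figures as reported in the article, not measured from the dataset itself.
TOTAL_TOKENS = 25_000_000_000  # 25 billion tokens
FILE_COUNT = 30_000_000        # "more than 30 million" files, so a lower bound

avg = average_tokens_per_file(TOTAL_TOKENS, FILE_COUNT)
print(f"Roughly {avg:.0f} tokens per file at most")
```

Under these assumptions the average works out to a little over 800 tokens per file, which is consistent with a collection of short articles, tutorials, and stories rather than book-length texts.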

English title: Hugging Face Releases Cosmopedia, the World’s Largest Synthetic Dataset for AI Training

English keywords: AI training dataset, synthetic data, Mixtral 8x7B model, text documents, tokens

English news content:
Hugging Face has recently announced the open-source release of Cosmopedia, which is claimed to be the world’s largest synthetic dataset for AI training. Generated with the Mixtral 8x7B model, Cosmopedia comprises more than 30 million text files, including textbooks, blog posts, stories, and WikiHow tutorials, totaling 25 billion tokens. The release is a notable step forward for synthetic training data and serves as a valuable resource for researchers and developers alike.

[Source] https://www.ithome.com/0/751/688.htm