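Hugging Face recently announced that it has open-sourced an AI training dataset named "Cosmopedia", which is considered the largest synthetic dataset in the world to date. Cosmopedia was generated with the Mixtral 8x7B model and contains more than 30 million text files, including textbooks, blog posts, stories, and WikiHow-style tutorials, for a total of 25 billion tokens. The release marks a step forward for AI in data processing and analysis, and gives researchers and developers a valuable resource for study and research.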

English title: Hugging Face Releases Cosmopedia, the World’s Largest Synthetic Dataset for AI Training

English keywords: AI training dataset, synthetic data, Mixtral 8x7B model, text documents, tokens

English news content:
Hugging Face has recently announced the open-source release of Cosmopedia, which is claimed to be the world’s largest synthetic dataset for AI training. Generated with the Mixtral 8x7B model, Cosmopedia comprises more than 30 million text files, including textbooks, blog posts, stories, and WikiHow-style tutorials, totaling 25 billion tokens. The release marks a step forward for AI in data processing and analysis, and serves as a valuable resource for researchers and developers alike.

Source: https://www.ithome.com/0/751/688.htm
