Title: Hugging Face Releases Cosmopedia, Billed as the Largest Synthetic Dataset, for AI Model Development
Keywords: AI model development, synthetic data, Hugging Face
Hugging Face, a well-known artificial intelligence company, recently announced the open-source release of an AI training dataset called "Cosmopedia," which it describes as the largest synthetic dataset in the world to date. The dataset was generated by the Mixtral-8x7B-Instruct-v0.1 model and contains more than 30 million text files, including textbooks, blog posts, stories, and WikiHow tutorials, for a total of 25 billion tokens. The release is intended to advance the development of AI models by giving researchers and developers a large body of training material.
Source: https://www.ithome.com/0/751/688.htm