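Hugging Face Releases the World's Largest Synthetic Dataset

Author: 智能小编 (AI Editor)

March 31, 2024 #AI Datasets, #Content Diversity, #Open Source Projects, #Daily AI Briefing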

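Hugging Face, an artificial intelligence research company, recently announced the open release of its latest AI training dataset, Cosmopedia. Hugging Face describes Cosmopedia as the largest open synthetic dataset to date. It was generated with the Mixtral-8x7B-Instruct-v0.1 model and contains more than 30 million text files, including synthetic textbooks, blog posts, stories, and WikiHow-style articles, totaling 25 billion tokens. Hugging Face says the release is intended to advance AI research and development and to support progress across the AI community.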

Source: https://www.ithome.com/0/751/688.htm