Hugging Face Open-Sources Cosmopedia, the World's Largest AI Synthetic Dataset

By 智能小编 (AI Editor)

May 22, 2024 #Artificial Intelligence, #Open-Source Dataset, #Text Generation, #Daily AI News

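Hugging Face recently announced the open-source release of an AI training dataset named “Cosmopedia,” the largest synthetic dataset in the world to date. Generated by the Mixtral 8x7B model, it contains more than 30 million text files totaling 25 billion tokens and covers a range of content types, including textbooks, blog posts, stories, and WikiHow tutorials. Its release gives AI researchers a valuable open data resource and should help advance the training of AI models.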

English version:

News Title: “Hugging Face Open-Sources ‘Cosmopedia,’ the World’s Largest AI Synthetic Dataset”

Keywords: Open-source Dataset, Artificial Intelligence, Text Generation

News Content: Hugging Face recently announced the open-source release of an AI training dataset named “Cosmopedia,” the largest synthetic dataset to date. Generated by the Mixtral 8x7B model, it comprises over 30 million text files totaling 25 billion tokens. The dataset covers a variety of content types, including textbooks, blog posts, stories, and WikiHow tutorials. Its release provides a valuable data resource for artificial intelligence research and should aid the training and advancement of AI models.
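
To put the reported figures in perspective, the rough arithmetic below estimates the average length of a file in the dataset. It is only a back-of-the-envelope sketch: both constants are the rounded numbers quoted above ("over 30 million" files, "25 billion" tokens), and the function name is made up for illustration, so the result is an approximation and not an official statistic.

```python
# Back-of-the-envelope estimate from the figures reported in the article.
# Both constants are the article's rounded numbers, not exact counts.
TOTAL_TOKENS = 25_000_000_000  # "25 billion tokens"
NUM_FILES = 30_000_000         # "over 30 million text files"


def average_tokens_per_file(total_tokens: int, num_files: int) -> float:
    """Return the mean number of tokens per file (hypothetical helper)."""
    if num_files <= 0:
        raise ValueError("num_files must be positive")
    return total_tokens / num_files


if __name__ == "__main__":
    avg = average_tokens_per_file(TOTAL_TOKENS, NUM_FILES)
    print(f"Roughly {avg:.0f} tokens per file on average")
```

Because the file count is given as a lower bound, the true average is likely somewhat below this estimate.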

[Source] https://www.ithome.com/0/751/688.htm