News Title: "Hugging Face Open-Sources 'Cosmopedia', the Largest Open Synthetic AI Dataset to Date"
Keywords: Open-source Dataset, Artificial Intelligence, Text Generation
News Content: Hugging Face recently announced the open-source release of an AI training dataset named "Cosmopedia," which it describes as the largest open synthetic dataset to date. The dataset was generated with the Mixtral 8x7B model and contains more than 30 million text files totaling 25 billion tokens. It covers a range of content types, including textbooks, blog posts, stories, and WikiHow tutorials. The release gives artificial intelligence researchers an openly available data resource for training and improving AI models.
Source: https://www.ithome.com/0/751/688.htm