News Title: Hugging Face Open-Sources “Cosmopedia,” Billed as the World's Largest Synthetic AI Dataset

Keywords: Open-source Dataset, Artificial Intelligence, Text Generation

News Content: Hugging Face recently announced the open-source release of “Cosmopedia,” an AI training dataset billed as the largest synthetic dataset to date. The dataset was generated with the Mixtral 8x7B model and contains more than 30 million text files totaling 25 billion tokens. It covers a variety of content types, including textbooks, blog posts, stories, and WikiHow tutorials. The release gives researchers an open data resource for training and improving AI models.

Source: https://www.ithome.com/0/751/688.htm
