

Title: The world's largest open-source AI training dataset
Keywords: Open-source dataset, AI training, 30 million texts
News content: IT Home reported that Hugging Face recently open-sourced an AI training dataset called "Cosmopedia", which it claims is the largest of its kind in the world. The dataset was generated synthetically with the Mixtral-8x7B large language model and contains more than 30 million text files covering textbooks, blog articles, novels, tutorials, and more, for a total of roughly 25 billion tokens.
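For readers who want to inspect such a corpus directly, the sketch below shows how a large Hub dataset of this kind could be streamed with the Hugging Face datasets library. The repository ID "HuggingFaceTB/cosmopedia", the subset name "stories", and the "text" field are assumptions based on common Hub conventions, not details stated in the article.

# Minimal sketch: streaming a large Hub dataset without downloading it in full.
# The repo ID, subset name, and field name below are assumptions, not facts
# taken from the news article.
from datasets import load_dataset

# Stream the data so the ~25-billion-token corpus is read lazily over the network.
ds = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train", streaming=True)

# Peek at the first few records; inspect ds.features to confirm the actual field names.
for i, example in enumerate(ds):
    print(example["text"][:200])
    if i == 2:
        break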
Industry analysts believe this open-source release will significantly accelerate AI research and development. On the one hand, massive text corpora are an essential foundation for training large language models; on the other hand, a synthetic dataset with rich, varied content and styles can also improve performance on downstream tasks. That said, quality control for synthetic data deserves close attention.
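As a concrete illustration of the quality-control point above, the following toy filter drops synthetic documents that are too short or exact duplicates. It is only a sketch of one possible check, not a description of how Cosmopedia itself was filtered; the length threshold and hashing scheme are arbitrary choices made for this example.

import hashlib

def keep_example(text: str, seen_hashes: set, min_chars: int = 200) -> bool:
    # Reject documents shorter than min_chars characters (arbitrary threshold).
    if len(text) < min_chars:
        return False
    # Reject exact duplicates via a hash of the normalized text.
    digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

# Example usage over an in-memory list of synthetic documents.
seen: set = set()
docs = ["A generated textbook chapter about photosynthesis ... " * 10,
        "A generated textbook chapter about photosynthesis ... " * 10,
        "too short"]
print([keep_example(d, seen) for d in docs])  # -> [True, False, False]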
Experts say that open sharing of language data reflects R&D organizations' social responsibility to advance the industry as a whole, and will also push relevant regulatory policies to keep pace. Provided that technical safety and ethical norms are upheld, openness remains an important driving force for AI development.

Source: https://www.ithome.com/0/751/688.htm

