Mobvoi Open-Sources Training Dataset for Its Very Large Language Model

Author: 智能小编

March 6, 2024 #Chinese corpus, #open-source dataset, #daily AI news


Mobvoi Releases the First Open-Source Dataset for “Sequence Monkey”

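On March 1, 2023 (Beijing time), Mobvoi (出门问问) announced that it was opening to the public part of the training data for its very large language model “Sequence Monkey” (序列猴子), under the name “Sequence Monkey Open-Source Dataset 1.0.”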

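“Sequence Monkey Open-Source Dataset 1.0” comprises a general Chinese text corpus, a corpus of classical poems with modern Chinese translations, and a text generation corpus. The general Chinese text corpus covers news, novels, encyclopedias, and other text types, totaling tens of billions of characters; the classical poetry corpus contains tens of thousands of classical poems with their modern Chinese translations; and the text generation corpus contains a large number of high-quality text generation samples.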

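Mobvoi said that releasing “Sequence Monkey Open-Source Dataset 1.0” is intended to advance academic research and industrial applications in natural language processing. Researchers and developers can use the dataset to train and evaluate their own language models and to develop new natural language processing techniques and applications.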

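“Sequence Monkey” is a very large language model developed in-house by Mobvoi. It has more than 100 billion parameters and, according to the company, has achieved industry-leading results on Chinese language understanding and generation tasks. The model is already widely used in Mobvoi products such as intelligent Q&A, intelligent customer service, and intelligent recommendation.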

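Mobvoi said it will continue to release more of the “Sequence Monkey” training data in the future, contributing to the development of natural language processing.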

English version:

**Headline:** Mobvoi Opens Up Training Dataset for Its Large-Scale Language Model

**Keywords:** Open-source dataset, language model, Chinese corpus

**News Content:**

Mobvoi Open-Sources the First Dataset for Its “Sequence Monkey” Model

On March 1, 2023 (Beijing time), Mobvoi announced the public release of a portion of the training dataset for its large-scale language model, “Sequence Monkey,” under the name “Sequence Monkey Open-Source Dataset 1.0.”

“Sequence Monkey Open-Source Dataset 1.0” includes a general Chinese text corpus, a corpus of classical Chinese poetry with modern Chinese translations, and a text generation corpus. The general Chinese text corpus covers various text types, such as news, novels, and encyclopedias, and contains tens of billions of characters. The classical poetry corpus includes tens of thousands of classical poems and their modern Chinese translations. The text generation corpus contains a large number of high-quality text generation samples.

Mobvoi stated that the release of “Sequence Monkey Open-Source Dataset 1.0” aims to promote academic research and industrial applications in the field of natural language processing. Researchers and developers can use the dataset to train and evaluate their own language models and develop new natural language processing technologies and applications.

“Sequence Monkey” is a large-scale language model independently developed by Mobvoi, with over 100 billion parameters. According to the company, it has achieved industry-leading performance in Chinese language comprehension and generation tasks. The model has been widely used in Mobvoi’s products, such as intelligent Q&A, intelligent customer service, and intelligent recommendation.

Mobvoi stated that it will continue to open up more of the training datasets for “Sequence Monkey” in the future, contributing to the development of the field of natural language processing.

Source: https://mp.weixin.qq.com/s/oSQR3gCCDpJ3Wdu-9iTcbA