**Headline: Zhiyuan Releases BGE-M3, a General-Purpose Embedding Model Covering Over 100 Languages**

**Keywords:** Universal model, multilingual, one-stop

**News Content:** Zhiyuan Research Institute Releases BGE-M3, a New-Generation General-Purpose Embedding Model

On March 1, 2023 (Beijing time), Zhiyuan Research Institute (the Beijing Academy of Artificial Intelligence, BAAI) released BGE-M3, a general-purpose semantic embedding model and the newest member of the BGE family. The model supports more than 100 languages, offers leading multilingual and cross-lingual retrieval, and handles input text at different granularities, from sentences and paragraphs to passages and full documents, with a maximum input length of 8192 tokens.

Unlike earlier models, BGE-M3 combines three retrieval modes in a single model: dense retrieval, sparse retrieval, and multi-vector retrieval. It is reported to reach state-of-the-art results on multiple evaluation benchmarks.
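To make the three retrieval modes concrete, here is a minimal sketch of how each one typically scores a query against a document. It is an illustration under stated assumptions, not BGE-M3's actual implementation: the function names, vectors, and term weights are all made up, and a real system would take them from the model's output.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_score(q_vec, d_vec):
    """Dense retrieval: one vector per text, compared by cosine similarity."""
    return dot(normalize(q_vec), normalize(d_vec))

def sparse_score(q_weights, d_weights):
    """Sparse retrieval: sum the weight products of terms both texts share."""
    shared = set(q_weights) & set(d_weights)
    return sum(q_weights[t] * d_weights[t] for t in shared)

def multi_vector_score(q_vecs, d_vecs):
    """Multi-vector retrieval (late interaction): for each query vector,
    take its best-matching document vector, then average those maxima."""
    d_norm = [normalize(v) for v in d_vecs]
    best = [max(dot(normalize(q), d) for d in d_norm) for q in q_vecs]
    return sum(best) / len(best)

# Hypothetical toy inputs standing in for real model output.
dense = dense_score([1.0, 0.0], [1.0, 1.0])
sparse = sparse_score({"vector": 0.5, "model": 0.2},
                      {"vector": 0.4, "search": 0.9})
multi = multi_vector_score([[1.0, 0.0], [0.0, 1.0]],
                           [[1.0, 0.0], [1.0, 1.0]])
print(round(dense, 3), round(sparse, 3), round(multi, 3))
```

Because the three scores come from different views of the same text (a single vector, term weights, and per-token vectors), a retrieval pipeline can use any one of them alone or combine them, for instance with a weighted sum.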

**Technical Advantages**

BGE-M3 is built on the Transformer architecture and has undergone extensive pre-training. The pre-training data spans many languages and domains, including news, fiction, encyclopedias, and code.

Through pre-training, BGE-M3 has learned rich semantic knowledge and language patterns. It can encode text as dense vectors, which can then be used for tasks such as text similarity calculation, semantic search, and question answering.
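As a rough illustration of semantic search over dense vectors, the sketch below ranks a tiny corpus by cosine similarity to a query. The three-dimensional vectors and document names are hand-made stand-ins (an assumption for the example); real embeddings from the model would have far more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings; in practice these would come from the model.
corpus = {
    "doc_weather": [0.9, 0.1, 0.0],
    "doc_cooking": [0.1, 0.9, 0.1],
    "doc_climate": [0.8, 0.0, 0.3],
}
query = [1.0, 0.0, 0.1]

results = top_k(query, corpus, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

A linear scan like this is fine for a handful of documents; larger collections usually rely on an approximate nearest-neighbor index instead.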

**Application Scenarios**

BGE-M3 has a wide range of application scenarios, including:

* Search engines: BGE-M3 can be used to build high-performance search engines that support multilingual and cross-lingual retrieval.
* Question answering systems: BGE-M3 can be used to build question answering systems, for example by retrieving the passages most relevant to a user's question.
* Text classification: BGE-M3 can be used for text classification tasks, with its embeddings serving as features for assigning text to categories.
* Text summarization: BGE-M3 can support text summarization tasks, for example by identifying the passages that best represent a document; as an embedding model, it does not generate the summary text itself.
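As one illustration of the text classification item above, embeddings can feed a very simple nearest-centroid classifier: average the vectors of each labeled class, then assign new text to the class whose centroid is closest. This is only a sketch under assumptions; the two-dimensional vectors, labels, and function names are hypothetical, and nothing here is specific to BGE-M3.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def fit_centroids(labeled_vectors):
    """Map each label to the mean of its example embeddings."""
    return {label: mean_vector(vecs) for label, vecs in labeled_vectors.items()}

def classify(vec, centroids):
    """Pick the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical embeddings of labeled training texts.
training = {
    "sports": [[0.9, 0.1], [0.8, 0.2]],
    "finance": [[0.1, 0.9], [0.2, 0.7]],
}
centroids = fit_centroids(training)

prediction = classify([0.7, 0.3], centroids)
print(prediction)
```

The same embeddings could equally be passed to any standard classifier; the centroid approach is shown only because it needs no training loop.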

**Impact**

The release of BGE-M3 marks a major breakthrough in general-purpose semantic embedding technology. The model is expected to be widely used across natural language processing and artificial intelligence, promoting the development and application of related technologies.

Source: https://mp.weixin.qq.com/s/y-c-EelxbSUMmrZNCeqeAA
