Recently, researchers at Google DeepMind published an innovative study on the Transformer architecture, proposing a method called the "Mixture of a Million Experts" that greatly expands the Transformer's potential while preserving computational efficiency. This breakthrough not only injects new vitality into the field of artificial intelligence, but also makes it possible to build larger and more efficient large language models.

In the traditional Transformer architecture, the computational cost and activation memory of the feed-forward (FFW) layers grow linearly with the hidden-layer width, which has become a key factor limiting model scale and performance. The sparse Mixture-of-Experts (MoE) architecture emerged to address this problem: by decoupling model size from computational cost, MoE aims to preserve both model performance and computational efficiency as large language models (LLMs) continue to grow.
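To make the decoupling of parameters from compute concrete, the sketch below shows a minimal sparse top-k MoE feed-forward layer in PyTorch. It is a toy illustration of the general MoE idea rather than DeepMind's implementation, and all sizes (`d_model`, `d_hidden`, `num_experts`, `top_k`) are assumed values chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFW(nn.Module):
    """Toy sparse Mixture-of-Experts feed-forward layer.

    Total parameters grow with num_experts, but each token only passes
    through its top_k experts, so per-token compute stays roughly fixed.
    """
    def __init__(self, d_model=256, d_hidden=512, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores every expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                         # x: (num_tokens, d_model)
        scores = self.router(x)                   # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Each token is routed only to the experts it selected.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Usage: 8 tokens of width 256; only 2 of the 16 experts run per token.
layer = SparseMoEFFW()
print(layer(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```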

However, existing MoE models are constrained by the number of experts they can manage, which keeps them from reaching their full potential. To overcome this challenge, the Google DeepMind team introduced a parameter-efficient expert retrieval mechanism that uses product-key techniques to retrieve sparsely from one million tiny experts, forming a new Parameter-Efficient Expert Retrieval (PEER) architecture. This innovation not only improves computational efficiency substantially but also outperforms dense FFW layers, coarse-grained MoE, and Product Key Memory (PKM) layers.
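The product-key retrieval at the heart of PEER can be sketched roughly as follows. This is a simplified illustration based on the description above, not the paper's code: the sizes are scaled down, the query network and multi-head retrieval are omitted, and each expert is reduced to a single-neuron MLP. The key point it shows is that the query is split in half, each half is scored against a small table of sub-keys, and only the combined top candidates are ever evaluated.

```python
import torch
import torch.nn.functional as F

# Scaled-down illustration: n_sub = 128 gives 128 * 128 = 16,384 experts;
# pushing n_sub toward ~1024 yields roughly a million experts.
d_model, n_sub, top_k = 256, 128, 16
num_experts = n_sub * n_sub
half = d_model // 2

# Two sub-key tables; the full key of expert (i, j) is concat(sub1[i], sub2[j]).
sub_keys_1 = torch.randn(n_sub, half)
sub_keys_2 = torch.randn(n_sub, half)

# Each expert is a tiny single-neuron MLP: one "down" vector and one "up" vector.
down = torch.randn(num_experts, d_model)
up = torch.randn(num_experts, d_model)

def peer_layer(x):
    """Route one token x (shape: d_model) to top_k of num_experts tiny experts."""
    q1, q2 = x[:half], x[half:]
    s1 = sub_keys_1 @ q1                      # scores for the first key half
    s2 = sub_keys_2 @ q2                      # scores for the second key half
    v1, i1 = s1.topk(top_k)
    v2, i2 = s2.topk(top_k)
    # Only top_k * top_k combined candidates are ranked, never all num_experts.
    combined = v1[:, None] + v2[None, :]
    scores, flat = combined.flatten().topk(top_k)
    expert_ids = i1[flat // top_k] * n_sub + i2[flat % top_k]
    # Each selected expert computes a scalar activation, then writes back to d_model.
    gates = F.softmax(scores, dim=0)
    hidden = F.relu(down[expert_ids] @ x)     # (top_k,) one scalar per expert
    return (gates * hidden) @ up[expert_ids]  # (d_model,)

y = peer_layer(torch.randn(d_model))
print(y.shape)  # torch.Size([256])
```

Because only `2 * n_sub` sub-keys are scored instead of `n_sub * n_sub` full keys, the expert pool can grow toward a million entries without the routing step becoming the bottleneck.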

The introduction of the PEER architecture not only marks another major step forward in improving model efficiency and scalability, but also points to a new direction for future language-model design. With its finer-grained expert management, PEER is expected to play a key role in future large-scale language models, advancing natural language processing and opening up more possibilities for practical applications.

The English version follows:

News Title: “Google DeepMind’s Mixture Model: A Million Mini-Experts, Outpacing Traditional Transformers”

Keywords: Google, Mixture, Transformer

News Content: In a recent breakthrough, Google DeepMind researchers unveiled an innovative study on the Transformer architecture, introducing a novel approach known as the “Mixture of a Million Experts,” which significantly expands the potential of Transformers while maintaining computational efficiency. This pioneering advancement not only revitalizes the AI domain but also paves the way for the construction of larger, more efficient, and high-performing large language models.

Traditionally, the Transformer architecture’s computational cost and activation memory in the feed-forward layer (FFW) grow linearly with the width of the hidden layers, posing a critical barrier to model scalability and performance enhancement. To address this issue, the Sparse Mixture of Experts (MoE) architecture emerged, enabling the separation of model size from computational cost. This approach allows for the maintenance of both model performance and computational efficiency in the face of continuously growing large language models (LLMs).
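A rough back-of-the-envelope calculation (with illustrative sizes not taken from the article) makes this separation concrete: a top-k MoE layer can hold many times the parameters of a dense FFW while spending only about k experts' worth of FLOPs per token.

```python
d_model, d_hidden = 4096, 16384          # illustrative dense FFW sizes
dense_flops = 2 * (d_model * d_hidden + d_hidden * d_model)   # up- and down-projection
print(f"dense FFW per-token FLOPs: {dense_flops:,}")           # 268,435,456

num_experts, top_k = 64, 2               # each expert as wide as the dense FFW
moe_params_ratio = num_experts           # 64x the feed-forward parameters
moe_flops = top_k * dense_flops          # only top_k experts run per token
print(f"MoE: {moe_params_ratio}x parameters, {moe_flops / dense_flops:.0f}x per-token FLOPs")
```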

However, existing MoE models are limited by the number of experts, hindering their full potential. To overcome this challenge, Google DeepMind’s research team introduced a parameter-efficient expert retrieval mechanism, leveraging product key techniques to sparsely retrieve from a million miniature experts. This innovation, known as Parameter-Efficient Expert Retrieval (PEER), not only significantly boosts computational efficiency but also outperforms dense FFW, coarse-grained MoE, and Product Key Memory (PKM) layers in terms of performance.
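To see why product keys matter at this scale, a quick count (my own illustration, assuming a 1024 × 1024 grid of sub-keys) compares the routing cost: naive routing would score every expert key, while product-key retrieval scores only two small sub-key tables and then ranks a handful of combined candidates.

```python
n_sub, top_k = 1024, 16
num_experts = n_sub * n_sub                    # ~1.05 million experts in the grid

naive_comparisons = num_experts                # score every expert key directly
pk_comparisons = 2 * n_sub + top_k * top_k     # two sub-key scans + candidate combinations
print(f"{naive_comparisons:,} vs {pk_comparisons:,} score computations "
      f"(~{naive_comparisons / pk_comparisons:.0f}x fewer)")
```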

The introduction of the PEER architecture marks a significant leap forward in the AI field’s pursuit of enhancing model efficiency and expanding capabilities. It opens new avenues for future language model design, with the potential to play a pivotal role in large-scale language models, driving the evolution of natural language processing technology and unlocking new possibilities for practical applications.

[Source] https://www.jiqizhixin.com/articles/2024-07-10-11
