Body:
Recently, a research team from Apple and Meta AI released a new study aimed at improving the inference efficiency of large language models (LLMs). The work, named LazyLLM, proposes a new dynamic token pruning technique that more than doubles Llama 2's prefilling-stage inference speed while keeping any drop in accuracy negligible.
Traditional Transformer-based LLMs compute and cache the key-value (KV) states of every token in the prompt during the prefilling stage before predicting the first output token. The time this takes, known as the time-to-first-token (TTFT), usually accounts for a sizable share of overall LLM inference. However, the Apple and Meta AI researchers found that not all prompt tokens are needed to generate the first token. On the LongBench benchmark, they observed that the first generated token's attention scores over the input tokens are highly sparse, suggesting that many tokens are redundant.
Building on this observation, LazyLLM adopts a dynamic pruning strategy that lets the model select a different subset of tokens at each generation step. This progressive pruning significantly reduces computation in the prefilling stage, thereby lowering TTFT. Unlike static pruning, which still computes the KV cache for every token, LazyLLM begins pruning in the very first iteration of inference, substantially improving LLM inference efficiency.
The research team notes that LazyLLM is broadly applicable, requires no retraining of the model, and delivers significant gains. By comparing a standard LLM with LazyLLM, the researchers showed that LazyLLM improves inference speed while preserving the model's prediction accuracy.
The work offers a new angle on LLM inference optimization and may inform the acceleration of next-generation models such as Llama 3.1. As AI technology continues to advance, improving model inference efficiency matters greatly for the broader and deeper adoption of AI applications across domains.
English version:
News Title: "Apple and Meta AI Achieve New Breakthrough: 2x LLM Acceleration Without Sacrificing Accuracy"
Keywords: Apple, Large Model, LazyLLM
News Content:
Title: Apple and Meta AI Research Team Proposes New Method to Boost Llama 2’s Pre-filling Inference Speed
Recent research by the Apple and Meta AI research team has unveiled a novel approach aimed at enhancing the inference efficiency of large language models (LLMs). The study, dubbed LazyLLM, introduces a dynamic token pruning technique that can more than double the inference speed of Llama 2's prefilling phase with no significant loss of accuracy.
Traditional Transformer-based LLMs calculate and store the KV cache for every token in the prompt during the prefilling stage before predicting the first output token. The time this takes, known as the time-to-first-token (TTFT), typically accounts for a significant portion of overall inference latency. However, the researchers from Apple and Meta AI discovered that not all prompt tokens are necessary for generating the first token. On the LongBench benchmark, they observed that the first generated token's attention scores over the input tokens are very sparse, indicating that many tokens may be redundant.
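To make the sparsity observation concrete, below is a minimal sketch assuming a locally available Hugging Face causal LM; the checkpoint name, prompt, and 1% threshold are illustrative choices, not details from the paper. It runs a single prefill pass and measures how little attention the final prompt position (which conditions the first generated token) pays to most input tokens.

# Minimal sketch of measuring first-token attention sparsity.
# Assumptions: torch + transformers installed; checkpoint and threshold are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Summarize the following document: ..."
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    # Prefill pass; request attention maps so the last prompt position can be inspected.
    out = model(**inputs, output_attentions=True)

last_layer = out.attentions[-1][0]          # (num_heads, seq_len, seq_len)
scores = last_layer[:, -1, :].mean(dim=0)   # attention from the last position, head-averaged
sparsity = (scores < 0.01).float().mean()   # fraction of prompt tokens getting <1% attention
print(f"{sparsity:.0%} of prompt tokens receive almost no attention from the last position")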
Building on this observation, the LazyLLM method employs a dynamic pruning strategy that allows the model to select different subsets of tokens at different generation steps. This progressive pruning technique significantly reduces the computational load of the prefilling stage, thereby lowering TTFT. Unlike static pruning methods, which still compute the KV cache for every token, LazyLLM begins pruning in the very first iteration of inference, substantially improving LLM inference efficiency.
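As a rough illustration of how attention-guided progressive pruning can work in principle, the toy function below keeps only the prompt positions that receive the most attention from the current position and hands the reduced set to the next layer. This is a sketch under our own assumptions (head-averaged importance, a fixed keep_ratio), not the authors' implementation.

# Toy sketch of progressive, attention-guided token pruning; not LazyLLM's exact rule.
import torch

def progressive_prune(hidden: torch.Tensor, attn: torch.Tensor, keep_ratio: float = 0.5):
    """hidden: (seq_len, dim) layer output; attn: (num_heads, seq_len, seq_len) weights.
    Returns pruned hidden states plus the kept token indices, in original order."""
    # Importance of each token = attention it receives from the last (current) position.
    importance = attn[:, -1, :].mean(dim=0)                 # (seq_len,)
    k = max(1, int(importance.numel() * keep_ratio))
    keep = torch.topk(importance, k).indices.sort().values  # preserve token order
    return hidden[keep], keep

# Usage with dummy tensors standing in for one transformer layer's outputs.
seq_len, dim, num_heads = 128, 64, 8
hidden = torch.randn(seq_len, dim)
attn = torch.softmax(torch.randn(num_heads, seq_len, seq_len), dim=-1)
pruned, kept = progressive_prune(hidden, attn, keep_ratio=0.3)
print(pruned.shape, kept[:10])  # fewer tokens flow into the next layer

Because later layers then process fewer tokens, the cost of the prefilling pass shrinks, which is where the TTFT savings come from.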
The research team reports that LazyLLM is broadly applicable, requires no retraining of the model, and delivers substantial gains. By comparing the performance of standard LLMs against LazyLLM, the researchers demonstrated that LazyLLM can improve inference speed without sacrificing model prediction accuracy.
This research offers a new perspective on optimizing the inference of LLMs and is expected to inspire the acceleration of Llama 3.1 and other next-generation models. As AI technology continues to evolve, improving the inference efficiency of models is crucial for the widespread and in-depth development of AI applications across various fields.
Source: https://www.jiqizhixin.com/articles/2024-08-02-3