### Microsoft's New Research: Tenfold Faster Inference on Million-Token Inputs with a Single GPU
Microsoft recently announced a research advance that uses optimized computation to make large language model (LLM) inference ten times faster on a single-GPU machine, on inputs of more than one million tokens. Beyond making large-scale text processing more efficient, the result reflects the arrival of the long-context era for LLMs, in which supported context windows have grown from 128K tokens to the 10M scale.
Larger context windows, however, make processing the input prompt (the pre-filling stage) far more expensive. Because attention has quadratic complexity in the input length, the time the model needs to produce its first token becomes very long, which limits the practical use of long-context LLMs. For example, when serving LLaMA-3-8B on a single machine equipped with A100 GPUs, a prompt of 300,000 tokens takes 6 minutes to pre-fill, and a prompt of 1,000,000 tokens takes 30 minutes. Self-attention accounts for more than 90% of the total pre-filling latency, making it the main bottleneck in long-context processing.
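To make the quadratic growth concrete, the following minimal sketch counts the query-key pairs that dense causal attention scores for a prompt of a given length. It is a back-of-the-envelope illustration only: the function names are hypothetical, the count is per attention head and per layer, and it says nothing about wall-clock time on real hardware.

```python
def causal_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by dense causal attention (one head, one layer).

    Each query at position i attends to keys 0..i, so the total is
    1 + 2 + ... + n = n * (n + 1) / 2, which grows quadratically in n.
    """
    return n_tokens * (n_tokens + 1) // 2


def growth_factor(n_small: int, n_large: int) -> float:
    """How many times more pairs the longer prompt needs than the shorter one."""
    return causal_attention_pairs(n_large) / causal_attention_pairs(n_small)


if __name__ == "__main__":
    for n in (300_000, 1_000_000):
        print(f"{n:>9,} tokens -> {causal_attention_pairs(n):,} pairs")
    factor = growth_factor(300_000, 1_000_000)
    print(f"pair count grows by about {factor:.1f}x")
```

Going from 300,000 to 1,000,000 tokens lengthens the prompt by about 3.3 times but multiplies the number of scored pairs by about 11, which is why attention comes to dominate pre-filling cost as prompts get longer.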
To address this bottleneck, researchers from Microsoft and the University of Surrey proposed MInference, a sparse computation method. MInference is designed to allocate and use compute more efficiently, substantially accelerating the pre-filling stage for long sequences. The method reduces latency while maintaining high accuracy, making long-context LLMs more practical to serve. As it is adopted, it could improve the responsiveness of language models in text generation, natural language understanding, and dialogue systems.
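The article does not say which attention entries MInference keeps, so the sketch below is not its algorithm. It only illustrates the general idea behind sparse attention under an assumed, hypothetical pattern: each query attends to a few leading tokens (`sink`) plus a local window of recent tokens (`window`), and the code compares the number of scored pairs with the dense causal case.

```python
def dense_causal_pairs(n_tokens: int) -> int:
    """Pairs scored by dense causal attention: query i sees keys 0..i."""
    return n_tokens * (n_tokens + 1) // 2


def sparse_pairs(n_tokens: int, sink: int, window: int) -> int:
    """Pairs scored under a hypothetical 'leading tokens + local window' mask.

    Query i keeps keys 0..sink-1 and the last `window` keys ending at i,
    never looking past position i. Keys covered by both ranges are
    counted once.
    """
    total = 0
    for i in range(n_tokens):
        sink_count = min(sink, i + 1)            # leading keys visible to query i
        local_start = max(0, i - window + 1)     # first key of the local window
        local_count = i - local_start + 1        # keys local_start..i
        overlap = max(0, sink_count - local_start)
        total += sink_count + local_count - overlap
    return total


if __name__ == "__main__":
    n = 1_000_000
    dense = dense_causal_pairs(n)
    sparse = sparse_pairs(n, sink=1024, window=4096)  # illustrative budget only
    print(f"dense:  {dense:,} pairs")
    print(f"sparse: {sparse:,} pairs")
    print(f"fraction of dense work kept: {sparse / dense:.4%}")
```

With a fixed budget per query, the number of scored pairs grows roughly linearly with prompt length rather than quadratically. Whether accuracy holds depends on choosing which entries to keep, which is the part a method such as MInference has to solve and which this sketch does not attempt.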
The work illustrates how much technical innovation contributes to progress in AI, and it gives other researchers in the field useful insights and a practical example. As the technology continues to advance, there is reason to expect that future AI applications will serve people's needs more closely and bring more convenience and innovation to society.
Source: https://www.jiqizhixin.com/articles/2024-07-08-17