News Title: “SGLang Accelerates Llama 405B Inference, Outperforming vLLM and TensorRT-LLM”
Keywords: SGLang, Llama 405B, Accelerate Inference
News Content: The AI field has recently seen a significant breakthrough with the launch of SGLang Runtime v0.2 by the LMSYS Org team. The system demonstrates notable performance advantages when running large language models and vision language models (LLMs and VLMs), particularly Meta’s newly open-sourced Llama 3.1 405B model, where it surpasses vLLM and TensorRT-LLM in both throughput and latency, in some scenarios reaching 2.1x and 3.8x their respective throughput. The achievement has been highly praised by renowned AI researcher and Lepton AI co-founder and CEO Yangqing Jia.
The LMSYS Org team, composed of students and faculty from UC Berkeley, UC San Diego, and Carnegie Mellon University, developed SGLang Runtime v0.2 as a general-purpose serving engine that provides an efficient solution for running large language models and vision language models. Its release marks significant progress in model inference speed, particularly for large-scale models, where it outperforms existing solutions on the market.
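As a concrete illustration of what a serving engine does in practice, the sketch below follows the pattern in SGLang’s public documentation: launch a local server, then query it through its OpenAI-compatible API. The model path, port, and prompt here are illustrative assumptions, not details taken from this article.

```python
# Minimal client sketch against a local SGLang server, assuming the
# OpenAI-compatible endpoint described in SGLang's documentation.
# Start the server first (model path and port are illustrative):
#   python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="default",  # SGLang serves the loaded model under the name "default"
    messages=[{"role": "user", "content": "List three US national parks."}],
    temperature=0,
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, existing client code can typically be pointed at an SGLang server by changing only the base URL.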
In his comments, Yangqing Jia said: “I have always been amazed by my doctoral alma mater, UC Berkeley, for consistently delivering state-of-the-art AI and systems co-design work. We saw SGLang in use last year, and now it has become even better. Can’t wait to deploy the new SGLang in our products and try it out!” This reflects not only the significant performance gains of SGLang Runtime v0.2 but also its potential and value in real-world applications.
The LMSYS Org team built SGLang Runtime v0.2 to meet growing demand for online serving, a need they came to appreciate while operating the large-model evaluation platform Chatbot Arena, which drove home the importance of efficient serving for AI products and research. Through continuous optimization of the underlying serving stack, evolving from high-level multi-model serving frameworks such as FastChat to successive iterations of SGLang Runtime (SRT), SGLang Runtime v0.2 emerged with the aim of providing a user-friendly, easily customizable, and top-performing solution.
Compared with existing options such as TensorRT-LLM, vLLM, MLC-LLM, and Hugging Face TGI, SGLang Runtime consistently delivers superior or competitive performance across models from Llama-8B to Llama-405B, on both A100 and H100 GPUs, in FP8 and FP16, and in both online and offline scenarios. In certain benchmark settings its throughput even exceeds that of TensorRT-LLM and vLLM, underscoring its performance advantage.
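To make the throughput comparison concrete, here is a minimal, hypothetical measurement sketch, not the benchmark methodology used by LMSYS Org: it fires concurrent requests at an OpenAI-compatible endpoint (such as those exposed by SGLang or vLLM) and reports output tokens per second. The endpoint URL, concurrency level, and prompt mix are all illustrative assumptions.

```python
# Hypothetical throughput probe against an OpenAI-compatible serving endpoint.
# All parameters below (URL, worker count, prompts, max_tokens) are illustrative.
import time
from concurrent.futures import ThreadPoolExecutor

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

def one_request(prompt: str) -> int:
    """Send one chat completion and return the number of generated tokens."""
    resp = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=256,
    )
    return resp.usage.completion_tokens

prompts = [f"Summarize the number {i} in one sentence." for i in range(64)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:
    generated = sum(pool.map(one_request, prompts))
elapsed = time.perf_counter() - start

print(f"{generated} output tokens in {elapsed:.1f}s "
      f"-> {generated / elapsed:.1f} tok/s output throughput")
```

Published comparisons like the ones cited above typically also control for batch size, prompt/output length distributions, and warm-up; a simple probe like this only gives a rough, single-configuration number.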
SGLang Runtime v0.2 is open source under the Apache 2.0 license, inviting more developers to help optimize and improve the system and fostering innovation across the AI field. Deployments at Databricks, at startups, and in research institutions have already achieved rapid iteration over trillions of tokens, demonstrating its efficiency and reliability in real-world scenarios.
The release of SGLang Runtime v0.2 not only marks another major breakthrough in model-acceleration technology but also gives developers and researchers a more efficient and flexible tool, accelerating AI innovation and promising to enable and optimize more application scenarios in the future.
Source: https://www.jiqizhixin.com/articles/2024-07-27-3