从零构建700亿参数大模型：Imbue的自研基础设施揭秘

AI初创公司Imbue近日公开了其从零构建700亿参数量语言模型（LLM）的基础设施搭建过程，该模型在推理任务上表现出色，超越了零样本的GPT-4。Imbue致力于通过深入理解机器的思维方式来实现通用智能，其团队通过几个月的努力，成功构建了一套用于训练大规模语言模型的基础设施，并在过程中积累了丰富的经验。

Imbue的基础设施搭建始于组合初始集群和安装操作系统，随后团队面临了一系列挑战，包括配置各台机器、设置InfiniBand网络以确保机器健康、诊断和解决训练过程中遇到的问题，以及改进基础设施工具。通过与Voltage Park的合作，Imbue团队成功准备了包含4088台H100 GPU的集群，这些GPU分布在511台计算机上，每台计算机配备8台GPU，形成了一张完全无阻塞的三层InfiniBand网络架构，实现了高效率的数据传输。

为了确保基础设施的稳定性和可靠性，Imbue团队开发了一系列脚本，以辅助监控、检查和纠错。整个过程中，他们特别强调了InfiniBand网络在训练过程中通信的重要性，因为相较于以太网，InfiniBand能提供更高的数据传输速度和更低的延迟，从而加速模型训练过程。

Imbue的这一成就不仅展示了构建大规模语言模型所需的技术挑战和创新解决方案，也为其他研究者和团队提供了宝贵的参考和借鉴，其公开的基础设施搭建过程和工具脚本有望加速人工智能领域的研究和应用。

英语如下：

News Title: “Building a 70B Parameter Giant Model from Scratch: Unveiling Imbue’s Homegrown Infrastructure”

Keywords: Bare Metal Clusters, Large Model Training, Automated Recovery

News Content: Title: From Bare Metal to a 70B Parameter Giant Model: AI Startup Imbue Reveals the Full Infrastructure Construction Process

AI startup Imbue has recently disclosed its process of constructing a language model (LLM) with 70 billion parameters from the ground up. This model excels in inference tasks, surpassing zero-shot GPT-4. Imbue is dedicated to achieving general intelligence by deeply understanding the cognitive processes of machines, and its team, after months of effort, successfully built an infrastructure for training large-scale language models. Throughout this process, they have gained rich experience.

The construction of Imbue’s infrastructure began with assembling a cluster of bare metal and installing an operating system. Subsequently, the team faced numerous challenges, including configuring each machine, setting up InfiniBand networks to ensure machine health, diagnosing and resolving issues encountered during training, and enhancing the foundational tools. Through collaboration with Voltage Park, Imbue’s team successfully prepared a cluster consisting of 4088 H100 GPUs, spread across 511 computers, with each computer equipped with 8 GPUs. This setup features a fully non-blocking three-layer InfiniBand network architecture, enabling efficient data transfer.

To ensure the stability and reliability of the infrastructure, Imbue developed a series of scripts to assist with monitoring, inspection, and error correction. Throughout the process, they emphasized the importance of InfiniBand networks in communication during training, as compared to Ethernet, InfiniBand offers higher data transmission speeds and lower latency, thus accelerating the model training process.

Imbue’s achievement not only showcases the technical challenges and innovative solutions required for building large-scale language models but also serves as a valuable reference and inspiration for other researchers and teams. The open disclosure of their infrastructure construction process and tool scripts is expected to expedite research and applications in the field of artificial intelligence.

【来源】https://www.jiqizhixin.com/articles/2024-07-24-4