News Title: “FlashAttention-3: A New Leap in Accelerating Large Model Computations”
Keywords: NVIDIA, FlashAttention-3, GPU Optimization
News Content: The research team behind FlashAttention, the attention algorithm originally proposed at Stanford, has announced FlashAttention-3, the latest version of the algorithm, developed together with researchers from NVIDIA and other collaborators. The new release further optimizes the algorithm, its parallelization, and its work partitioning, aiming to substantially improve the performance of large language models (LLMs), particularly when scaling up the model context window.
In the AI domain, the Transformer architecture is highly favored for its performance, especially in language understanding and generation tasks. However, as models grow, the time and memory cost of the attention layers grows quadratically with sequence length, which makes it hard to extend the model context window on limited hardware. This is where the original FlashAttention came in: by reordering the attention computation and using tiling and recomputation, it reduced the memory usage of attention from quadratic to linear in the sequence length, significantly improving computational efficiency.
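To make the tiling idea concrete, here is a minimal NumPy sketch of attention computed one key/value block at a time with an online softmax, so the full N x N score matrix is never materialized. This only illustrates the general technique, not the actual FlashAttention GPU kernel; the function name tiled_attention and the block_size parameter are assumptions made for this example.

```python
# Conceptual sketch of tiled attention with an online softmax (illustrative only).
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """softmax(Q K^T / sqrt(d)) V computed block by block over the keys/values,
    keeping only per-row running statistics instead of the full score matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax normalizer per query row

    for start in range(0, K.shape[0], block_size):
        Kb, Vb = K[start:start + block_size], V[start:start + block_size]
        S = (Q @ Kb.T) * scale                      # scores for this block only
        new_max = np.maximum(row_max, S.max(axis=1))
        corr = np.exp(row_max - new_max)            # rescale previous partial results
        P = np.exp(S - new_max[:, None])            # unnormalized probabilities
        out = out * corr[:, None] + P @ Vb
        row_sum = row_sum * corr + P.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against the naive quadratic-memory implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```

The recomputation part of FlashAttention (re-deriving attention blocks during the backward pass instead of storing them) is not shown in this forward-only sketch.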
In 2023 the same team released FlashAttention-2, rewritten from scratch on top of NVIDIA's CUTLASS 3.x, which pushed the optimization of the algorithm, parallelization, and work partitioning further and improved both computational efficiency and memory utilization. FlashAttention-3 is the culmination of this line of work. It accelerates attention on Hopper GPUs with three core techniques: overlapping overall computation and data movement through warp specialization; interleaving block-wise matmul and softmax operations to reduce computational bottlenecks; and exploiting hardware support for FP8 low precision to push performance further.
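The warp-specialization point is, at heart, a producer/consumer pipeline: one group of warps streams the next K/V tile in from memory while another group runs the matmul and softmax on the tile that has already arrived. The CPU-side Python sketch below mirrors only that scheduling idea, using a background thread as the "producer"; it is not how the Hopper kernel is actually written (which relies on warp-specialized warpgroups, asynchronous copies, and tensor-core matmuls), and the names load_tile and pipelined_attention are illustrative assumptions.

```python
# Conceptual sketch of overlapping data movement with compute (illustrative only).
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def load_tile(K, V, start, block):
    # Stand-in for asynchronously copying one K/V tile from slow to fast memory.
    return K[start:start + block].copy(), V[start:start + block].copy()

def pipelined_attention(Q, K, V, block=64):
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    acc = np.zeros_like(Q)   # running sum of exp(S) @ V
    norm = np.zeros(n)       # running softmax normalizer
    # (The max-stabilization from the previous sketch is omitted here for brevity.)

    with ThreadPoolExecutor(max_workers=1) as producer:
        next_tile = producer.submit(load_tile, K, V, 0, block)   # prefetch first tile
        for start in range(0, K.shape[0], block):
            Kb, Vb = next_tile.result()                          # current tile is ready
            if start + block < K.shape[0]:
                # Issue the next load so it overlaps with the compute below,
                # the way producer warps run ahead of consumer warps.
                next_tile = producer.submit(load_tile, K, V, start + block, block)

            # "Consumer" work: block matmul followed by the softmax accumulation.
            P = np.exp((Q @ Kb.T) * scale)
            acc += P @ Vb
            norm += P.sum(axis=1)

    return acc / norm[:, None]
```

On a real GPU this kind of overlap is what hides memory latency behind tensor-core work; the sketch only reproduces the scheduling pattern, not the performance characteristics.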
The release of FlashAttention-3 marks another leap in computational efficiency and resource utilization for the AI field, and it provides stronger technical support for deploying large language models in practice. As the technique is adopted more widely, it is expected to further advance AI applications in natural language processing, dialogue systems, machine translation, and other areas, accelerating the spread and maturation of AI technology.
Source: https://www.jiqizhixin.com/articles/2024-07-12-6
Views: 18
Correction: FlashAttention, the new attention algorithm originally proposed at Stanford, has evolved again. The research team announced FlashAttention-2, rewritten completely from scratch using NVIDIA's CUTLASS 3.x.