News Title: “A New Trend in Large Models: Hardware-Accelerated Mixed-Precision Matrix Multiplication”

Keywords: On-Device Computing Power, Mixed-Precision Matrix Multiplication, Model Parameters

News Content:
The rapid development of artificial intelligence has driven demand for high-performance computing devices. As model parameter counts keep growing, so do the compute and memory requirements placed on end-user devices. To address this challenge, Microsoft Research Asia has introduced a series of technologies that let existing hardware directly support mixed-precision matrix multiplication, improving the inference efficiency of large models.

Researchers at Microsoft Research Asia developed the data type compiler Ladder and the algorithm T-MAC, which let hardware that currently supports only symmetric-precision computation run mixed-precision matrix multiplication directly. In tests, Ladder achieved speedups of up to 14.6x when supporting custom data types that GPUs do not natively support. On a Surface AI PC equipped with Qualcomm's latest Snapdragon X Elite chipset, T-MAC made large-model throughput on the CPU twice as fast as on the dedicated NPU accelerator. The researchers also designed the LUT Tensor Core hardware architecture, a streamlined design that lets hardware directly support a range of low-bit mixed-precision computations and offers a new direction for AI hardware design.

As low-bit quantization matures, deploying and running large models on edge devices increasingly relies on mixed-precision matrix multiplication, in which the two operands use different precisions, such as low-bit quantized weights multiplied by higher-precision activations. However, existing compute units in CPUs and GPUs typically support only symmetric computation, where both operands share the same data type, so they cannot run this kind of multiplication natively. Researchers at Microsoft Research Asia have proposed several solutions to this problem.
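The mismatch described above can be made concrete with a small sketch. It assumes a hypothetical storage layout (signed 4-bit weights packed two per byte, low nibble first, with one floating-point scale) and shows the common workaround of widening the weights to a type the arithmetic unit understands before multiplying. The function names are illustrative only, and this is not a description of how Ladder or T-MAC is implemented.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (-8 <= lo <= 7 and -8 <= hi <= 7):
            raise ValueError("value out of int4 range")
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Losslessly widen packed int4 values back to ordinary integers."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend: nibbles 8..15 encode -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


def mixed_precision_dot(packed_weights, scale, activations):
    """Dot product of int4 weights (sharing one float scale) with float activations."""
    weights = unpack_int4(packed_weights)  # widen first, then multiply as usual
    if len(weights) != len(activations):
        raise ValueError("weight and activation lengths differ")
    return scale * sum(w * a for w, a in zip(weights, activations))


weights = [-8, -1, 0, 7]
packed = pack_int4(weights)            # 4 weights stored in 2 bytes
result = mixed_precision_dot(packed, 0.5, [1.0, 2.0, 3.0, 4.0])
print(len(packed), unpack_int4(packed), result)
```

The widening step costs extra instructions and memory traffic on every multiply, which is the overhead that approaches with native or table-based support aim to avoid.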

Ladder losslessly converts data types that the hardware does not support into instructions on data types that it does support, enabling the hardware to run mixed-precision DNN computation. T-MAC takes a lookup table (LUT) based approach that lets existing hardware directly support mixed-precision matrix multiplication, achieving better acceleration at the software level. The proposed LUT Tensor Core hardware architecture, in turn, opens a new direction for the design of next-generation AI hardware.
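To give a feel for the lookup-table idea, here is a simplified sketch of a LUT-based matrix-vector product. It assumes unsigned low-bit integer weights and splits them into bit planes; for every group of four activations it precomputes all 16 possible subset sums, so each group of one-bit weights becomes a single table lookup. The group size, the function names, and the unsigned-weight assumption are illustrative choices, and this is not T-MAC's actual kernel.

```python
GROUP = 4  # activations covered by one lookup table (illustrative choice)


def build_tables(activations):
    """For each group of GROUP activations, precompute all 2**GROUP subset sums."""
    if len(activations) % GROUP:
        raise ValueError("length must be a multiple of GROUP")
    tables = []
    for start in range(0, len(activations), GROUP):
        group = activations[start:start + GROUP]
        table = [0.0] * (1 << GROUP)
        for index in range(1, 1 << GROUP):
            low = index & -index          # lowest set bit of index
            j = low.bit_length() - 1      # position of that bit
            # Reuse the already computed sum that lacks element j.
            table[index] = table[index ^ low] + group[j]
        tables.append(table)
    return tables


def lut_dot(weights, tables, bits):
    """Dot product of unsigned `bits`-bit weights with the activations encoded
    in `tables`, without any per-element weight-activation multiplication."""
    if len(weights) != len(tables) * GROUP:
        raise ValueError("weight length does not match the tables")
    total = 0.0
    for k in range(bits):                 # one pass per weight bit plane
        plane_sum = 0.0
        for g, table in enumerate(tables):
            index = 0
            for j in range(GROUP):
                bit = (weights[g * GROUP + j] >> k) & 1
                index |= bit << j
            plane_sum += table[index]     # one lookup replaces GROUP multiplies
        total += plane_sum * (1 << k)     # weight of bit plane k is 2**k
    return total


def lut_matvec(weight_rows, activations, bits=2):
    tables = build_tables(activations)    # built once, shared by every row
    return [lut_dot(row, tables, bits) for row in weight_rows]


activations = [1.0, 2.0, 3.0, 4.0, 0.5, 1.5, 2.5, 3.5]
weight_rows = [[3, 0, 1, 2, 1, 1, 0, 3],
               [0, 0, 0, 0, 2, 2, 2, 2]]
outputs = lut_matvec(weight_rows, activations, bits=2)
print(outputs)  # [26.5, 16.0]
```

Because the tables depend only on the activations, their cost is amortized across all weight rows, and the work per row scales with the number of weight bits rather than with a fixed multiplier width.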

Together, these technologies improve the inference efficiency of large models on edge devices and push AI hardware design forward, providing strong computational support for future intelligent devices and applications. As they mature and spread, future devices can be expected to become smarter and more efficient, giving users a richer experience.

Source: https://www.jiqizhixin.com/articles/2024-08-19-3