[Beijing, China] – iFLYTEK, in collaboration with Huawei, has announced a major advance in domestic computing power: large-scale cross-node expert-parallel inference of Mixture of Experts (MoE) models, running on a domestic computing cluster. The breakthrough marks a crucial step toward independent and controllable AI infrastructure in China.
The joint team from iFLYTEK and Huawei reached this milestone through deep software and hardware co-innovation, validating and deploying the solution on the Ascend cluster with substantial performance gains. The achievement makes them the first in the industry to offer a complete solution based on domestic computing power, following DeepSeek's announcement of its MoE model training and inference plan.
Key Technical Innovations:
- Operator Fusion: The team ran the Vector and Cube heterogeneous computing units in parallel during the MLA preprocessing stage. By fusing many small operators into atomic-level computing units, they eliminated the dispatch overhead of issuing each small operator separately, cutting MLA preprocessing latency by more than 50% (see the first sketch after this list).
- Hybrid Parallel Strategy and Communication Optimization: A hybrid TP (Tensor Parallelism) + EP (Expert Parallelism) scheme was constructed. TP was applied within each machine for the MLA computation layer, exploiting the high-speed intra-machine interconnect to reduce cross-machine communication overhead. An innovative layered scheduling of MoE experts distributed expert computation evenly across 64 cards; a customized AllToAll communication protocol improved expert data-exchange efficiency by 40%; and a dual-layer (cross-machine/intra-machine) communication architecture cut cross-machine traffic by 60% through layered optimization (see the second sketch after this list).
- Load Balancing: A routing-expert load-balancing algorithm was developed, keeping the load difference between cards below 10% and raising cluster throughput by 30% (see the third sketch after this list).
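To make the operator-fusion point concrete, here is a minimal latency model, not iFLYTEK's actual Ascend kernels: it assumes each small operator pays a fixed launch (dispatch) overhead, so fusing a chain of small operators into one atomic kernel pays that overhead only once. All names and numbers below are illustrative.

```python
# Hypothetical latency model for operator fusion (all numbers illustrative).
LAUNCH_OVERHEAD_US = 5.0  # assumed fixed cost to issue one kernel/operator

def unfused_latency(op_times_us):
    # Each small operator is issued separately and pays its own launch cost.
    return sum(t + LAUNCH_OVERHEAD_US for t in op_times_us)

def fused_latency(op_times_us):
    # One fused, atomic kernel: same compute, a single launch cost.
    return sum(op_times_us) + LAUNCH_OVERHEAD_US

# Hypothetical chain of small MLA-preprocessing ops (compute time in us).
ops = [2.0, 1.5, 1.0, 2.5, 1.0]
u, f = unfused_latency(ops), fused_latency(ops)
print(f"unfused: {u:.1f} us  fused: {f:.1f} us  reduction: {100 * (1 - f / u):.0f}%")
```

The shorter the individual operators, the larger the share of time spent on dispatch, which is why fusion pays off most in fine-grained preprocessing stages like MLA.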
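The dual-layer communication idea can be sketched with a small Monte Carlo model. The topology (8 nodes x 8 cards), expert count, placement, and random routing below are assumptions for illustration, not the published protocol; the point is that when a token's data crosses the network at most once per destination node and is fanned out over the fast intra-machine links, cross-machine traffic drops sharply compared with sending a copy to every remote expert card.

```python
import numpy as np

rng = np.random.default_rng(0)
NODES, CARDS_PER_NODE, TOP_K = 8, 8, 8   # assumed topology and top-k routing
CARDS = NODES * CARDS_PER_NODE           # 64 cards, as in the article
N_EXPERTS = 256                          # assumed total expert count
N_TOKENS = 10_000

expert_card = np.arange(N_EXPERTS) % CARDS       # assumed even expert placement
token_node = rng.integers(0, NODES, N_TOKENS)    # node holding each token

# Random top-k routing: k distinct experts per token (stand-in for the router).
routes = np.argsort(rng.random((N_TOKENS, N_EXPERTS)), axis=1)[:, -TOP_K:]

def cross_node_sends(dedup_per_node: bool) -> int:
    sends = 0
    for t in range(N_TOKENS):
        dst_nodes = expert_card[routes[t]] // CARDS_PER_NODE
        remote = dst_nodes[dst_nodes != token_node[t]]
        # Layered: one cross-machine transfer per destination node, then
        # intra-machine fan-out. Flat: one transfer per remote expert copy.
        sends += len(np.unique(remote)) if dedup_per_node else len(remote)
    return sends

flat, layered = cross_node_sends(False), cross_node_sends(True)
print(f"flat: {flat}  layered: {layered}  cut: {100 * (1 - layered / flat):.0f}%")
```

The exact savings depend on topology and routing skew, so this toy model will not reproduce the reported 60% figure, but it shows why the dual-layer design reduces cross-machine volume at all.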
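The routing-expert load-balancing algorithm itself is not described in the announcement. One standard approach it could resemble is greedy longest-processing-time (LPT) placement, which assigns the heaviest experts first to whichever card is currently least loaded. The sketch below uses synthetic, skewed per-expert loads; the card and expert counts match the article, everything else is assumed.

```python
import heapq
import numpy as np

rng = np.random.default_rng(1)
N_CARDS, N_EXPERTS = 64, 256
# Synthetic, skewed per-expert load (e.g., tokens routed to each expert).
load = rng.pareto(2.0, N_EXPERTS) + 1.0

# Greedy LPT: place the heaviest expert first, onto the least-loaded card.
heap = [(0.0, card) for card in range(N_CARDS)]
heapq.heapify(heap)
placement = {}
for e in np.argsort(load)[::-1]:
    card_load, card = heapq.heappop(heap)
    placement[int(e)] = card
    heapq.heappush(heap, (card_load + float(load[e]), card))

per_card = np.zeros(N_CARDS)
for e, card in placement.items():
    per_card[card] += load[e]
spread = (per_card.max() - per_card.min()) / per_card.mean()
print(f"load spread across cards: {spread:.1%} of mean")
```

Balanced placement matters because an expert-parallel step runs at the speed of its slowest card, so keeping the spread under 10% translates directly into higher cluster throughput.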
Performance Gains:
These innovations resulted in significant performance improvements on domestic computing power:
- Single-card static memory footprint fell to 1/4 of that in the dual-machine deployment, a 75% reduction.
- Expert computing density increased fourfold.
- Inference throughput increased 3.2-fold.
- End-to-end latency decreased by 50%.
Impact and Applications:
This breakthrough solution will be used to accelerate the training of iFLYTEK's Spark deep reasoning models, with an expected 200% increase in inference efficiency during training. Furthermore, the inference engine built on it enables efficient inference of DeepSeek V3 and R1 on domestic computing power.
iFLYTEK recently upgraded its Spark X1 deep reasoning model, reaching leading performance on a range of Chinese mathematical tasks and performance on par with DeepSeek R1 and OpenAI o1, despite having an order of magnitude fewer parameters than its industry peers. iFLYTEK emphasizes its commitment to the domestic ecosystem, noting that Spark X1 is currently the only deep reasoning model trained entirely on domestic computing power.
iFLYTEK plans to keep iterating and upgrading the Spark large model and, through this inference engine, to speed up the Spark APIs it offers developers, including the Spark large-model series API and related model APIs made available on the Star MaaS platform.
References:
- iFLYTEK Research. (2024). 科大讯飞联合华为率先实现国产算力大规模跨节点专家并行集群推理 [iFLYTEK and Huawei are first to achieve large-scale cross-node expert parallel cluster inference on domestic computing power].