
## Code: The New “Six Arts” for LLMs: Data Augmentation and Ability Multiplication

**Keywords:** Code, Large Language Models, Importance

**News Content:**

## Code Knowledge is Crucial! The Impact of Code Data in LLM Training Cannot be Ignored

In recent years, large language models (LLMs) have demonstrated powerful capabilities across various domains, with coding ability becoming an indispensable part of their “Six Arts.” However, the specific impact of code data on the performance of general-purpose LLMs has long lacked in-depth study. Recently, a study released by Cohere and other institutions revealed the profound influence of code data on LLM performance, providing new insights for LLM training.

Researchers evaluated a wide range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win rates. They found that code data significantly improves performance on non-code tasks. Initializing with a code pre-trained model markedly enhanced natural language performance: natural language reasoning improved by 8.2%, world knowledge by 4.2%, the generative win rate by 6.6%, and code performance improved 12-fold.
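As a rough illustration of what “initializing from a code pre-trained model” can look like in practice, the sketch below loads a code-pretrained causal language model and continues pre-training it on natural-language text. It is a hypothetical example: the checkpoint name, sample data, and hyperparameters are placeholders, not the setup used in the study.

```python
# Hypothetical sketch: continue pre-training a code-pretrained LM on natural-language text.
# The checkpoint name, sample text, and learning rate are placeholders, not the study's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "my-org/code-pretrained-lm"  # placeholder: any causal LM pre-trained mostly on code
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# A tiny stand-in for a natural-language pre-training corpus.
texts = ["The study evaluates natural language reasoning and world knowledge tasks."]
batch = tokenizer(texts, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
outputs = model(**batch, labels=batch["input_ids"])  # standard next-token (causal LM) loss
outputs.loss.backward()
optimizer.step()
```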

This study also highlighted the importance of code quality. The researchers found that using markup-style programming languages, code-adjacent datasets (e.g., GitHub commits), and synthetically generated code further improves pre-training performance. For instance, training on a higher-quality, synthetically generated code dataset improved natural language reasoning and code performance by 9% and 44%, respectively.
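To make the idea of a pre-training mixture that includes code-adjacent and synthetic code data more concrete, here is a minimal, purely illustrative sampler over weighted data sources. The source names and weights are invented for the example and are not the proportions reported in the paper.

```python
# Illustrative only: weighted sampling over pre-training data sources.
# Source names and weights are invented; they are not the mixture used in the study.
import random

mixture = {
    "web_text": 0.70,          # general natural-language documents
    "source_code": 0.15,       # raw code files
    "github_commits": 0.05,    # code-adjacent data (commit messages and diffs)
    "synthetic_code": 0.05,    # high-quality synthetically generated code
    "markup": 0.05,            # markup-style languages such as HTML/CSS
}

def sample_source(rng: random.Random) -> str:
    """Pick the next data source in proportion to its mixture weight."""
    sources, weights = zip(*mixture.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])
```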

Furthermore, the researchers found that including code data in the pre-training cooldown phase further improves performance across all tasks. Compared with a cooldown run that excluded code data, the cooldown run that included code data improved natural language reasoning, world knowledge, and code performance by 3.6%, 10.1%, and 20%, respectively.
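The “cooldown” phase referred to above is the final stretch of pre-training, during which the learning rate is annealed and higher-quality data can be upweighted. The snippet below is only a schematic of that idea with made-up step counts, learning rates, and mixture weights; it is not the schedule used in the study.

```python
# Schematic cooldown: anneal the learning rate while upweighting code in the data mixture.
# All numbers here are made up for illustration.
total_cooldown_steps = 1_000
base_lr, final_lr = 3e-4, 3e-5

# During cooldown, code's share of the mixture is raised relative to the main run.
cooldown_mixture = {"web_text": 0.6, "source_code": 0.2, "synthetic_code": 0.2}

for step in range(total_cooldown_steps):
    frac = step / total_cooldown_steps
    lr = base_lr + (final_lr - base_lr) * frac  # linear decay from base_lr to final_lr
    # ...sample a batch from cooldown_mixture and take one optimizer step at this lr...
```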

These research findings indicate that code is a crucial building block for generalization, extending far beyond coding tasks. Enhanced code quality significantly impacts performance, and investing in code quality and retaining code data during pre-training can yield positive results.

This study provides new insights into LLM training and points the way for future LLM development. As code data continues to accumulate and research deepens, LLMs are expected to further enhance their capabilities, bringing more benefits to human society.

[Source] https://www.jiqizhixin.com/articles/2024-08-22-7
