News Title: “New Breakthrough in Recurrent Language Models: Surpassing Transformer++ with Just Two Prompt Reads”
Keywords: Recurrent Models Surpass Transformer++, Constant Memory, JRT Improvements
News Content: In the field of artificial intelligence, language models have long been a focal point of research. Today's large language models are predominantly built on the Transformer architecture, but researchers are exploring alternatives, particularly recurrent language models, to improve performance in language modeling. Recently, researchers from Stanford University and the University at Buffalo found that, with a simple prompting strategy and a matching recurrent architecture design, recurrent language models can surpass the Transformer++ architecture while keeping memory usage constant.
The researchers introduced a prompting strategy called “Just-read-twice” (JRT), together with a corresponding recurrent architecture. The strategy repeats the context before the model generates an answer, allowing the model to store and use information more effectively. Experimental results show that JRT prompting delivers significant gains across multiple recurrent language models and in-context learning tasks, while achieving 11.9 times the throughput of FlashAttention-2.
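To make the prompting idea concrete, the sketch below shows what a “read twice” prompt could look like: the context is simply placed in the prompt twice before the question, so the model re-reads the relevant information after it already knows what will be asked. The helper name and template string are illustrative assumptions, not the exact format used by the authors.

```python
def jrt_prompt(context: str, question: str) -> str:
    """Build a 'read twice' style prompt by repeating the context
    before the question.

    Note: the template and function name here are illustrative
    assumptions; the paper's exact prompt format may differ.
    """
    return f"{context}\n\n{context}\n\n{question}"


# Hypothetical usage: a recall-style question over a short document.
context = "Alice's badge number is 4921. Bob's badge number is 7030."
question = "What is Bob's badge number?"
print(jrt_prompt(context, question))
```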
Additionally, the researchers proposed the JRT-RNN architecture, which improves both the quality and the efficiency of recurrent language models by modifying the training loss and adopting a linear attention formulation. Experiments show that JRT-RNN delivers significant quality improvements across different parameter scales, with 19.2 times the throughput of FlashAttention-2.
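For readers unfamiliar with why linear-attention recurrent models keep memory constant, the following is a minimal, generic causal linear-attention recurrence in Python (NumPy): the model carries only a fixed-size state that is updated once per token, instead of a key/value cache that grows with context length. This is a textbook-style sketch under that generic formulation, not the authors' JRT-RNN implementation; the function names and feature map are assumptions.

```python
import numpy as np


def phi(x):
    """Non-negative feature map, a common choice in linear attention
    (an assumption here, not necessarily the one used in JRT-RNN)."""
    return np.maximum(x, 0.0) + 1e-3


def linear_attention_recurrent(queries, keys, values):
    """Toy causal linear attention computed as a recurrence.

    Memory stays constant: only a (d_k, d_v) state matrix S and a
    (d_k,) normalizer z are kept, regardless of sequence length.
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of phi(k_t) v_t^T
    z = np.zeros(d_k)          # running sum of phi(k_t)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        S += np.outer(fk, v)   # constant-size state update
        z += fk
        fq = phi(q)
        outputs.append(S.T @ fq / (z @ fq + 1e-6))
    return np.stack(outputs)


# Example with arbitrary dimensions: 8 tokens, d_k=16, d_v=32.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((8, 16))
v = rng.standard_normal((8, 32))
print(linear_attention_recurrent(q, k, v).shape)  # (8, 32)
```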
This research not only showcases the potential of recurrent language models but also offers a new architecture design and training method for the AI field. As research progresses, recurrent language models are poised to play a more significant role in future AI applications.
Source: https://www.jiqizhixin.com/articles/2024-08-04-7