ByteDance's Large-Model Simultaneous Interpretation Agent: Accurate Real-Time Translation on Par with Humans

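With the rapid development of artificial intelligence (AI), and especially the strong performance of large language models (LLMs) on natural language processing tasks, the problem of AI simultaneous interpretation has drawn wide attention. Traditional simultaneous interpretation software uses a cascaded model that runs automatic speech recognition (ASR) first and machine translation (MT) second. This approach suffers from significant error propagation, because recognition mistakes are passed on to the translation step, and low-latency requirements restrict it to smaller, weaker models that struggle with complex and varied real-world scenarios.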

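Recently, the ByteDance Research team introduced an end-to-end simultaneous interpretation agent, Cross Language Agent – Simultaneous Interpretation (CLASI), whose translation quality approaches that of professional human interpreters. CLASI uses an end-to-end architecture that avoids the error propagation of cascaded models. It combines the Doubao foundation model, the speech understanding capabilities of the Doubao large-model speech team, and the ability to retrieve knowledge from external sources, yielding a simultaneous interpretation system comparable to human interpreters.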

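In practice, CLASI produces fluent, accurate, and idiomatic translations for very fast speech, hard-to-pronounce tongue twisters, elegant classical Chinese, and improvised casual conversation. It performs especially well in meeting scenarios, showing that it adapts to different domains and complex settings.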

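For the Chinese-to-English and English-to-Chinese directions, the research team invited professional simultaneous interpreters to carry out human evaluation using the same metric applied to human interpreters: the proportion of valid information. The results show that CLASI leads commercial systems and open-source SOTA systems by a wide margin, and on some test sets it matches or exceeds human interpreters.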

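The CLASI architecture is built on an LLM agent. It defines simultaneous interpretation as a series of simple, coordinated operations: reading the audio stream, retrieval (optional), reading memory, updating memory, and output. The large language model controls the whole process autonomously, balancing timely delivery of information against accurate, coherent translation. The underlying model is an encoder-based large language model pre-trained on massive amounts of unsupervised and supervised data.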

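The release of the CLASI agent marks a new milestone for AI simultaneous interpretation and suggests that AI can reach human-level performance in translation. It promises more efficient and more accurate translation services for international conferences, diplomatic exchanges, education, and other fields, and opens a new path for integrating AI into human language communication.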

English version:

Headline: ByteDance's Large-Model Simultaneous Interpretation Agent: Matching Human Precision in Real-Time Translation

Keywords: Large-Model Simultaneous Interpretation, AI Interpreter, On Par with Human Translation

With the rapid advancement of artificial intelligence (AI), particularly the strong performance of large language models (LLMs) on natural language processing tasks, the challenge of AI simultaneous interpretation has attracted significant attention. Traditional simultaneous interpretation software uses a cascaded model, in which automatic speech recognition (ASR) is followed by machine translation (MT). This design suffers from significant error propagation, since recognition errors carry over into the translation step. In addition, stringent low-latency requirements limit it to smaller, less capable models, which makes it hard to handle the complexity and variability of real-world scenarios.

Recently, ByteDance Research introduced an end-to-end simultaneous interpretation agent, Cross Language Agent – Simultaneous Interpretation (CLASI), whose translation quality is close to that of professional human interpreters. CLASI adopts an end-to-end architecture, which addresses the error propagation found in cascaded models. It builds on the Doubao foundation model and the speech understanding capabilities of the Doubao large-model speech team, together with the ability to acquire knowledge from external sources, resulting in a system comparable to human simultaneous interpreters.

In practical applications, CLASI provides fluent, accurate, and idiomatic translations of very fast speech, hard-to-pronounce tongue twisters, refined classical Chinese, and improvised casual conversation. It performs particularly well in meeting scenarios, which highlights its adaptability across domains and complex contexts.

The research team invited professional simultaneous interpreters to carry out human evaluation in the Chinese-to-English and English-to-Chinese directions, using the same metric that is applied to human interpreters: the proportion of valid information. The results show that CLASI significantly outperforms commercial systems and open-source state-of-the-art (SOTA) systems, and on some test sets it matches or surpasses human simultaneous interpreters.
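The article names the metric but does not give its formula. As a rough illustration only, the sketch below assumes the proportion of valid information is the share of source segments that an annotator judged to be conveyed correctly. The `SegmentJudgment` type and the function name are hypothetical and are not taken from the CLASI work.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SegmentJudgment:
    """One annotated segment (hypothetical structure, for illustration)."""
    source: str        # what the speaker said
    translation: str   # what the interpreter or system produced
    is_valid: bool     # annotator's verdict: was the meaning conveyed?


def valid_information_proportion(judgments: Sequence[SegmentJudgment]) -> float:
    """Fraction of segments judged valid (assumed definition of the metric)."""
    if not judgments:
        raise ValueError("need at least one judged segment")
    valid = sum(1 for j in judgments if j.is_valid)
    return valid / len(judgments)


# Toy data: three of four segments were judged as conveyed correctly.
sample = [
    SegmentJudgment("大家好", "Hello everyone", True),
    SegmentJudgment("今天讨论预算", "Today we discuss the budget", True),
    SegmentJudgment("增长了百分之五", "It grew by fifteen percent", False),
    SegmentJudgment("谢谢", "Thank you", True),
]
score = valid_information_proportion(sample)  # 0.75
```

Under this assumed definition, the same function can score a human interpreter and a system on the same test set, which is what makes the comparison in the article possible.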

The CLASI system architecture is built on an LLM agent. It defines simultaneous interpretation as a series of simple, coordinated operations: reading the audio stream, retrieval (optional), reading memory, updating memory, and output. The entire process is controlled autonomously by the large language model, which balances timely delivery of information against the accuracy and coherence of the translation. The underlying model is an encoder-based large language model, pre-trained on vast amounts of unsupervised and supervised data.
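The operation sequence described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not CLASI's implementation: the `Memory` class, the `interpret` function, and the toy `translate` and `retrieve` callables are all hypothetical, text chunks stand in for audio, and the fixed loop order replaces the LLM's own control over when to perform each operation.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Context = List[Tuple[str, str]]


@dataclass
class Memory:
    """Rolling record of recent (source chunk, translation) pairs (hypothetical)."""
    max_items: int = 8
    items: Context = field(default_factory=list)

    def read(self) -> Context:
        return list(self.items)

    def update(self, chunk: str, translation: str) -> None:
        self.items.append((chunk, translation))
        self.items = self.items[-self.max_items:]  # keep only the newest entries


def interpret(
    chunks: Iterable[str],
    translate: Callable[[str, Context, List[str]], str],
    retrieve: Optional[Callable[[str], List[str]]] = None,
    memory: Optional[Memory] = None,
) -> Iterator[str]:
    """Yield translations chunk by chunk, following the five operations."""
    if memory is None:
        memory = Memory()
    for chunk in chunks:                              # 1. read the input stream
        hints = retrieve(chunk) if retrieve else []   # 2. retrieval (optional)
        context = memory.read()                       # 3. read memory
        translation = translate(chunk, context, hints)
        memory.update(chunk, translation)             # 4. update memory
        if translation:                               # 5. output (skip if empty)
            yield translation


# Toy stand-ins for the model and the external knowledge source.
GLOSSARY = {"你好": "hello", "世界": "world"}


def toy_retrieve(chunk: str) -> List[str]:
    return [GLOSSARY[chunk]] if chunk in GLOSSARY else []


def toy_translate(chunk: str, context: Context, hints: List[str]) -> str:
    # Prefer a retrieved term; otherwise echo the chunk in angle brackets.
    return hints[0] if hints else f"<{chunk}>"


result = list(interpret(["你好", "世界", "啊"], toy_translate, toy_retrieve))
# result == ["hello", "world", "<啊>"]
```

In this sketch, a `translate` callable that returns an empty string models the agent choosing to wait for more input before speaking, which is one simple way to express the trade-off between latency and accuracy mentioned above.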

With the introduction of the CLASI agent, AI simultaneous interpretation has reached a new milestone, suggesting that AI can match human-level performance in translation. This promises more efficient and more accurate translation services for international conferences, diplomatic exchanges, education, and other fields, and it opens a path toward closer integration of AI with human language communication.

Source: https://www.jiqizhixin.com/articles/2024-07-25-8