News Title: “Mixed-Speech Tool Debuts: A Large TTS Model That Masters Dialects”

Keywords: TTS, Dialect, Mixed-Dialect Speech

News Content:
Giant Network's AI Lab has recently developed Bailing-TTS, billed as the first large text-to-speech (TTS) model that supports mixed speech combining Mandarin and multiple dialects. Besides generating high-quality Mandarin speech, the model can synthesize speech in dialects such as the Henan dialect, Shanghainese, and Cantonese, a new step forward for Chinese speech synthesis.

To train the model, the development team built a large dataset covering 20 dialects and more than 200,000 hours of Mandarin and dialect material. The result fills a gap in Chinese dialect speech synthesis and offers technical support for communication between dialect regions and for passing on their cultural heritage.

To achieve this, the team used several new techniques. First, they unified the token specification across dialects and made the Mandarin and dialect token sets partially overlap, which provides a basis for high-quality dialect synthesis even when dialect data is limited. Second, they proposed a fine-grained token-wise alignment technique based on large-scale multimodal pre-training, which makes the alignment between text tokens and speech tokens more precise. Third, they designed a hierarchical mixture-of-experts architecture and a hierarchical reinforcement learning strategy to strengthen the model's ability to render dialects.
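
The article does not publish the Bailing-TTS token set, so the following is only a toy sketch of the first idea: a single token vocabulary in which Mandarin and dialect inventories partially overlap. The inventories, the helper names (`build_unified_vocab`, `overlap_with_mandarin`, `encode`), and the symbols themselves are all hypothetical and chosen for illustration. The point it tries to show is that a symbol shared across varieties receives one id, so whatever the model learns about that id from plentiful Mandarin data is also available when synthesizing a dialect with little data.

```python
# Toy inventories, invented for illustration only.
# The real Bailing-TTS token specification is not given in the article.
MANDARIN = {"b", "p", "m", "f", "a", "i", "u", "ang",
            "tone1", "tone2", "tone3", "tone4"}
DIALECTS = {
    "cantonese": {"b", "p", "m", "f", "a", "i", "u", "ng", "aa",
                  "tone1", "tone2", "tone3", "tone4", "tone5", "tone6"},
    "shanghainese": {"b", "p", "m", "f", "a", "i", "u", "v", "z",
                     "tone1", "tone2", "tone3"},
}


def build_unified_vocab(mandarin, dialects):
    """Merge every inventory into one id space; a shared symbol gets one id."""
    symbols = set(mandarin)
    for inventory in dialects.values():
        symbols |= inventory
    # Sort so that ids are deterministic across runs.
    return {symbol: idx for idx, symbol in enumerate(sorted(symbols))}


def overlap_with_mandarin(mandarin, inventory):
    """Fraction of a dialect's tokens that also appear in the Mandarin set."""
    return len(inventory & mandarin) / len(inventory)


def encode(tokens, vocab):
    """Map a token sequence to ids in the unified vocabulary."""
    return [vocab[token] for token in tokens]


vocab = build_unified_vocab(MANDARIN, DIALECTS)

if __name__ == "__main__":
    print("unified vocabulary size:", len(vocab))
    for name, inventory in DIALECTS.items():
        share = overlap_with_mandarin(MANDARIN, inventory)
        print(f"{name}: {share:.0%} of tokens shared with Mandarin")
```

In this toy setup a dialect only adds the symbols Mandarin lacks (for example the extra tone tokens), which is one plausible reading of why partial overlap helps under limited data; the actual design choices in Bailing-TTS may differ.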

Bailing-TTS shows the potential of AI in language processing and points to a new direction for speech synthesis. As the technology matures, AI systems can be expected to reproduce complex dialects more accurately and to give people a more natural, fluent communication experience.

Source: https://www.jiqizhixin.com/articles/2024-08-13-4