
Introduction

Personalizing automated speech recognition (ASR) systems to enhance their performance for individual users, especially with limited speaker data, presents a significant challenge. Low-rank adaptation (LoRA) has emerged as a powerful technique for refining large language models, and a variant, Weight-Decomposed Low-Rank Adaptation (DoRA), promises even greater performance enhancements. This paper explores the application of LoRA and DoRA to improve the cascaded conformer transducer model for speaker personalization, focusing on the development of efficient and effective speaker-specific ASR systems.

Background

When fine-tuning ASR models with limited speaker data, the selection of hyperparameters is crucial. LoRA addresses this by keeping the pretrained weight matrix frozen and learning the weight update as a product of two low-rank matrices, allowing targeted adjustments without the need for extensive data. The proposed method adds a small number of speaker-specific weights to the existing model and fine-tunes only those weights to improve recognition accuracy. This approach is particularly valuable in real-world applications where data is scarce.
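The low-rank update described above can be sketched in a few lines of NumPy. This is an illustrative toy example, not the paper's implementation: the dimensions, initialization scale, and the function name `lora_forward` are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 8, 8, 2          # r << d_in: the low-rank bottleneck

W0 = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01 # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-init

def lora_forward(x):
    # Effective weight is W0 + B @ A; only A and B are updated per speaker,
    # so each speaker adds just r * (d_in + d_out) parameters.
    return x @ (W0 + B @ A).T

x = rng.standard_normal((1, d_in))
# With B zero-initialized, the adapted model initially matches the base model.
assert np.allclose(lora_forward(x), x @ W0.T)
```

Zero-initializing `B` is the standard LoRA choice: the adapter starts as a no-op, so personalization begins from exactly the base model's behavior.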

Proposed Approach

Two variations of low-rank adaptations are proposed for speaker personalization:

  1. LoRA-based Speaker Personalization (Proposed 1): In this method, a small set of trainable parameters is added alongside the model's weight matrix. The update is expressed as a low-rank decomposition, so the adapted matrix is the sum of the frozen base matrix and a low-rank product of two much smaller matrices, facilitating targeted adjustments that improve recognition performance for a specific speaker.

  2. DoRA-based Speaker Personalization (Proposed 2): Building upon LoRA, DoRA decomposes the pretrained weight into two components, a magnitude and a direction, and applies the low-rank update only to the direction while learning the magnitude separately. This decomposition allows more precise control over how the model adapts to individual speaker characteristics, potentially yielding even greater reductions in word error rate.
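The magnitude/direction split in Proposed 2 can be sketched as follows. This is a minimal illustration of the general DoRA idea, not the paper's code; the column-wise normalization and the name `dora_weight` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r = 8, 8, 2

W0 = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01 # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-init

# DoRA splits the weight into a per-column magnitude vector m and a
# direction matrix; the low-rank update touches only the direction.
m = np.linalg.norm(W0, axis=0)            # trainable magnitude, init from W0

def dora_weight():
    V = W0 + B @ A                        # direction, adapted by the LoRA delta
    V_unit = V / np.linalg.norm(V, axis=0)  # normalize each column to unit length
    return m * V_unit                     # rescale each column by its magnitude

# With B zero-initialized, the DoRA weight reproduces W0 exactly.
assert np.allclose(dora_weight(), W0)
```

Because magnitude and direction are trained as separate quantities, the adapter can change how strongly a column contributes independently of which direction it points, which is the extra flexibility DoRA offers over plain LoRA.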

Results

Experimental evaluation of the proposed methods shows significant reductions in word error rate (WER) across speakers with limited data, with an average relative improvement of 20%, demonstrating the efficacy of LoRA and DoRA for speaker personalization in ASR systems. This advancement is particularly impactful for applications requiring high accuracy from minimal data, such as voice assistants and personal communication systems.

Conclusion

The integration of LoRA and DoRA into the cascaded conformer transducer model has proven to be a valuable strategy for speaker personalization in ASR systems. By enabling targeted adjustments with limited data, these methods significantly improve recognition accuracy, making them indispensable tools for enhancing the performance of voice-based AI technologies in real-world scenarios. Further research and development in this area could lead to even more sophisticated and personalized ASR systems, offering users a more intuitive and efficient interaction experience.

