Interspeech 2024 Innovation: Relational Proxy Loss for Audio-Text Keyword Spotting

By 智能小编

September 4, 2024 #Samsung, #Audio

Accurately identifying pre-specified keywords in an audio stream is a core requirement for voice-activated devices. This has driven significant research in Keyword Spotting (KWS), which can be broadly divided into fixed systems and flexible (or user-defined) systems. This blog post introduces a novel approach to the latter, focusing on text-enrolled flexible KWS and, specifically, on a new loss function termed Relational Proxy Loss (RPL).

The Challenge and the Solution

Text-enrolled flexible KWS systems typically train two encoders, one for text and one for audio, using deep metric learning (DML) objectives such as contrastive loss, triplet-based loss, or proxy-based loss. The key challenge is to ensure that the acoustic and text embeddings of the same keyword lie close together, while those of different keywords lie further apart. This setting is often referred to as audio-text based KWS.
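To make the DML objective concrete, here is a minimal sketch of one proxy-style loss in NumPy. It is an illustration under stated assumptions, not the formulation used in the paper: the function names, the cosine-similarity scoring, and the `scale` value are all hypothetical choices. Each text embedding acts as the target for its keyword class, and a softmax cross-entropy rewards acoustic embeddings that are most similar to the text embedding of their own keyword.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def proxy_softmax_loss(acoustic_emb, text_emb, labels, scale=10.0):
    """Illustrative proxy-style loss (hypothetical helper).

    acoustic_emb: (N, D) acoustic embeddings, one per utterance
    text_emb:     (K, D) text embeddings, one per keyword class
    labels:       (N,) integer keyword index of each utterance
    scale:        assumed similarity scaling factor
    """
    ae = l2_normalize(np.asarray(acoustic_emb, dtype=float))
    te = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = scale * ae @ te.T                       # (N, K) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    labels = np.asarray(labels)
    # Mean negative log-likelihood of the correct keyword for each utterance.
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

The loss is small when each acoustic embedding points in the same direction as the text embedding of its own keyword, and large when it points toward a different keyword.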

Introducing Relational Proxy Loss (RPL)

The proposed Relational Proxy Loss (RPL) is designed to enhance the performance of text-enrolled flexible KWS by incorporating structural relations between acoustic embeddings (AEs) and text embeddings (TEs). Inspired by Park et al.’s work, RPL leverages the inherent relational information within AEs and TEs to improve the discriminative power of the system.

Unlike traditional DML-based approaches, RPL uses the relational information within the embeddings to align the audio and text representations of the same keyword while keeping those of different keywords apart. It does so by treating each TE as a proxy for its word class, on the assumption that AEs belonging to the same class follow the same relational patterns as their proxy.
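The idea of matching relational structure can be sketched as follows. This is only one possible reading, not the paper's exact loss: the use of pairwise Euclidean distances, normalization by the mean distance, and a Huber penalty are assumptions made for illustration, and all function names are hypothetical. For a batch of utterances, the sketch compares the pairwise distance pattern among the AEs with the pairwise distance pattern among their corresponding TE proxies.

```python
import numpy as np

def pairwise_dist(x):
    # (N, N) matrix of Euclidean distances between all rows of x.
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def normalized_relations(x, eps=1e-12):
    # Divide by the mean off-diagonal distance so the relation is scale-free.
    d = pairwise_dist(x)
    off_diag = ~np.eye(len(x), dtype=bool)
    return d / (d[off_diag].mean() + eps)

def huber(x, delta=1.0):
    # Quadratic near zero, linear for large errors.
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def relational_loss(acoustic_emb, text_emb, labels):
    """Illustrative distance-wise relational term (hypothetical helper).

    acoustic_emb: (N, D_a) acoustic embeddings, N >= 2
    text_emb:     (K, D_t) text embeddings used as class proxies
    labels:       (N,) integer keyword index of each utterance
    """
    acoustic_emb = np.asarray(acoustic_emb, dtype=float)
    proxies = np.asarray(text_emb, dtype=float)[np.asarray(labels)]  # proxy per utterance
    r_ae = normalized_relations(acoustic_emb)
    r_te = normalized_relations(proxies)
    # Penalize mismatch between the two relational structures.
    return float(huber(r_ae - r_te).mean())
```

Because only distances within each modality are compared, this sketch does not require the acoustic and text embeddings to share the same dimensionality. In practice such a term would presumably be combined with a direct AE-to-TE objective rather than used alone.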

Benefits and Applications

Adopting RPL in text-enrolled flexible KWS systems is expected to bring several benefits. It could lead to more efficient and accurate keyword recognition, improving the user experience of voice-activated devices. The approach is particularly promising for scenarios where users enroll arbitrary keywords, since enrolling by text keeps the enrollment process simple while maintaining high recognition rates.
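As a rough sketch of how text enrollment might be used at inference time, the example below compares an utterance's acoustic embedding against the text embeddings of the enrolled keywords and accepts the best match only if its cosine similarity clears a threshold. The function names, the dictionary layout, and the threshold value are hypothetical; the source does not describe the decision rule.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    # Cosine of the angle between two 1-D vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def detect_keyword(acoustic_emb, enrolled_text_embs, threshold=0.5):
    """Return (keyword or None, best similarity score).

    acoustic_emb:       (D,) acoustic embedding of the incoming utterance
    enrolled_text_embs: dict mapping keyword string -> (D,) text embedding
    threshold:          assumed acceptance threshold on cosine similarity
    """
    best_kw, best_score = None, -1.0
    for kw, te in enrolled_text_embs.items():
        score = cosine_similarity(acoustic_emb, te)
        if score > best_score:
            best_kw, best_score = kw, score
    # Reject the utterance if even the best match is too dissimilar.
    if best_score < threshold:
        return None, best_score
    return best_kw, best_score
```

Enrolling a new keyword then only requires adding one more text embedding to the dictionary; no audio recordings of the keyword are needed.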

Conclusion

Relational Proxy Loss represents a notable advance in audio-text based Keyword Spotting. By integrating the structural relations within embeddings, RPL offers a promising solution to the challenges faced by text-enrolled flexible KWS systems. It could pave the way for more capable and user-friendly voice recognition in voice-activated devices.

The approach contributes to ongoing research in speech recognition and highlights the value of relational information in deep learning models for audio processing tasks.