A Survey on Speech LLMs

https://arxiv.org/pdf/2410.18908


Contents

  1. Abstract
  2. Introduction
    1. LLMs
    2. SLU
    3. Challenges of Speech LLM
    4. Contributions


Abstract

Integrate (1) & (2)

  • (1) LLMs
  • (2) Spoken Language Understanding (SLU)


Procedures

  • Step 1) Audio Feature Extraction
  • Step 2) Multimodal Information Fusion
  • Step 3) LLM Inference (Speech LLMs)
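
The three steps above can be sketched end-to-end. This is a toy stand-in (all function names and the 1-d "projection" are hypothetical), not any particular model; real systems use e.g. a Whisper/Conformer encoder, a learned projector, and a pretrained LLM decoder:

```python
# Minimal sketch of the three-step Speech LLM pipeline (toy components).

def extract_audio_features(waveform, frame_size=4):
    # Step 1) Audio feature extraction: collapse raw samples into
    # fixed-size frame features (per-frame mean here, standing in for
    # log-mel / encoder features).
    frames = [waveform[i:i + frame_size] for i in range(0, len(waveform), frame_size)]
    return [sum(f) / len(f) for f in frames]

def fuse_modalities(audio_features, text_embeddings):
    # Step 2) Multimodal fusion: project audio features into the LLM's
    # embedding space and prepend them to the text prompt embeddings.
    projected = [[a] for a in audio_features]  # trivial 1-d "projection"
    return projected + text_embeddings

def llm_inference(fused_sequence):
    # Step 3) LLM inference: placeholder; a real system autoregressively
    # decodes tokens conditioned on the fused embedding sequence.
    return f"<decoded from {len(fused_sequence)} embeddings>"

waveform = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
text_embeddings = [[1.0], [2.0]]  # e.g. an instruction prompt
fused = fuse_modalities(extract_audio_features(waveform), text_embeddings)
print(llm_inference(fused))
```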


Results

  • (1) Richer audio feature extraction
  • (2) End-to-end fusion of audio & text modalities!


1. Introduction

(1) LLMs

LLMs do well on …

  • Generating contextually appropriate sentences
  • Maintaining coherence over multiple conversational turns

\(\rightarrow\) Crucial for tasks like “dialogue systems, automatic summarization, machine translation”


Achieved remarkable success in “multimodal” tasks

  • e.g., visual question answering, image generation


(2) SLU

Spoken Language Understanding (SLU)

= Interpreting spoken language

  • To extract meaning, intent, and relevant information beyond simple transcription
  • Two steps
    • Step 1) Automatic Speech Recognition (ASR)
    • Step 2) Natural Language Understanding (NLU)
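
The two-step (cascaded) SLU pipeline can be sketched as below. Both components are hypothetical stubs (a lookup table for ASR, a rule for NLU); real systems would use trained models for each stage:

```python
# Sketch of cascaded SLU: ASR first, then NLU over the transcript.

def asr(audio_id):
    # Step 1) ASR: map audio to a transcript. Stubbed with a lookup
    # table standing in for acoustic + language models.
    transcripts = {
        "clip_001": "book a table for two",
        "clip_002": "play some jazz",
    }
    return transcripts[audio_id]

def nlu(transcript):
    # Step 2) NLU: extract intent (and, in real systems, slots) beyond
    # the raw transcription.
    if transcript.startswith("book"):
        return "make_reservation"
    if transcript.startswith("play"):
        return "play_music"
    return "unknown"

# Cascaded design: any ASR error propagates directly into NLU.
print(nlu(asr("clip_001")))
```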


Modern systems: Adept at …

  • a) Handling diverse accents & languages

  • b) Improving **efficiency & accuracy** in workflows

    • e.g., Medical transcription and customer service automation


(3) Challenges of Speech LLM

Challenge 1) “Long-form recognition”

  • Struggles with maintaining context over extended periods

    \(\rightarrow\) Accuracy degradation & latency issues in real-time applications.
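
One common mitigation for long-form recognition (a general technique, not specific to this survey) is to decode in overlapping windows so that each chunk carries some left context from the previous one. A minimal sketch:

```python
# Split a long frame sequence into overlapping decoding windows.
# `chunk` and `overlap` are illustrative values, not tuned settings.

def overlapping_chunks(frames, chunk=6, overlap=2):
    # Yield windows of `chunk` frames, each sharing `overlap` frames
    # with its predecessor; the final window may be shorter.
    step = chunk - overlap
    for start in range(0, len(frames), step):
        yield frames[start:start + chunk]
        if start + chunk >= len(frames):
            break

for window in overlapping_chunks(list(range(10))):
    print(window)  # the 2-frame overlap carries context across chunks
```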


Challenge 2) “Hotword/keyword recognition”

  • Critical for wake word recognition (e.g., "Hey Siri!")

  • Faces difficulties in noisy environments

    ( = Balance btw sensitivity & specificity )

    • Especially when hotwords are contextually similar to other phrases
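
The sensitivity/specificity balance above can be made concrete: a single decision threshold over detector scores trades false rejects against false accepts. The scores below are made-up illustrations (a confusable phrase scoring moderately high), not real detector outputs:

```python
# Hotword detection as score thresholding: raising the threshold
# reduces false accepts but increases false rejects, and vice versa.

def detect(score, threshold):
    return score >= threshold

# (score, is_true_hotword) pairs; 0.6 is a contextually similar phrase
examples = [(0.9, True), (0.8, True), (0.6, False), (0.3, False)]

def error_counts(threshold):
    false_rejects = sum(1 for s, y in examples if y and not detect(s, threshold))
    false_accepts = sum(1 for s, y in examples if not y and detect(s, threshold))
    return false_rejects, false_accepts

print(error_counts(0.5))   # sensitive: catches both hotwords, 1 false accept
print(error_counts(0.85))  # specific: no false accepts, misses 1 hotword
```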


(4) Contributions

  1. Comprehensive survey analyzing Speech LLMs in the SLU domain
    • (1) Development of Speech LLMs
    • (2) Model architecture
    • (3) Comparative analysis ( vs. Traditional speech models )
  2. Training methods for aligning speech & text modalities
    • Emphasis on the potential of RL-based methods (e.g., DPO, PPO)
  3. Analysis of the LLM’s “dormancy” when applied in the speech domain
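
For the DPO method mentioned in contribution 2, the standard per-pair preference loss can be computed from four per-sequence log-probabilities. The numbers below are illustrative, not from any real model:

```python
import math

# DPO preference loss (standard formulation):
#   loss = -log sigmoid( beta * [ (logp_w - ref_logp_w)
#                               - (logp_l - ref_logp_l) ] )
# where w = chosen response, l = rejected response.

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(x)) == softplus(-x)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen response more than the reference does
# (margin = 3), so the loss is low:
print(round(dpo_loss(-10.0, -14.0, -12.0, -13.0), 4))  # → 0.5544
```

The loss shrinks as the policy's preference margin over the reference model grows, which is what aligns the model toward chosen responses without an explicit reward model.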