A Survey on Speech LLMs
https://arxiv.org/pdf/2410.18908
Contents
- Abstract
- Introduction
- LLMs
- SLU
- Challenges of Speech LLM
- Contributions
Abstract
Speech LLMs integrate (1) & (2)
- (1) LLMs
- (2) Spoken Language Understanding (SLU)
Procedures
- Step 1) Audio Feature Extraction
- Step 2) Multimodal Information Fusion
- Step 3) LLM Inference (Speech LLMs)
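A minimal PyTorch-style sketch of the three-step pipeline above, assuming a pretrained audio encoder, a linear adapter for fusion, and a decoder-only LLM operating on embeddings (all names here are illustrative placeholders, not specific models from the survey):

```python
import torch
import torch.nn as nn

class SpeechLLM(nn.Module):
    """Illustrative three-stage Speech LLM: feature extraction -> fusion -> LLM inference."""

    def __init__(self, audio_encoder: nn.Module, llm: nn.Module,
                 audio_dim: int, llm_dim: int):
        super().__init__()
        self.audio_encoder = audio_encoder            # Step 1) placeholder for a pretrained speech encoder
        self.adapter = nn.Linear(audio_dim, llm_dim)  # Step 2) project audio features into the LLM embedding space
        self.llm = llm                                # Step 3) placeholder for a decoder-only LLM taking embeddings

    def forward(self, waveform: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Step 1) Audio Feature Extraction: raw waveform -> frame-level features
        audio_feats = self.audio_encoder(waveform)             # (batch, frames, audio_dim)
        # Step 2) Multimodal Information Fusion: prepend projected audio tokens to the text embeddings
        audio_tokens = self.adapter(audio_feats)                # (batch, frames, llm_dim)
        fused = torch.cat([audio_tokens, text_embeds], dim=1)  # (batch, frames + text_len, llm_dim)
        # Step 3) LLM Inference over the fused sequence
        return self.llm(fused)
```

In this reading, the "Results" below correspond to choosing a stronger audio encoder (richer feature extraction) and training the adapter jointly with the LLM (end-to-end fusion).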
Results
- (1) Richer audio feature extraction
- (2) End-to-end fusion of audio & text modalities!
1. Introduction
(1) LLMs
LLMs do well on …
- Parsing contextually appropriate sentences
- Maintaining coherence over multiple conversational turns
\(\rightarrow\) Crucial for tasks like “dialogue systems, automatic summarization, machine translation”
Achieved remarkable success in “multimodal” tasks
- e.g., visual question answering, image generation
(2) SLU
Spoken Language Understanding (SLU)
= Interpreting spoken language
- To extract meaning, intent, and relevant information beyond simple transcription
- Two steps (see the cascade sketch below)
- Step 1) Automatic Speech Recognition (ASR)
- Step 2) Natural Language Understanding (NLU)
Modern systems: Adept at …
- a) Handling diverse accents & languages
- b) Improving **efficiency & accuracy** in workflows
- e.g., Medical transcription and customer service automation
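For contrast with the end-to-end Speech LLM pipeline, a minimal sketch of the traditional two-step SLU cascade (ASR → NLU); `transcribe` and `classify_intent` are hypothetical method names used only for illustration, not a specific toolkit's API:

```python
from dataclasses import dataclass

@dataclass
class SLUResult:
    transcript: str  # Step 1 output (ASR)
    intent: str      # Step 2 output (NLU)

def cascaded_slu(waveform, asr_model, nlu_model) -> SLUResult:
    """Traditional SLU cascade: transcribe first, then interpret the text.

    Note: ASR errors propagate into the NLU step, which is one motivation
    for the end-to-end Speech LLM approach sketched earlier.
    """
    # Step 1) Automatic Speech Recognition: audio -> text
    transcript = asr_model.transcribe(waveform)
    # Step 2) Natural Language Understanding: text -> intent / slots
    intent = nlu_model.classify_intent(transcript)
    return SLUResult(transcript=transcript, intent=intent)
```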
(3) Challenges of Speech LLM
Challenge 1) “Long-form recognition”
- Struggles with maintaining context over extended periods
\(\rightarrow\) Accuracy degradation & latency issues in real-time applications.
Challenge 2) “Hotword/keyword recognition”
- Critical for wake word recognition (e.g., “Hey Siri!”)
- Faces difficulties in noisy environments
( = Balance between sensitivity & specificity )
- Especially when hotwords are contextually similar to other phrases
(4) Contributions
- Comprehensive survey analyzing Speech LLMs in the SLU domain
- (1) Development of Speech LLMs
- (2) Model architecture
- (3) Comparative analysis (vs. traditional speech models)
- Training methods for aligning speech & text modalities
- Emphasis on the potential of reinforcement learning (RL) methods (e.g., DPO, PPO); see the DPO sketch below
- Analysis of LLM “dormancy” when LLMs are applied in the speech domain
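Since DPO is named above as a candidate alignment method, here is a minimal sketch of the standard DPO loss (the general formulation, not a recipe from the survey); it assumes per-response log-probabilities under the policy and a frozen reference model have already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy to prefer the 'chosen' response
    over the 'rejected' one, measured relative to a frozen reference model.

    Each argument is a (batch,)-shaped tensor of summed log-probabilities of a
    full response under the policy or reference model.
    """
    # Implicit reward = beta * log( pi(y|x) / pi_ref(y|x) )
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Maximize the reward margin between chosen and rejected responses
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()
```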