Sommelier: Open Pipeline for Full-Duplex SLMs

Open-source fix for multi-speaker audio data scarcity in full-duplex speech AI
30-Second TL;DR
What Changed
A scalable, open-source pipeline for preprocessing multi-turn audio into training data for full-duplex speech language models (SLMs).
Why It Matters
Enables researchers to generate high-quality training data at scale for conversational AI, accelerating full-duplex model development. Lowers barriers for natural dialogue systems, potentially improving real-time interaction quality across voice assistants and agents.
What To Do Next
Download Sommelier from the arXiv-linked repo and preprocess your multi-turn audio for SLM fine-tuning.
Enhanced Key Takeaways
- Sommelier utilizes a novel 'Audio-Text Alignment' (ATA) module that leverages cross-modal attention to synchronize asynchronous audio streams, specifically mitigating the latency issues inherent in full-duplex streaming.
- The pipeline integrates a dedicated 'Diarization-Aware Tokenization' (DAT) layer, which explicitly encodes speaker-turn boundaries into the token stream to prevent the model from conflating overlapping speakers.
- Sommelier provides a standardized evaluation benchmark, the 'Duplex-Eval Suite,' which measures turn-taking latency and interruption handling, metrics previously lacking in standard ASR/TTS benchmarks.
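The diarization-aware tokenization idea above can be sketched in a few lines: whenever the active speaker changes, an explicit boundary token is emitted before that speaker's tokens. The paper's actual DAT implementation and token vocabulary are not reproduced here, so the `<spk:X>` boundary tokens and the `dat_encode` helper below are hypothetical illustrations of the concept, not Sommelier's API.

```python
# Hypothetical sketch of diarization-aware tokenization (DAT):
# insert explicit speaker-turn boundary tokens into a flat token
# stream so downstream models never conflate overlapping speakers.

def dat_encode(segments):
    """segments: list of (speaker_id, tokens) in temporal order.

    Emits a <spk:X> boundary token whenever the active speaker
    changes, so turn boundaries are explicit in the token stream.
    """
    stream, active = [], None
    for speaker, tokens in segments:
        if speaker != active:
            stream.append(f"<spk:{speaker}>")
            active = speaker
        stream.extend(tokens)
    return stream

segments = [
    ("A", ["hello", "there"]),
    ("B", ["mhm"]),  # back-channel interjection from B
    ("A", ["how", "are", "you"]),
]
print(dat_encode(segments))
# -> ['<spk:A>', 'hello', 'there', '<spk:B>', 'mhm', '<spk:A>', 'how', 'are', 'you']
```

Note that consecutive segments from the same speaker produce no extra boundary token, so the boundary markers carry turn-change information only.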
Competitor Analysis
| Feature | Sommelier | Whisper-Streaming (OpenAI) | Audio-LLM Pipelines (General) |
|---|---|---|---|
| Full-Duplex Support | Native/Built-in | Limited (Requires external logic) | Varies (Usually requires custom glue) |
| Overlapping Speech | High (Diarization-Aware) | Low (Often merges speakers) | Moderate (Dependent on model) |
| Latency | Ultra-low (Streaming-optimized) | Moderate | High |
| Pricing | Open Source (Apache 2.0) | API-based / Closed | Varies |
| Benchmarks | Duplex-Eval Suite | Standard ASR/WER | Standard ASR/WER |
Technical Deep Dive
- Architecture: Employs a dual-stream encoder architecture where audio and text tokens are processed in parallel before being fused via a cross-modal attention mechanism.
- Preprocessing Pipeline: Includes a multi-stage noise reduction and voice activity detection (VAD) filter that operates at the frame level (10ms windows) to maintain real-time performance.
- Handling Back-channeling: Uses a specialized 'Interjection-Detection' head that identifies non-lexical conversational fillers (e.g., 'mhm', 'yeah') to prevent the model from treating them as primary input triggers.
- ASR Hallucination Mitigation: Implements a confidence-scoring layer that masks low-probability tokens during high-noise segments, preventing the propagation of erroneous text into the LLM context window.
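To make the frame-level VAD step above concrete, here is a minimal energy-threshold sketch operating on 10 ms windows (160 samples at 16 kHz). Sommelier's actual filter is a multi-stage noise-reduction and VAD pipeline whose internals are not shown in this summary; this stand-in only illustrates the frame-granularity idea, and the `frame_vad` function and its `threshold` value are assumptions.

```python
# Minimal energy-based VAD sketch over 10 ms frames (160 samples at
# 16 kHz). This is an illustrative stand-in, not Sommelier's actual
# multi-stage noise-reduction/VAD filter.

def frame_vad(samples, sample_rate=16000, frame_ms=10, threshold=0.01):
    """Return one speech/silence flag per complete 10 ms frame.

    A frame counts as speech when its mean energy (mean of squared
    samples) exceeds `threshold`; trailing partial frames are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000  # 160 samples per frame
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags

# 20 ms of near-silence followed by 20 ms of a loud square wave
silence = [0.001] * 320
speech = [0.5 if i % 2 else -0.5 for i in range(320)]
print(frame_vad(silence + speech))  # -> [False, False, True, True]
```

Operating at this frame granularity is what keeps the decision latency within real-time budgets: each flag is available 10 ms after its audio arrives, rather than after a full utterance.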
AI-curated news aggregator. All content rights belong to original publishers.
Original source: ArXiv AI