Distil Labs launched VoiceTeller, a banking voice assistant that replaces a cloud LLM with a fine-tuned Qwen3-0.6B, reaching 90.9% tool-call accuracy vs. 87.5% for its 120B teacher. Brain-stage latency drops to 40ms, and the full pipeline runs in ~315ms locally on Apple Silicon. The code, training data, and a GGUF model are released as open source.
Key Points
- Fine-tuned Qwen3-0.6B achieves 90.9% single-turn tool-call accuracy, beating the 120B GPT-oss teacher at 87.5%
- Brain-stage latency reduced from 375-750ms to 40ms, enabling natural conversation flow
- Full local pipeline: Qwen3-ASR, llama.cpp for intent, Qwen3-TTS on Apple Silicon MPS (see the sketch after this list)
- SLM outputs structured JSON; orchestrator manages multi-turn dialogue and templates
- GitHub repo includes code and data; Hugging Face hosts the pre-trained GGUF model
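The snippet below is a minimal sketch of how those stages could be wired together on-device; it is not the VoiceTeller code. The `transcribe`, `brain`, and `speak` stubs stand in for Qwen3-ASR, the fine-tuned Qwen3-0.6B served via llama.cpp, and Qwen3-TTS, and the example function and slot names are assumptions.

```python
# Illustrative turn loop (stand-in stubs, not the released pipeline):
# ASR text -> SLM "brain" emits a JSON tool call -> templated reply -> TTS.
import json

def transcribe(audio: bytes) -> str:
    # Stand-in for Qwen3-ASR running locally.
    return "what's the balance on my savings account"

def brain(text: str) -> str:
    # Stand-in for the fine-tuned Qwen3-0.6B via llama.cpp;
    # the real model is trained to emit structured JSON only.
    return json.dumps({"function": "get_balance", "slots": {"account": "savings"}})

def speak(reply: str) -> bytes:
    # Stand-in for Qwen3-TTS on Apple Silicon (MPS).
    return reply.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    """One user turn: speech in, speech out, entirely on-device."""
    text = transcribe(audio)
    call = json.loads(brain(text))                 # {"function": ..., "slots": ...}
    reply = f"Your {call['slots']['account']} balance is $1,234.56."  # fixed template
    return speak(reply)

if __name__ == "__main__":
    print(handle_turn(b"...").decode("utf-8"))
```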
Impact Analysis
Demonstrates that SLMs can excel at structured voice tasks, cutting cost and latency for edge banking apps. Enables offline, private voice AI without cloud reliance, and may spur adoption of tiny models in production voice pipelines.
Technical Details
The model is fine-tuned to emit JSON tool calls (function + slots) only, with no free-text generation. Inference runs on llama.cpp, and multi-turn handling is bounded by deterministic orchestrator logic rather than the model; a sketch of that logic follows below. The base Qwen3-0.6B scores 48.7% accuracy, underscoring why fine-tuning is necessary.
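The following is a hedged sketch of what such deterministic multi-turn logic could look like: the SLM only produces `{"function": ..., "slots": ...}`, and plain Python decides whether a required slot is missing and which fixed template to speak. The function names, slot schemas, and templates here are illustrative assumptions, not taken from the released repo.

```python
# Deterministic orchestration sketch: no free text ever comes from the SLM.
REQUIRED_SLOTS = {
    "get_balance": ["account"],
    "transfer_funds": ["from_account", "to_account", "amount"],
}

TEMPLATES = {
    "ask_slot": "Which {slot} would you like to use?",
    "confirm_transfer": "Transferring {amount} from {from_account} to {to_account}. Confirm?",
    "balance": "Your {account} balance is {balance}.",
}

class Orchestrator:
    """Bounds multi-turn dialogue with fixed rules and reply templates."""

    def __init__(self) -> None:
        self.pending: dict[str, str] = {}  # slots collected so far this task

    def step(self, tool_call: dict) -> str:
        fn = tool_call["function"]
        self.pending.update(tool_call.get("slots", {}))
        missing = [s for s in REQUIRED_SLOTS[fn] if s not in self.pending]
        if missing:
            # Ask for exactly one missing slot per turn, from a fixed template.
            return TEMPLATES["ask_slot"].format(slot=missing[0].replace("_", " "))
        if fn == "transfer_funds":
            return TEMPLATES["confirm_transfer"].format(**self.pending)
        # For the demo, pretend a backend lookup returned the balance.
        return TEMPLATES["balance"].format(account=self.pending["account"], balance="$1,234.56")

orc = Orchestrator()
print(orc.step({"function": "transfer_funds", "slots": {"amount": "$50"}}))
# -> "Which from account would you like to use?"
```

Keeping this logic outside the model is what lets a 0.6B network stay reliable: it only ever has to map one utterance to one JSON tool call, never to manage dialogue state.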




