0.6B SLM Tops 120B in Voice Tool Calls

🦙Read original on Reddit r/LocalLLaMA

💡A tiny 0.6B model beats a 120B LLM on voice tool calls and cuts local voice AI latency roughly 10x; now open source

⚡ 30-Second TL;DR

What changed

A fine-tuned Qwen3-0.6B achieves 90.9% single-turn tool call accuracy, beating the 120B gpt-oss teacher model at 87.5%.

Why it matters

Shows that small language models (SLMs) can excel at structured voice tasks, cutting cost and latency for edge banking apps. Enables offline, private voice AI with no cloud dependency. May encourage adoption of tiny models in production voice pipelines.

What to do next

Clone the GitHub repo and fine-tune Qwen3-0.6B on your own domain data using the provided scripts.

Who should care: Developers & AI Engineers

Distil Labs launched VoiceTeller, a banking voice assistant that replaces a cloud LLM with a fine-tuned Qwen3-0.6B. The small model reaches 90.9% tool call accuracy versus 87.5% for its 120B teacher. Latency for the "brain" (intent) stage drops to 40ms, and the full pipeline runs locally on Apple Silicon in about 315ms. The code, training data, and GGUF model are released as open source.

Key Points

1. Fine-tuned Qwen3-0.6B achieves 90.9% single-turn tool call accuracy, beating the 120B gpt-oss teacher model at 87.5%
2. Brain stage latency drops from 375-750ms to 40ms, enabling natural conversation flow
3. Full local pipeline: Qwen3-ASR for speech recognition, llama.cpp for intent, and Qwen3-TTS for speech output on Apple Silicon MPS
4. The SLM outputs only structured JSON; an orchestrator manages multi-turn dialogue and response templates
5. The GitHub repo includes the code and data; Hugging Face hosts the pre-trained GGUF model
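The latency figures above can be put side by side with a little arithmetic. This is only a rough sketch: the source gives the brain stage and the approximate pipeline total, but no per-stage breakdown for ASR and TTS, so the code assumes (hypothetically) that everything outside the brain stage takes the same time in both setups.

```python
# Figures quoted in the summary (milliseconds).
BRAIN_CLOUD_MS = (375, 750)   # cloud LLM brain stage, low and high end
BRAIN_LOCAL_MS = 40           # fine-tuned Qwen3-0.6B brain stage
TOTAL_LOCAL_MS = 315          # approximate full local pipeline

# Assumption: ASR + TTS + glue cost the same regardless of which brain is used.
other_stages_ms = TOTAL_LOCAL_MS - BRAIN_LOCAL_MS


def total_with_brain(brain_ms: int) -> int:
    """Estimated end-to-end latency for a given brain-stage latency."""
    return other_stages_ms + brain_ms


cloud_total_low, cloud_total_high = (total_with_brain(ms) for ms in BRAIN_CLOUD_MS)
speedup_low, speedup_high = (ms / BRAIN_LOCAL_MS for ms in BRAIN_CLOUD_MS)

print(f"Non-brain stages: ~{other_stages_ms} ms")
print(f"Estimated total with cloud brain: ~{cloud_total_low}-{cloud_total_high} ms")
print(f"Brain-stage speedup: {speedup_low:.1f}x-{speedup_high:.1f}x")
```

Under that assumption the brain stage alone is what separates a sub-second pipeline from one that can exceed a second, which is consistent with the "roughly 10x" framing at the low end of the cloud range.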

Technical Details

The model is fine-tuned to emit JSON tool calls (a function name plus slots) only, with no free-text generation. Inference runs on llama.cpp, and deterministic orchestrator logic, not the model, handles multi-turn state. The base Qwen3-0.6B scores only 48.7% on the same task, which shows that fine-tuning is essential.
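The split between model and orchestrator might look something like the sketch below. It is not VoiceTeller's actual code: the tool names, slot names, templates, and the exact JSON shape (`{"function": ..., "slots": {...}}`) are all hypothetical, chosen only to illustrate a model that emits a tool call while deterministic logic tracks state and fills reply templates.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical tool schema: function name -> required slot names.
TOOLS = {
    "check_balance": ["account_type"],
    "transfer_money": ["from_account", "to_account", "amount"],
}

# Hypothetical reply templates; the orchestrator fills these, not the model.
TEMPLATES = {
    "check_balance": "Checking the balance of your {account_type} account.",
    "transfer_money": "Transferring {amount} from {from_account} to {to_account}.",
}


@dataclass
class DialogueState:
    function: Optional[str] = None
    slots: dict = field(default_factory=dict)


def step(state: DialogueState, model_output: str) -> str:
    """Merge one model output into the dialogue state and return the reply."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return "Sorry, I did not understand that."
    if not isinstance(call, dict):
        return "Sorry, I did not understand that."

    function = call.get("function", state.function)
    if function not in TOOLS:
        return "Sorry, I cannot help with that."
    if function != state.function:
        # A new intent discards slots collected for the previous one.
        state.function, state.slots = function, {}

    slots = call.get("slots")
    if isinstance(slots, dict):
        for name, value in slots.items():
            if name in TOOLS[function] and value is not None:
                state.slots[name] = value

    missing = [name for name in TOOLS[function] if name not in state.slots]
    if missing:
        # Deterministic follow-up question for the first missing slot.
        return f"What is the {missing[0].replace('_', ' ')}?"

    reply = TEMPLATES[function].format(**state.slots)
    state.function, state.slots = None, {}  # reset for the next request
    return reply
```

For example, a first turn that yields `transfer_money` with only `amount` and `from_account` set would produce a follow-up question about the missing slot, and a second turn supplying `to_account` would complete the call. Because the model never writes the reply text, a malformed or unexpected output degrades to a fixed fallback sentence rather than free-form generation.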

#small-language-model #tool-calling #voice-assistant #local-inference #voiceteller
