
665% Speedup via Speculative Decoding

๐Ÿฆ™Read original on Reddit r/LocalLLaMA

๐Ÿ’กUnlock a 665% inference speedup on local models with llama.cpp speculative-decoding tweaks

โšก 30-Second TL;DR

What Changed

Devstral Small: 665% speedup using ngram-map-k with n=24 and a draft length of 12-48

Why It Matters

Gains vary by model: Gemma saw a 2x speedup, while Qwen 3.6 managed only 40% initially, improving to 140% after parameter tweaks.

What To Do Next

Test llama.cpp speculative decoding with --spec-type ngram-map-k on your small models for code tasks.

Who should care: Developers & AI Engineers

๐Ÿง  Deep Insight

AI-generated analysis for this event.

๐Ÿ”‘ Enhanced Key Takeaways

  • โ€ขN-gram speculative decoding in llama.cpp functions by utilizing a local cache of previously generated token sequences, allowing the model to predict and verify multiple tokens simultaneously without requiring a secondary draft model.
  • โ€ขThe effectiveness of n-gram speculative decoding is highly dependent on the repetitiveness of the target task; it excels in code generation or structured data tasks where local patterns are predictable, but shows diminishing returns in creative or highly stochastic writing.
  • โ€ขThe 'ngram-map-k' strategy specifically optimizes the draft phase by mapping the most frequent n-gram sequences found in the prompt or context window, effectively turning the model's own history into a lightweight, zero-overhead draft mechanism.
๐Ÿ“Š Competitor Analysis

| Feature | N-gram Speculative Decoding (llama.cpp) | Traditional Speculative Decoding (Medusa/Eagle) | Distillation-based Draft Models |
|---|---|---|---|
| Draft model requirement | None (self-drafting) | Requires trained small model | Requires trained small model |
| Memory overhead | Minimal (n-gram cache) | Moderate (draft model weights) | Moderate (draft model weights) |
| Performance gain | High (task dependent) | Consistent (model dependent) | Consistent (model dependent) |
| Complexity | Low (parameter tuning) | High (training/fine-tuning) | Very high (distillation process) |

๐Ÿ› ๏ธ Technical Deep Dive

  • Mechanism: N-gram speculative decoding bypasses the need for a separate draft model by maintaining a hash map of n-grams (sequences of n tokens) observed in the context or previous output.
  • Drafting Process: During the draft phase, the system looks up the current token sequence in the n-gram map. If a match is found, it proposes the subsequent tokens as a draft.
  • Verification: The target model (e.g., Devstral) performs a single forward pass to verify the entire draft sequence in parallel, accepting tokens that match its own probability distribution.
  • Parameter Sensitivity: The 'ngram-size' (n) dictates the length of the pattern matching; larger values increase precision but decrease the frequency of matches, while 'ngram-map-k' controls the capacity of the cache.

๐Ÿ”ฎ Future Implications

AI analysis grounded in cited sources.

N-gram speculative decoding will become the default inference optimization for local code-completion IDE plugins.
The elimination of secondary draft model memory overhead makes it uniquely suited for resource-constrained environments like local developer workstations.
Standard speculative decoding using secondary models will see reduced adoption for general-purpose LLMs.
The zero-training, zero-weight-overhead nature of n-gram methods provides comparable speedups for many tasks without the complexity of managing draft model compatibility.

โณ Timeline

2023-02
DeepMind introduces the concept of Speculative Decoding for faster LLM inference.
2023-05
llama.cpp adds initial support for speculative decoding using draft models.
2024-03
Implementation of n-gram based speculative decoding in llama.cpp to enable draft-less acceleration.
2025-11
Refinement of ngram-map-k and ngram-mod flags in llama.cpp to improve hit rates on complex codebases.


AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA โ†—