Cleo: Fitting full analyst behavior into a 2B model

🔑 Enhanced Key Takeaways

• Cleo is built on Qwen3.5-2B, a multimodal foundation model that handles text, images, and video and supports over 200 languages, which makes it usable for analyst tasks beyond text-to-SQL.
• Qwen3.5-2B offers a 'thinking mode' that improves its reasoning on complex graduate-level tests, which suggests it could handle advanced analytical problems once fine-tuned for specialized analyst work.
• Despite having only 2B parameters, Qwen3.5-2B has a native context length of 262,144 tokens, enough to take in long documents and large data schemas, which matters for analyst tasks that need a broad view of the data.

📊 Competitor Analysis

While specific benchmarks for the fine-tuned Cleo model are not available, the performance of other small language models (SLMs) and text-to-SQL approaches on the BIRD benchmark provides context for its competitive landscape:

Model/Approach	Parameters	Execution Accuracy (BIRD Benchmark)	Key Features/Notes
Cleo (based on Qwen3.5-2B)	2B	Not specified	Open-source, unified harness, live execution evidence, resource-constrained environments.
SLM-SQL-0.5B	0.5B	61.82% (test set, Aug 2025)	Leverages supervised fine-tuning and reinforcement learning with corrective self-consistency.
SLM-SQL-1.5B	1.5B	70.49% (test set, Aug 2025)	Achieved significant improvement on BIRD dev set (31.4 points average across models).
Arctic-Text2SQL-R1	7B, 14B, 32B	State-of-the-art at every scale	Reasoning-first models from Snowflake AI Research, strong generalization and efficiency, excels on BIRD.
Gemini-SQL2 (on Gemini 3.1 Pro)	Larger (proprietary)	80.04% (June 2026)	Leads BIRD leaderboard, built on Gemini 3.1 Pro, focuses on complex, multi-table queries.
OpenAI o3-mini	Not specified	59.52% (dev set)	Out-of-box capability without chain-of-thought or fine-tuning.
Gemini 2.5 Pro	Not specified	66.82% (dev set)	Out-of-box capability without chain-of-thought or fine-tuning.
dbt Semantic Layer (with GPT-5.3 Codex)	Not specified	100% (for well-modeled queries, Apr 2026)	Uses a structured ontology, deterministic query generation, high accuracy for enterprise use.

Small language models for text-to-SQL are improving quickly: in the table above, SLM-SQL reaches 61.82% (0.5B) and 70.49% (1.5B) execution accuracy on the BIRD test set. Larger proprietary systems such as Gemini-SQL2 still lead in raw accuracy, but open-source, efficiency-focused models like Cleo (based on Qwen3.5-2B) address a different need, namely running in resource-constrained environments. The dbt Semantic Layer result shows that structured data modeling can raise accuracy substantially, even for advanced LLMs.
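For readers unfamiliar with the metric, here is a minimal sketch of how execution accuracy can be computed. It assumes a query counts as correct when it returns the same set of rows as the reference query. The `orders` table, the helper names, and the example queries are hypothetical and chosen only for illustration; this is not the official BIRD evaluation script.

```python
import sqlite3


def run_query(conn, sql):
    """Return the result rows as a set, or None if the query fails."""
    try:
        return set(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, reference) SQL pairs whose result sets match."""
    correct = 0
    for predicted_sql, reference_sql in pairs:
        expected = run_query(conn, reference_sql)
        actual = run_query(conn, predicted_sql)
        if expected is not None and actual == expected:
            correct += 1
    return correct / len(pairs) if pairs else 0.0


# Toy in-memory database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders (region, amount)
    VALUES ('EU', 10.0), ('EU', 5.0), ('US', 7.5);
    """
)

reference = "SELECT SUM(amount) FROM orders WHERE region = 'EU'"
pairs = [
    # Different SQL text, same result rows: counted as correct.
    ("SELECT SUM(amount) FROM orders WHERE region IN ('EU')", reference),
    # Wrong filter, different result: counted as incorrect.
    ("SELECT SUM(amount) FROM orders WHERE region = 'US'", reference),
    # Misspelled column, query fails to execute: counted as incorrect.
    ("SELECT SUM(amnt) FROM orders WHERE region = 'EU'", reference),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2%}")
```

The metric compares results rather than SQL text, so two differently written queries score the same as long as they return the same rows.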

🛠️ Technical Deep Dive

Qwen3.5-2B, the base model for Cleo, utilizes a hybrid architecture combining Gated Delta Networks and Gated Attention.
The model consists of 24 layers with a specific pattern: 6×(3×Gated DeltaNet → FFN → 1×Gated Attention → FFN).
It is a dense model, meaning all 2 billion parameters are active during inference, distinguishing it from larger Mixture-of-Experts (MoE) variants within the Qwen 3.5 family.
The hidden dimension of the language model is 2048, and it supports a native context length of 262,144 tokens.
Qwen3.5-2B is trained with multi-token prediction.
The model is released under the Apache 2.0 license, so the weights are openly available.
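The layer arithmetic above can be checked with a short sketch. It expands the stated pattern, treating each token mixer (Gated DeltaNet or Gated Attention) together with its FFN as one layer. The function and label names are illustrative assumptions, not identifiers from the Qwen codebase.

```python
def build_layer_pattern(num_groups=6, deltanet_per_group=3, attention_per_group=1):
    """Expand the pattern 6 x (3 x Gated DeltaNet -> FFN, 1 x Gated Attention -> FFN).

    Each entry stands for one layer: a token mixer followed by its FFN.
    """
    layers = []
    for _ in range(num_groups):
        layers += ["gated_deltanet"] * deltanet_per_group
        layers += ["gated_attention"] * attention_per_group
    return layers


layers = build_layer_pattern()
print(len(layers))                      # 24 layers in total
print(layers.count("gated_deltanet"))   # 18 Gated DeltaNet layers
print(layers.count("gated_attention"))  # 6 Gated Attention layers

# The native context length is a power of two: 2**18 = 262,144 tokens.
print(2 ** 18)
```

Under this reading, three quarters of the layers use Gated DeltaNet and every fourth layer uses Gated Attention.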

🔮 Future Implications

AI analysis grounded in cited sources

Small, specialized models like Cleo will democratize advanced data analysis.

Their ability to run in resource-constrained environments and perform complex tasks makes sophisticated analytical tools accessible to a wider range of users and organizations, reducing reliance on expensive, large-scale models.

The 'live execution evidence' approach will become a standard for robust text-to-SQL solutions.

Validating generated SQL queries through actual execution significantly reduces errors and increases reliability compared to relying solely on model likelihood, leading to more trustworthy automated data analysis.
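The idea of validating by execution can be sketched as follows. The source does not describe how Cleo implements it, so this is only an illustration under stated assumptions: the candidate queries are hard-coded stand-ins for model outputs, and the `sales` table and the `first_executable` helper are hypothetical.

```python
import sqlite3


def first_executable(conn, candidates):
    """Run candidates in order and return the first one that executes.

    Returns (sql, rows, errors). The errors list records each failed
    candidate with its database error message, as execution evidence.
    """
    errors = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            errors.append((sql, str(exc)))
            continue
        return sql, rows, errors
    return None, None, errors


# Toy in-memory database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('EU', 10.0), ('EU', 5.0), ('US', 7.5);
    """
)

# Stand-ins for model-generated SQL; the first misspells a column name.
candidates = [
    "SELECT region, SUM(amt) FROM sales GROUP BY region",
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region",
]

sql, rows, errors = first_executable(conn, candidates)
print("chosen:", sql)
print("rows:", rows)
print("rejected:", errors)
```

A query that executes can still be semantically wrong, so execution catches only one class of error. A real deployment would also run candidates over a read-only connection so that a generated statement cannot modify data.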

⏳ Timeline

2026-02

Qwen 3.5 foundation model family, including Qwen3.5-2B, released by Alibaba Cloud.

2026-03-09

Qwen/Qwen3.5-2B and Qwen/Qwen3.5-2B-Base model weights and configuration files released on Hugging Face.
