Cleo: Fitting full analyst behavior into a 2B model
๐กLearn how to optimize enterprise data analysis agents by using a tiny 2B model with a unified execution harness.
โก 30-Second TL;DR
What Changed
Built on Qwen3.5-2B-Base for resource-constrained environments.
Why It Matters
This project demonstrates that small, specialized models can outperform larger general-purpose models in specific enterprise workflows. It provides a blueprint for developers to build efficient, secure, and cost-effective data analysis agents.
What To Do Next
Clone the Cleo GitHub repository and test the SQL generation harness on your own schema to see if a 2B model meets your production accuracy requirements.
๐ง Deep Insight
Web-grounded analysis with 11 cited sources.
๐ Enhanced Key Takeaways
- โขCleo, built on Qwen3.5-2B, leverages a multimodal foundation model capable of understanding text, images, and even video, and supports over 200 languages, making it versatile for diverse analyst tasks beyond just text-to-SQL.
- โขThe Qwen3.5-2B model features a 'thinking mode' that significantly enhances its reasoning capabilities on complex graduate-level tests, suggesting its potential for advanced analytical problem-solving when fine-tuned for specialized tasks like those performed by an analyst.
- โขDespite its compact 2B parameter size, Qwen3.5-2B boasts a native context length of 262,144 tokens, enabling it to process extensive documents and complex data schemas, which is crucial for comprehensive analyst tasks requiring broad data understanding.
๐ Competitor Analysisโธ Show
While specific benchmarks for the fine-tuned Cleo model are not available, the performance of other small language models (SLMs) and text-to-SQL approaches on the BIRD benchmark provides context for its competitive landscape:
| Model/Approach | Parameters | Execution Accuracy (BIRD Benchmark) | Key Features/Notes |
|---|---|---|---|
| Cleo (based on Qwen3.5-2B) | 2B | Not specified | Open-source, unified harness, live execution evidence, resource-constrained environments. |
| SLM-SQL-0.5B | 0.5B | 61.82% (test set, Aug 2025) | Leverages supervised fine-tuning and reinforcement learning with corrective self-consistency. |
| SLM-SQL-1.5B | 1.5B | 70.49% (test set, Aug 2025) | Achieved significant improvement on BIRD dev set (31.4 points average across models). |
| Arctic-Text2SQL-R1 | 7B, 14B, 32B | State-of-the-art at every scale | Reasoning-first models from Snowflake AI Research, strong generalization and efficiency, excels on BIRD. |
| Gemini-SQL2 (on Gemini 3.1 Pro) | Larger (proprietary) | 80.04% (June 2026) | Leads BIRD leaderboard, built on Gemini 3.1 Pro, focuses on complex, multi-table queries. |
| OpenAI o3-mini | Not specified | 59.52% (dev set) | Out-of-box capability without chain-of-thought or fine-tuning. |
| Gemini 2.5 Pro | Not specified | 66.82% (dev set) | Out-of-box capability without chain-of-thought or fine-tuning. |
| dbt Semantic Layer (with GPT-5.3 Codex) | Not specified | 100% (for well-modeled queries, Apr 2026) | Uses a structured ontology, deterministic query generation, high accuracy for enterprise use. |
Small language models for text-to-SQL are rapidly improving, with models like SLM-SQL demonstrating competitive execution accuracy on benchmarks like BIRD. While larger, proprietary models like Gemini-SQL2 currently lead in raw accuracy, the focus on efficiency and open-source availability for models like Cleo (based on Qwen3.5-2B) addresses different market needs, particularly for resource-constrained environments. The dbt Semantic Layer approach highlights that structured data modeling can significantly boost accuracy, even for advanced LLMs.
๐ ๏ธ Technical Deep Dive
- Qwen3.5-2B, the base model for Cleo, utilizes a hybrid architecture combining Gated Delta Networks and Gated Attention.
- The model consists of 24 layers with a specific pattern: 6ร(3รGated DeltaNet โ FFN โ 1รGated Attention โ FFN).
- It is a dense model, meaning all 2 billion parameters are active during inference, distinguishing it from larger Mixture-of-Experts (MoE) variants within the Qwen 3.5 family.
- The hidden dimension of the language model is 2048, and it supports a native context length of 262,144 tokens.
- Qwen3.5-2B is designed with multi-token prediction training.
- The model is released under the Apache 2.0 license, ensuring its open-source nature.
๐ฎ Future ImplicationsAI analysis grounded in cited sources
โณ Timeline
๐ Sources (11)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
Weekly AI Recap
Read this week's curated digest of top AI events โ
๐Related Updates
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning โ