๐Ÿค–Recentcollected in 39m

Cleo: Fitting full analyst behavior into a 2B model

PostLinkedIn
๐Ÿค–Read original on Reddit r/MachineLearning

๐Ÿ’กLearn how to optimize enterprise data analysis agents by using a tiny 2B model with a unified execution harness.

โšก 30-Second TL;DR

What Changed

Built on Qwen3.5-2B-Base for resource-constrained environments.

Why It Matters

This project demonstrates that small, specialized models can outperform larger general-purpose models in specific enterprise workflows. It provides a blueprint for developers to build efficient, secure, and cost-effective data analysis agents.

What To Do Next

Clone the Cleo GitHub repository and test the SQL generation harness on your own schema to see if a 2B model meets your production accuracy requirements.

Who should care:Developers & AI Engineers

๐Ÿง  Deep Insight

Web-grounded analysis with 11 cited sources.

๐Ÿ”‘ Enhanced Key Takeaways

  • โ€ขCleo, built on Qwen3.5-2B, leverages a multimodal foundation model capable of understanding text, images, and even video, and supports over 200 languages, making it versatile for diverse analyst tasks beyond just text-to-SQL.
  • โ€ขThe Qwen3.5-2B model features a 'thinking mode' that significantly enhances its reasoning capabilities on complex graduate-level tests, suggesting its potential for advanced analytical problem-solving when fine-tuned for specialized tasks like those performed by an analyst.
  • โ€ขDespite its compact 2B parameter size, Qwen3.5-2B boasts a native context length of 262,144 tokens, enabling it to process extensive documents and complex data schemas, which is crucial for comprehensive analyst tasks requiring broad data understanding.
๐Ÿ“Š Competitor Analysisโ–ธ Show

While specific benchmarks for the fine-tuned Cleo model are not available, the performance of other small language models (SLMs) and text-to-SQL approaches on the BIRD benchmark provides context for its competitive landscape:

Model/ApproachParametersExecution Accuracy (BIRD Benchmark)Key Features/Notes
Cleo (based on Qwen3.5-2B)2BNot specifiedOpen-source, unified harness, live execution evidence, resource-constrained environments.
SLM-SQL-0.5B0.5B61.82% (test set, Aug 2025)Leverages supervised fine-tuning and reinforcement learning with corrective self-consistency.
SLM-SQL-1.5B1.5B70.49% (test set, Aug 2025)Achieved significant improvement on BIRD dev set (31.4 points average across models).
Arctic-Text2SQL-R17B, 14B, 32BState-of-the-art at every scaleReasoning-first models from Snowflake AI Research, strong generalization and efficiency, excels on BIRD.
Gemini-SQL2 (on Gemini 3.1 Pro)Larger (proprietary)80.04% (June 2026)Leads BIRD leaderboard, built on Gemini 3.1 Pro, focuses on complex, multi-table queries.
OpenAI o3-miniNot specified59.52% (dev set)Out-of-box capability without chain-of-thought or fine-tuning.
Gemini 2.5 ProNot specified66.82% (dev set)Out-of-box capability without chain-of-thought or fine-tuning.
dbt Semantic Layer (with GPT-5.3 Codex)Not specified100% (for well-modeled queries, Apr 2026)Uses a structured ontology, deterministic query generation, high accuracy for enterprise use.

Small language models for text-to-SQL are rapidly improving, with models like SLM-SQL demonstrating competitive execution accuracy on benchmarks like BIRD. While larger, proprietary models like Gemini-SQL2 currently lead in raw accuracy, the focus on efficiency and open-source availability for models like Cleo (based on Qwen3.5-2B) addresses different market needs, particularly for resource-constrained environments. The dbt Semantic Layer approach highlights that structured data modeling can significantly boost accuracy, even for advanced LLMs.

๐Ÿ› ๏ธ Technical Deep Dive

  • Qwen3.5-2B, the base model for Cleo, utilizes a hybrid architecture combining Gated Delta Networks and Gated Attention.
  • The model consists of 24 layers with a specific pattern: 6ร—(3ร—Gated DeltaNet โ†’ FFN โ†’ 1ร—Gated Attention โ†’ FFN).
  • It is a dense model, meaning all 2 billion parameters are active during inference, distinguishing it from larger Mixture-of-Experts (MoE) variants within the Qwen 3.5 family.
  • The hidden dimension of the language model is 2048, and it supports a native context length of 262,144 tokens.
  • Qwen3.5-2B is designed with multi-token prediction training.
  • The model is released under the Apache 2.0 license, ensuring its open-source nature.

๐Ÿ”ฎ Future ImplicationsAI analysis grounded in cited sources

Small, specialized models like Cleo will democratize advanced data analysis.
Their ability to run in resource-constrained environments and perform complex tasks makes sophisticated analytical tools accessible to a wider range of users and organizations, reducing reliance on expensive, large-scale models.
The 'live execution evidence' approach will become a standard for robust text-to-SQL solutions.
Validating generated SQL queries through actual execution significantly reduces errors and increases reliability compared to relying solely on model likelihood, leading to more trustworthy automated data analysis.

โณ Timeline

2026-02
Qwen 3.5 foundation model family, including Qwen3.5-2B, released by Alibaba Cloud.
2026-03-09
Qwen/Qwen3.5-2B and Qwen/Qwen3.5-2B-Base model weights and configuration files released on Hugging Face.

๐Ÿ“Ž Sources (11)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.

  1. apxml.com
  2. youtube.com
  3. ollama.com
  4. artificialanalysis.ai
  5. huggingface.co
  6. lmstudio.ai
  7. github.com
  8. snowflake.com
  9. the-decoder.com
  10. arxiv.org
  11. getdbt.com
๐Ÿ“ฐ

Weekly AI Recap

Read this week's curated digest of top AI events โ†’

๐Ÿ‘‰Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning โ†—