Workers AI Adds Kimi K2.5 Support

💡 Run Kimi K2.5 serverlessly on Cloudflare: optimized inference and lower costs for agents!
⚡ 30-Second TL;DR
What Changed
Workers AI now runs large models such as Kimi K2.5, Moonshot AI's 1-trillion-parameter MoE model.
Why It Matters
This update makes serverless AI agent deployment more accessible and cost-effective on Cloudflare, potentially attracting more developers to its edge platform. It positions Workers AI as a competitive option for large-model inference without traditional cloud overhead.
What To Do Next
Test deploying Kimi K2.5 on Workers AI via the Cloudflare dashboard for your next agent prototype.
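For a quick prototype outside the dashboard, Workers AI models can also be invoked over Cloudflare's REST `/ai/run` endpoint. A minimal sketch follows; the Kimi K2.5 model slug used here is a placeholder assumption, so check the Workers AI model catalog for the real identifier before calling it.

```python
# Sketch: calling a Workers AI model over Cloudflare's REST /ai/run route.
# The Kimi K2.5 slug below is a PLACEHOLDER -- verify the actual model
# identifier in the Workers AI model catalog.
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4/accounts"
MODEL = "@cf/moonshotai/kimi-k2.5"  # placeholder slug, not confirmed

def build_request(account_id: str, api_token: str, prompt: str) -> urllib.request.Request:
    """Assemble the POST request for a single chat-style inference call."""
    url = f"{API_BASE}/{account_id}/ai/run/{MODEL}"
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request("YOUR_ACCOUNT_ID", "YOUR_API_TOKEN",
                    "Summarize MoE routing in one sentence.")
print(req.full_url)
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) requires a valid account ID and API token with Workers AI permissions.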
🧠 Deep Insight
Web-grounded analysis with 9 cited sources.
🔑 Enhanced Key Takeaways
- Kimi K2.5 is an open-source model from Moonshot AI with a 1-trillion-parameter Mixture-of-Experts (MoE) architecture, activating only 32 billion parameters per token for efficiency[1][2][3].
- Features native multimodality with the MoonViT vision encoder, supporting image, video, PDF, and text inputs up to a 256K-token context window[1][2][5].
- Introduces Agent Swarm technology for self-directed, coordinated multi-agent execution, decomposing complex tasks into parallel sub-tasks[1][2][3].
- Achieves top benchmarks including 96.1% on AIME 2025 math, 76.8% on SWE-Bench Verified coding, and 2x inference speed via native INT4 quantization[2][5].
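The Agent Swarm idea described above, decomposing a complex task into sub-tasks that run in parallel, can be sketched as a fan-out/gather pattern. This is a toy illustration of the pattern only, not Moonshot's implementation; `solve_subtask` is a hypothetical stand-in for a per-agent model call.

```python
# Toy sketch of the decompose-and-parallelize pattern behind "Agent Swarm":
# a task is split into sub-tasks, each handled concurrently by its own
# "agent", and the results are gathered. NOT Moonshot's actual system.
import asyncio

async def solve_subtask(name: str) -> str:
    """Stand-in for one agent working a sub-task (e.g. a model call)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"{name}: done"

async def run_swarm(subtasks: list[str]) -> list[str]:
    # Fan out: launch every sub-task concurrently, then gather in order.
    return list(await asyncio.gather(*(solve_subtask(s) for s in subtasks)))

results = asyncio.run(run_swarm(["plan", "code", "test"]))
print(results)  # → ['plan: done', 'code: done', 'test: done']
```

`asyncio.gather` preserves submission order, so merged results line up with the original decomposition.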
🛠️ Technical Deep Dive
- Mixture-of-Experts (MoE) architecture: 1T total parameters, 32B activated, 384 experts (8 selected per token), 1 shared expert, 61 layers[1][2].
- Multi-head Latent Attention (MLA) with 64 heads, 7168 hidden dimension, SwiGLU activation, 160K vocabulary[1][2].
- MoonViT vision encoder (400M parameters) for native vision-language integration; supports RGB images, videos (experimental), and PDFs via spatial-temporal pooling[1][2][3].
- Quantization-Aware Training (QAT) for INT4 quantization, enabling 2x speed over FP16 without accuracy loss; optimized for NVIDIA GPUs[1][2][3].
- Hardware requirements: the full model is ~630GB and needs 4x H200 GPUs or >240GB unified memory for 10+ tokens/s; a B200 achieves >40 tokens/s[6].
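A quick back-of-envelope check makes the figures above concrete: at INT4 (0.5 bytes/parameter) the raw weights of a 1T-parameter model come to roughly 500GB (the cited ~630GB presumably adds KV cache, activations, and runtime overhead), versus ~2TB at FP16, and MoE routing touches only about 3.2% of the weights per token.

```python
# Back-of-envelope arithmetic for the deep-dive numbers: weight storage
# at INT4 vs FP16, and the fraction of parameters active per token.
TOTAL_PARAMS = 1_000_000_000_000   # 1T total parameters
ACTIVE_PARAMS = 32_000_000_000     # 32B activated per token

def weights_gb(params: int, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (using 1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

int4_gb = weights_gb(TOTAL_PARAMS, 0.5)   # INT4: 0.5 bytes/param -> ~500 GB
fp16_gb = weights_gb(TOTAL_PARAMS, 2.0)   # FP16: 2 bytes/param  -> ~2000 GB
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # 0.032, i.e. 3.2% per token
print(int4_gb, fp16_gb, active_fraction)
```

The 4x gap between FP16 and INT4 weight storage is what moves a model of this size from "multi-node" into "single 4x H200 node" territory.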
📎 Sources (9)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
- build.nvidia.com — Model card
- together.ai — Kimi K2.5
- codecademy.com — Kimi K2.5: Complete Guide to Moonshot's AI Model
- artificialanalysis.ai — Kimi K2.5
- Hugging Face — Kimi K2
- unsloth.ai — Kimi K2
- kimi.com — Kimi K2.5
- platform.moonshot.ai — Kimi K2.5 Quickstart
- chatlyai.app — Kimi K2.5: Features and Benchmarks
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Cloudflare Blog ↗