
SenseTime Ditches VAE/VE in Multimodal Overhaul


💡SenseTime's 2B multimodal model kills VAE/VE overhead: a new paradigm for efficient models?

⚡ 30-Second TL;DR

What Changed

NEO eliminates the vision encoder (VE) and VAE, processing visual input directly alongside text.

Why It Matters

This breakthrough enables more efficient smaller-scale multimodal models, potentially lowering training costs and accelerating inference for real-world applications.

What To Do Next

Read 量子位's coverage and SenseTime's technical report (arxiv.org/abs/2511.13719) for architecture diagrams and benchmark data.

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • NEO architecture, open-sourced in collaboration with Nanyang Technological University's S-Lab, innovates in attention mechanisms, positional encoding, and semantic mapping for native vision-language fusion[1][2][4][7] (a minimal illustration follows this list).
  • NEO-based models at 2B and 9B sizes achieve top performance with one-tenth the training data via Cross-View Prediction training, surpassing GPT-5 and Gemini-3 Pro in spatial intelligence[1][3][4].
  • SenseNova V6.5 ranked No.1 in China for multimodal tasks including facial recognition, 3D object recognition, and medical image analysis in 2025 evaluations[1].
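
To make "no VE/VAE" concrete, here is a minimal PyTorch sketch of native fusion: image patches are linearly projected straight into the same token stream as the text embeddings and attended over jointly, with no pretrained vision encoder or VAE in front. Every module name, layer count, and size here is an illustrative assumption, not NEO's published design (positional encoding is omitted for brevity).

```python
import torch
import torch.nn as nn

class NativeFusionBlock(nn.Module):
    """Toy 'native' VLM block: pixels enter the token stream directly.

    All sizes are hypothetical; NEO's real architecture differs.
    """
    def __init__(self, d_model=512, n_heads=8, patch=16, vocab=32000):
        super().__init__()
        self.patch = patch
        self.text_embed = nn.Embedding(vocab, d_model)
        # Pixels -> tokens via one linear layer; this is what replaces the
        # separate vision encoder (VE) / VAE stage.
        self.patch_proj = nn.Linear(patch * patch * 3, d_model)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=4,
        )

    def forward(self, text_ids, image):
        b, p = image.shape[0], self.patch
        # Cut the image into non-overlapping p x p patches: (B, N, p*p*3).
        patches = image.unfold(2, p, p).unfold(3, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, p * p * 3)
        vis_tokens = self.patch_proj(patches)
        txt_tokens = self.text_embed(text_ids)
        # Joint self-attention over one interleaved sequence: the "native"
        # fusion happens inside the backbone, not in a bolted-on encoder.
        return self.backbone(torch.cat([vis_tokens, txt_tokens], dim=1))

model = NativeFusionBlock()
out = model(torch.randint(0, 32000, (1, 8)), torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 24, 512]): 16 visual + 8 text tokens
```

The point of the sketch is the data path: visual tokens never pass through a frozen ViT or a VAE latent space before meeting the language tokens.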

🛠️ Technical Deep Dive

  • NEO uses Pre-Buffer & Post-LLM dual-stage integration to preserve the LLM's language reasoning while building visual capabilities, avoiding degradation during cross-modal training[2][4] (see the sketch after this list).
  • Innovations include unified processing of visual and language data at core architecture level, enabling end-to-end native integration for robotics, video understanding, and 3D interaction[2][4][7].
  • SenseNova-SI leverages spatial capability classification and large-scale diverse data, proving scaling laws in six dimensions: metric measurement, mental reconstruction, spatial relationships, perspective-taking, deformation/assembling, and reasoning[3].
  • Open-sourced NEO models come in 2B and 9B parameter sizes, with a technical report at arxiv.org/abs/2511.13719 detailing the spatial intelligence advancements[3][4][7].
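
The dual-stage split in the first bullet can be pictured as two stacked transformer stages: a trainable "pre-buffer" that learns visual representations, followed by a "post-LLM" stage carrying pretrained language weights that are kept frozen so cross-modal training cannot erode them. The layer counts, the plain encoder stacks, and the all-or-nothing freezing policy below are assumptions for illustration only; the technical report describes the actual configuration.

```python
import torch
import torch.nn as nn

def make_stack(n_layers, d_model=512, n_heads=8):
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
        num_layers=n_layers,
    )

class PreBufferPostLLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 1: trained on mixed visual/text tokens to build visual skill.
        self.pre_buffer = make_stack(2)
        # Stage 2: in a real system these weights would be loaded from a
        # pretrained LLM; frozen here to preserve language reasoning.
        self.post_llm = make_stack(4)
        for p in self.post_llm.parameters():
            p.requires_grad = False

    def forward(self, tokens):
        return self.post_llm(self.pre_buffer(tokens))

model = PreBufferPostLLM()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} params")
out = model(torch.randn(1, 24, 512))  # e.g. the fused tokens from above
```

Freezing the language stage is one simple way to get the "no degradation" property the bullet describes; a production system might instead use partial unfreezing or low-rank adapters.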

🔮 Future Implications

AI analysis grounded in cited sources.

NEO will cut multimodal training costs to roughly one-tenth of industry norms
It achieves performance comparable to peers using only one-tenth the data volume through Cross-View Prediction and native fusion[1][4] (a toy sketch of such an objective follows this section).
Native architectures like NEO enable edge deployment of advanced multimodal AI
The efficient 2B/9B models support a shift from cloud to edge devices for robotics and intelligent terminals[4][7].
SenseNova-SI benchmarks will standardize spatial intelligence evaluation
Open-sourced EASI platform and leaderboard unify standards for open/closed-source models in academia and industry[3].
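
As a toy illustration of how a cross-view objective can extract more training signal per image (and thus cut data needs), the sketch below encodes two augmented views of the same input and trains a predictor to map one view's features onto the other's. The two-view setup, the MSE target, and the stop-gradient are generic self-supervised-learning conventions assumed for this demo, not the objective actually defined in SenseTime's report (arxiv.org/abs/2511.13719).

```python
import torch
import torch.nn as nn

# Tiny encoder + predictor; sizes are arbitrary for the demo.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                        nn.Linear(256, 128))
predictor = nn.Linear(128, 128)  # maps view-A features to view-B features
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

for step in range(3):
    image = torch.randn(8, 3, 32, 32)
    # Stand-ins for two views of the same scene (a real pipeline would use
    # crops or camera shifts, which is where spatial signal comes from).
    view_a = image + 0.1 * torch.randn_like(image)
    view_b = image + 0.1 * torch.randn_like(image)
    z_a, z_b = encoder(view_a), encoder(view_b)
    # Predict view B's representation from view A's; detach the target so
    # gradients flow only through the predicted branch.
    loss = nn.functional.mse_loss(predictor(z_a), z_b.detach())
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"step {step}: loss {loss.item():.4f}")
```

Each image yields supervision without labels, which is consistent with the digest's claim of reaching comparable performance on one-tenth the data.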

Timeline

2024-07
SenseTime breaks through on native multimodal fusion training, winning SuperCLUE and OpenCompass championships with a single model
2025-07
Releases SenseNova 6.5 with early encoder-level fusion, tripling the cost-performance ratio for commercial text-image reasoning
2025-12
Open-sources the NEO architecture with Nanyang Technological University as the world's first scalable native VLM
2025-12
Launches 2B and 9B NEO-based models, redefining multimodal efficiency boundaries
2026-01
Open-sources SenseNova-SI series, outperforming GPT-5 and Gemini-3 Pro in spatial intelligence benchmarks

AI-curated news aggregator. All content rights belong to original publishers.
Original source: 量子位