
Zhuoyu Launches Physical AI Multimodal Model


💡Physical AI's survival shift: Zhuoyu pairs a native multimodal model with new business models to scale autonomous driving and robotics.

⚡ 30-Second TL;DR

What Changed

A native multimodal model pre-trains on vision, audio, and actions jointly, without translating them through language as an intermediate representation.

Why It Matters

This signals a paradigm shift in autonomous driving from expert models to scalable foundation models, potentially standardizing physical AI across mobility platforms. Zhuoyu's distribution strategies could accelerate adoption in L4 robotics, challenging incumbents.

What To Do Next

Integrate Zhuoyu's mobile AI SDK into your robotics prototype for quick physical AI testing.

Who should care: Developers & AI Engineers

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

  • Zhuoyu's VLA 2.0 architecture utilizes a proprietary 'Action-Tokenization' layer that maps continuous motor control signals directly into the latent space of the multimodal transformer, bypassing traditional intermediate symbolic logic.
  • The company has secured strategic partnerships with three Tier-1 automotive suppliers to integrate the VLA 2.0 SDK directly into vehicle Electronic Control Units (ECUs) by Q3 2026.
  • Zhuoyu is positioning its 'action token' pricing model as a direct challenge to traditional per-mile licensing, aiming to capture revenue from edge-case interventions in autonomous driving scenarios.
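To make the contrast between per-mile licensing and action-token pricing concrete, here is a minimal sketch. All rates, prices, and token counts are illustrative assumptions, not Zhuoyu's actual pricing:

```python
# Hypothetical comparison of two monetization models for autonomous
# driving software. Figures below are illustrative assumptions only.

def per_mile_cost(miles: float, rate_per_mile: float) -> float:
    """Traditional licensing: a flat fee proportional to distance driven."""
    return miles * rate_per_mile

def action_token_cost(tokens_consumed: int, price_per_token: float) -> float:
    """Usage-based pricing: pay per action token the model emits,
    e.g. during edge-case interventions."""
    return tokens_consumed * price_per_token

# Example: 1,000 miles of mostly uneventful driving.
baseline = per_mile_cost(1_000, rate_per_mile=0.05)        # flat fee
# Assume interventions consumed 20,000 action tokens in total.
usage = action_token_cost(20_000, price_per_token=0.001)   # usage fee

print(f"per-mile: ${baseline:.2f}, action-token: ${usage:.2f}")
```

Under this model, revenue scales with how often the system actually acts, which is why the article frames it as aligning price with operational utility rather than with distance or seats.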
📊 Competitor Analysis

| Feature | Zhuoyu VLA 2.0 | Tesla FSD v13 | Waymo Driver |
| --- | --- | --- | --- |
| Architecture | Native Multimodal VLA | End-to-End Neural Net | Hybrid Modular/Neural |
| Primary Input | Vision/Audio/Action Fusion | Vision-Centric | Multi-Sensor Fusion |
| Business Model | Action Tokens/SDK | Hardware-Bundled | Fleet-as-a-Service |
| Zero-Shot Capability | ~70% | Proprietary | Proprietary |

🛠️ Technical Deep Dive

  • Model Architecture: Employs a transformer-based backbone with cross-attention mechanisms specifically tuned for temporal alignment between high-frequency sensor data and low-frequency action commands.
  • Action Tokenization: Converts continuous control inputs (steering angle, throttle, brake) into discrete tokens, allowing the model to treat physical movement as a sequence generation task similar to language modeling.
  • Training Paradigm: Utilizes a curriculum learning approach where the model is first trained on internet-scale first-person video to learn spatial reasoning, followed by fine-tuning on high-fidelity vehicle/robot telemetry data.
  • Inference Optimization: The SDK includes a custom quantization engine designed to run on NPU-accelerated automotive SoCs, reducing latency for real-time physical feedback loops.
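The action-tokenization step described above can be sketched as a simple binning scheme: continuous control signals are mapped to discrete token ids so a transformer can model them as a sequence. The bin counts, value ranges, and per-channel id offsets below are illustrative assumptions, not Zhuoyu's actual scheme:

```python
# Minimal sketch of action tokenization: continuous controls (steering,
# throttle, brake) are quantized into discrete token ids. Bin counts and
# value ranges are illustrative assumptions.

def tokenize(value: float, lo: float, hi: float, n_bins: int, offset: int) -> int:
    """Map a continuous value in [lo, hi] to one of n_bins token ids,
    shifted by `offset` so each channel owns a disjoint id range."""
    clamped = min(max(value, lo), hi)
    bin_idx = min(int((clamped - lo) / (hi - lo) * n_bins), n_bins - 1)
    return offset + bin_idx

def encode_action(steering: float, throttle: float, brake: float) -> list[int]:
    """One control frame -> three tokens, one per channel."""
    return [
        tokenize(steering, -1.0, 1.0, 256, offset=0),    # steering angle
        tokenize(throttle,  0.0, 1.0, 128, offset=256),  # throttle
        tokenize(brake,     0.0, 1.0, 128, offset=384),  # brake
    ]

# A gentle right turn with light throttle:
print(encode_action(steering=0.25, throttle=0.3, brake=0.0))
```

Once actions live in a discrete vocabulary like this, predicting the next control frame becomes next-token generation, which is what lets the same transformer machinery used for language be reused for physical movement.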

🔮 Future Implications

AI analysis grounded in cited sources.

Zhuoyu will shift from a hardware-agnostic software provider to a dominant middleware layer for L4 autonomous systems by 2027.
The adoption of an SDK-based distribution model allows them to bypass hardware manufacturing constraints and scale across multiple OEM platforms simultaneously.
The 'action token' pricing model could trigger an industry-wide shift in how AI-driven physical systems are monetized.
By charging for specific physical outcomes rather than software seats, Zhuoyu aligns its revenue directly with the operational utility of the autonomous system.

Timeline

2024-03
Zhuoyu Technology founded with a focus on embodied AI research.
2025-01
Release of VLA 1.0, establishing the initial framework for vision-language-action integration.
2025-09
Completion of Series B funding round to accelerate development of multimodal foundation models.
2026-05
Official launch of VLA 2.0 and the action-token business model at the Beijing Auto Show.

AI-curated news aggregator. All content rights belong to original publishers.
Original source: 36氪
