Language-Action Pre-training (LAP) represents robot actions in natural language, enabling zero-shot transfer across embodiments without fine-tuning. LAP-3B, a 3-billion-parameter vision-language-action (VLA) model, achieves over 50% success on novel robots and tasks. The approach enables efficient adaptation and unifies action prediction with visual question answering (VQA).
Key Points
- No action tokenizer, manual annotation, or embodiment-specific design needed (see the sketch after this list)
- Aligns actions with vision-language model output distributions
- 2x improvement over prior VLAs in zero-shot success rate
- Supports co-training for further performance gains
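As a rough illustration of the first two points, the sketch below renders a low-level action as plain English so a vision-language model can emit it as ordinary text tokens. The phrasing and the 7-DoF action parameterization are illustrative assumptions, not LAP's published format.

```python
# Hypothetical sketch: a low-level action (end-effector delta pose plus
# gripper command) rendered as natural-language text, so a VLM can
# predict it with its ordinary token vocabulary. The exact wording LAP
# uses is an assumption here.

def action_to_language(dx: float, dy: float, dz: float,
                       roll: float, pitch: float, yaw: float,
                       gripper_open: bool) -> str:
    """Render a 7-DoF action as a natural-language string."""
    grip = "open the gripper" if gripper_open else "close the gripper"
    return (
        f"move the arm by x {dx:+.3f} m, y {dy:+.3f} m, z {dz:+.3f} m, "
        f"rotate by roll {roll:+.2f} rad, pitch {pitch:+.2f} rad, "
        f"yaw {yaw:+.2f} rad, and {grip}"
    )

print(action_to_language(0.05, -0.02, 0.0, 0.0, 0.1, 0.0, gripper_open=False))
```

Because the target is ordinary text, no separate action tokenizer or embodiment-specific output head is required, which is what makes the representation portable across robots.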
Impact Analysis
Pushes toward generalist robot policies deployable on unseen hardware, and accelerates real-world robot deployment by reducing adaptation costs.
Technical Details
Encodes low-level actions directly as natural-language text, pre-trains on multi-embodiment robot data, and scales favorably thanks to the unified language-action format (a minimal sketch follows).
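To make the unified format concrete, here is a minimal hypothetical sketch of how action-prediction and VQA examples could share a single (prompt, target) text schema, so one next-token objective covers both. The field names and prompt templates are illustrative assumptions, not LAP's actual data schema.

```python
# Hypothetical sketch of a unified language-action training format:
# both VQA pairs and action-prediction pairs become (prompt, target)
# text tied to an observation frame, so a single next-token-prediction
# objective trains both capabilities.

from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str   # observation frame the model conditions on
    prompt: str       # instruction or question, as text
    target: str       # answer or language-encoded action, as text

def make_action_sample(image_path: str, instruction: str,
                       action_text: str) -> Sample:
    # Action prediction posed as a text-generation task.
    return Sample(image_path,
                  f"Instruction: {instruction} What action should the robot take?",
                  action_text)

def make_vqa_sample(image_path: str, question: str, answer: str) -> Sample:
    # VQA posed in the same (prompt, target) schema.
    return Sample(image_path, f"Question: {question}", answer)

batch = [
    make_action_sample("frame_000.png", "pick up the red block",
                       "move the arm by x +0.050 m, y -0.020 m, z +0.000 m, "
                       "and close the gripper"),
    make_vqa_sample("frame_000.png", "What color is the block?", "red"),
]
for s in batch:
    print(s.prompt, "->", s.target)
```

Casting both task types into one text-to-text schema is what allows co-training and lets action data benefit from the pre-trained vision-language distribution.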