MiRA Supercharges Open LLM Agents Past GPT-4

💡 Open model hits 43% SR on WebArena-Lite, beating GPT-4o 3x: new SOTA for agents!
⚡ 30-Second TL;DR
What Changed
MiRA's dynamic subgoal decomposition enables adaptive online planning, even boosting proprietary LLMs like Gemini by +10% SR.
Why It Matters
Open models now rival or exceed proprietary ones on agent benchmarks, democratizing advanced autonomy and enabling scalable RL in real-world digital environments such as browsers and operating systems.
What To Do Next
Reproduce MiRA on Gemma3-12B using WebArena-Lite to fine-tune your own web agents (a minimal model-loading sketch follows this TL;DR).
Who should care: Researchers & Academics
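As a concrete first step, the sketch below loads an open Gemma3-12B checkpoint with Hugging Face `transformers` as the policy backbone; the checkpoint id and dtype are assumptions to verify on the Hub, not settings taken from the paper.

```python
# Assumed starting point for reproducing the recipe; confirm the exact
# Gemma3-12B checkpoint id on the Hugging Face Hub before running.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-3-12b-it"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
policy = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # halves memory vs. fp32 for a 12B model
    device_map="auto",           # shard across available GPUs
)
```

From here, the two-stage recipe described in the deep dive below (SFT on successful trajectories, then RWR) would be applied to this policy model.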
🧠 Deep Insight
AI-generated analysis for this event.
🔑 Enhanced Key Takeaways
- MiRA uses a novel Reward-Weighted Regression (RWR) variant that specifically addresses the high variance typically associated with RL fine-tuning in web-based environments (a minimal sketch of the weighting scheme follows this list).
- The framework incorporates a Dynamic Subgoal Re-planning mechanism that triggers automatically when the agent detects a deviation from the expected DOM (Document Object Model) state, reducing cumulative error.
- Unlike previous WebRL approaches that rely heavily on offline trajectory datasets, MiRA demonstrates significant sample efficiency by leveraging a hybrid training loop that combines synthetic trajectory generation with real-time environment feedback.
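To ground the first takeaway, here is a minimal sketch of an RWR-style update, assuming the standard exponential weighting of trajectory returns; the `beta` temperature and the mean-normalization are common variance-taming choices, not details confirmed from the paper.

```python
import torch
import torch.nn.functional as F

def rwr_loss(logits: torch.Tensor,    # (batch, num_actions) action logits
             actions: torch.Tensor,   # (batch,) ids of the actions taken
             returns: torch.Tensor,   # (batch,) trajectory returns
             beta: float = 1.0) -> torch.Tensor:
    """Reward-Weighted Regression: cross-entropy on the taken actions,
    weighted by exp(return / beta) so high-return behavior dominates
    the policy update."""
    weights = torch.exp(returns / beta)
    weights = weights / (weights.mean() + 1e-8)  # normalize weight scale
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return -(weights.detach() * chosen).mean()
```

Intuitively, `beta` interpolates between plain behavior cloning (large `beta`, near-uniform weights) and imitating only the highest-return trajectories (small `beta`).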
📊 Competitor Analysis
| Feature | MiRA (Gemma3-12B) | WebRL (SOTA) | GPT-4o (Agentic) |
|---|---|---|---|
| Planning Strategy | Dynamic Subgoal Decomposition | Static/Heuristic | Prompt-based (CoT) |
| RL Approach | Milestone-based RWR | PPO-based | None (In-context) |
| WebArena-Lite SR | 43% | 38.4% | 13.9% |
| Training Cost | Moderate (Fine-tuning) | High (Full RL) | N/A (Proprietary) |
🛠️ Technical Deep Dive
- Architecture: a dual-tower structure in which a lightweight Planner module generates subgoals and a Policy module (Gemma3-12B) executes actions.
- Reward Function: a dense reward signal derived from DOM-tree distance metrics and successful completion of intermediate HTML-element interactions.
- Execution Drift Mitigation: a State-Consistency Check compares the current browser state against the state predicted by the subgoal planner; if the divergence exceeds a threshold, the agent forces a re-plan (see the sketch after this list).
- Training Methodology: a two-stage process of (1) supervised fine-tuning on successful trajectories, followed by (2) Reward-Weighted Regression (RWR) to optimize for milestone completion.
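To make the drift-mitigation loop concrete, here is a minimal sketch of how a state-consistency check could gate re-planning. The `planner`, `policy`, and `env` interfaces, the Jaccard-style `dom_divergence` metric, and the `replan_threshold` value are all illustrative assumptions; the paper's actual DOM-tree distance and threshold are not specified here.

```python
from dataclasses import dataclass

@dataclass
class Subgoal:
    description: str
    expected_dom: set[str]  # element signatures the planner expects to see

def dom_divergence(expected: set[str], observed: set[str]) -> float:
    """Jaccard distance between expected and observed DOM element sets
    (an illustrative stand-in for the paper's DOM-tree distance)."""
    if not expected and not observed:
        return 0.0
    return 1.0 - len(expected & observed) / len(expected | observed)

def run_episode(planner, policy, env, max_steps=50, replan_threshold=0.5):
    """Execute one episode, forcing a re-plan whenever the observed
    browser state drifts too far from the planner's prediction."""
    obs = env.reset()
    subgoal = planner.next_subgoal(obs)
    for _ in range(max_steps):
        action = policy.act(obs, subgoal)
        obs, done = env.step(action)
        # State-Consistency Check: compare predicted vs. observed DOM.
        divergence = dom_divergence(subgoal.expected_dom,
                                    env.dom_elements(obs))
        if divergence > replan_threshold:
            subgoal = planner.next_subgoal(obs)  # forced re-plan
        if done:
            break
```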
🔮 Future Implications
AI analysis grounded in cited sources.
- **Small-scale LLMs will become the standard for autonomous web agents.** The success of MiRA demonstrates that specialized RL fine-tuning of 12B-parameter models can outperform massive, general-purpose frontier models in specific task-oriented domains.
- **Web-based automation will shift from prompt engineering to RL-based agent training.** The performance gap between MiRA and GPT-4o suggests that architectural planning and RL-based reward optimization are more effective for long-horizon web tasks than in-context learning alone.
⏳ Timeline
- **2025-11**: Initial release of the WebArena-Lite benchmark for evaluating long-horizon web agents.
- **2026-01**: Development of the MiRA subgoal-driven framework begins, focusing on sparse reward signals.
- **2026-03**: MiRA research paper published on arXiv, demonstrating a 43% success rate on WebArena-Lite.
Original source: ArXiv AI