All Updates
April 13, 2026
PilotBench: Safe Aviation AI Benchmark
PilotBench is a benchmark for evaluating LLMs on safety-critical flight trajectory and attitude prediction using 708 real-world aviation trajectories. It introduces Pilot-Score, weighting 60% accuracy and 40% instruction/safety compliance. LLMs excel in controllability but lag in precision versus traditional forecasters, especially in complex flight phases.
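The 60/40 weighting can be illustrated with a minimal sketch. The summary only states the two weights; the linear weighted sum, the normalization of both inputs to [0, 1], and the function name `pilot_score` are assumptions made here for illustration, not the benchmark's published definition.

```python
def pilot_score(accuracy: float, compliance: float,
                w_accuracy: float = 0.6, w_compliance: float = 0.4) -> float:
    """Combine accuracy and instruction/safety compliance into one score.

    Assumption: both inputs are already normalized to [0, 1] and the
    combination is a plain weighted sum. Only the 60/40 weights come
    from the summary above.
    """
    for name, value in (("accuracy", accuracy), ("compliance", compliance)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return w_accuracy * accuracy + w_compliance * compliance


# Hypothetical inputs: a model that is moderately accurate but fully compliant.
score = pilot_score(accuracy=0.5, compliance=1.0)
print(round(score, 3))
```

Under this reading, a model with perfect compliance but middling accuracy can still score reasonably well, which fits the reported pattern of LLMs doing better on controllability than on precision.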
PETITE: Tutor-Student Boosts LLM Coding
PETITE introduces tutor-student multi-agent interaction using the same LLM to enhance coding problem-solving. The student agent generates and refines code, while the tutor provides feedback without ground-truth access. It matches or exceeds SOTA methods like Self-Consistency on APPS benchmark with far fewer tokens.
OpenKedge: Safe Agent Mutation Protocol
OpenKedge redefines AI agent mutations as governed processes via declarative intent proposals evaluated against system state and policies before execution. It enforces safety through execution contracts with bounded actions, resources, and time using ephemeral identities. The protocol introduces Intent-to-Execution Evidence Chain (IEEC) for cryptographic auditability, proven effective in multi-agent and cloud scenarios.
LOM-action: Auditable Enterprise AI Simulation
LOM-action equips enterprise AI with event-driven ontology simulation, where business events trigger graph mutations in a sandbox to evolve a scenario-valid simulation graph. Decisions are derived exclusively from this graph via a dual-mode skill and reasoning architecture, producing full audit logs. It achieves 93.82% accuracy and 98.74% F1, outperforming Doubao-1.8 and DeepSeek-V3.2 by 4x on F1.
Linear Bounds for MSO Models via Decision Diagrams
Researchers extend Courcelle's theorem, proving MSO2 models with free variables can be represented by decision diagrams of parameterized linear size in treewidth or pathwidth. Upper bounds shown for SDD (treewidth) and OBDD (pathwidth); lower bound proves OBDD limitations on bounded treewidth graphs. Links MSO logic to knowledge representation.
Feedback Search Optimizes LLM Planning Domains
Researchers cast model-space reasoning as search in feedback space to generate planning domains from natural language using LLMs. They augment the descriptions with symbolic feedback such as landmarks and output from the VAL validator. Heuristic search over the model space improves domain quality for practical deployment.
Artifacts as RL Agent External Memory
Researchers formalize the situated view of cognition in RL, framing the environment as functional memory via 'artifacts' that compress history information. Proofs show artifacts reduce memory needed for policies, corroborated by experiments where spatial path observations unintentionally lower memory requirements. This paves the way for exploiting environments as substitutes for internal memory.
Agents Sustain Marketing Gains Autonomously
A longitudinal case study analyzes agentic AI for marketing personalization over 11 months in a consumer app. The human-curated phase achieved the highest engagement lift, while autonomous agents sustained positive gains from a fixed library. The results support a hybrid human-agent model for scalable performance.
SoftBank, NEC, Honda, Sony Form 1T-Param Physical AI Venture
SoftBank is reportedly teaming up with NEC, Honda, and Sony Group to establish a new company for AI foundation model development. The model targets 1 trillion parameters for physical AI. It focuses on integrating large-scale models with robots.
ICML 2026 Review Deadline Sparks Outrage
ICML 2026 extended the deadline for reviewers' final justifications without allowing author-to-AC comments, frustrating authors. A reviewer raised new concerns about experiments and fairness after the rebuttal, putting strong papers at risk of rejection. It is seen as a major process mistake.
China Approves First L3 Self-Driving EVs
Chinese carmakers received approval in mid-December for EV models with Level 3 autonomous driving on public roads. They are preparing for mass production of 'hands-off' vehicles. New mandatory safety standards for autonomous vehicles are open for public comment, marking a pivotal year for self-driving tech in China.
Gemma 4 Called Out for Lazy Web Search
A user complains that Gemma 4 26B MoE ignores prompts and tools for extensive web search, running at most one search. Despite instructions and skills, it prefers its internal knowledge over digging deeper. This contrasts with the more proactive Qwen 3.5 27B.
Unitree H1 Runs 36 km/h, Claims World Champion Level
Chinese robotics firm Unitree Robotics released a video of their headless humanoid robot H1 sprinting at a top speed of 36 km/h (10 m/s). The company claims this performance reaches 'world champion level' in humanoid robotics.
Edge UI Revamp Embraces Copilot Round Corners
Microsoft Edge plans a 2026 UI overhaul with softer, larger rounded corners mirroring Copilot's design. Now managed by the AI team, it features iOS-like toggles and deeper Chromium alignment, reducing unique features. This unifies visuals across Microsoft's AI ecosystem despite performance critiques.
China's AI Plan for Lessons and Homework
China's National Data Administration published an action plan for AI in education to upskill citizens for AI adoption. The plan promotes AI for preparing school lessons and marking homework. It aims to integrate AI deeply into the education system.
Humanoid Robots: Involution and Client Wars
A screenshot reveals intense involution, anxiety, and battles over resources in the humanoid robot sector. Rivals are aggressively trying to take all of Unitree's customers and bids.
Meta AI Glasses: Review and Privacy Risks
Journalist Elle Hunt shares her month-long experience wearing Meta's AI-powered smart glasses in a podcast. She highlights transformative features for vision and hearing impairments alongside privacy concerns. Mark Zuckerberg describes them as 'personal super intelligence' for staying present.
Pony.ai Unveils Self-Evolving PonyWorld 2.0
Pony.ai has unveiled PonyWorld 2.0, enabling autonomous systems to self-diagnose and evolve. This platform redefines training methods for self-driving AI, marking a paradigm shift in the industry.
MiniMax-M2.7 NVFP4 Hits 2800 tok/s on 2x RTX PRO 6000
Benchmarks on 2x RTX PRO 6000 Blackwell (96GB) show MiniMax-M2.7 NVFP4 achieving 2800 tok/s at C=128, 127.7 tok/s at C=1. Prefill up to 17k tok/s at 8k ctx. Uses SGLang with TP=2, bf16 KV, no speculative decoding yet.
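The two decode figures can be put on a common footing with a short calculation. This is a sketch that assumes C is the number of concurrent requests and that 2800 tok/s is the aggregate across all of them; the post does not spell out either point, and the helper names are illustrative.

```python
def per_stream_throughput(aggregate_tok_s: float, concurrency: int) -> float:
    """Average tokens per second seen by each request.

    Assumption: aggregate_tok_s is the total across all concurrent requests.
    """
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    return aggregate_tok_s / concurrency


def batching_gain(aggregate_tok_s: float, single_stream_tok_s: float) -> float:
    """How many times more total throughput batching gives than one stream."""
    return aggregate_tok_s / single_stream_tok_s


# Figures quoted in the item above.
print(per_stream_throughput(2800, 128))   # per-request speed at C=128
print(batching_gain(2800, 127.7))         # aggregate vs. single-stream
```

Under these assumptions, each request at C=128 decodes at about 21.9 tok/s, roughly a sixth of the single-stream speed, while total throughput is about 22 times higher.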
Tencent Cloud Launches QClaw V2 Multi-Agent Collaboration
Tencent Cloud rolled out QClaw V2, introducing multi-agent collaboration for consumer AI assistants. This enables agents to work together on tasks. Scalability and memory limitations remain key challenges.