All Updates
February 19, 2026
Musk Predicts AI Binary Coding by 2026
Elon Musk predicts in a recent video that by the end of 2026, AI will directly write binary code, greatly reducing human reliance on programming languages. This could lead to full automation in the programming industry, potentially eliminating traditional programmers.
Verifiable Semantics for Agent Communication
Proposes a certification protocol, based on a stimulus-meaning model, that verifies shared understanding of terms in multi-agent systems via tests on observable events. Core-guarded reasoning limits agents to certified terms, provably bounding disagreement. Simulations show 72-96% disagreement reduction; LLM validation achieves a 51% drop.
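The certification step can be sketched as follows; the agents, the noisy-threshold assent rule, and the 90% agreement cutoff are illustrative stand-ins, not the paper's actual protocol:

```python
import random

rng = random.Random(0)

def assent(noise, event):
    # Hypothetical agent: assents that a term applies to an observable
    # event when its (noisy) reading of the event crosses a threshold.
    return event + rng.gauss(0, noise) > 0.5

def certify(events, noise_a, noise_b, threshold=0.9):
    # Certify the term when the two agents' assent/dissent verdicts agree
    # on at least `threshold` of the probe events; core-guarded reasoning
    # would then restrict the agents to certified terms only.
    agree = sum(assent(noise_a, e) == assent(noise_b, e) for e in events)
    return agree / len(events) >= threshold

probes = [rng.random() for _ in range(500)]
low_noise_certified = certify(probes, 0.001, 0.001)  # near-identical readings
high_noise_certified = certify(probes, 1.0, 1.0)     # divergent readings
```

Agents whose readings of the same events rarely diverge clear the agreement bar; noisy, divergent agents do not, so the term stays uncertified.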
Science of AI Agent Reliability
AI agents excel on benchmarks but fail in practice due to single-metric evaluations ignoring consistency, robustness, predictability, and safety. This arXiv paper proposes 12 concrete metrics across these four dimensions, grounded in safety-critical engineering. Tests on 14 agentic models across two benchmarks reveal only marginal reliability gains despite capability advances.
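As a concrete example of a consistency-style metric, one can compare the strict "passes every repeated run" rate against the optimistic "passes any run" rate; the metric and all names below are illustrative, not the paper's twelve:

```python
def consistency_at_k(run_results):
    # `run_results[task]` is a list of booleans, one per repeated run.
    # Strict rate: tasks solved in *every* run (consistency).
    # Optimistic rate: tasks solved in *any* run (the usual headline number).
    tasks = list(run_results.values())
    all_runs = sum(all(r) for r in tasks) / len(tasks)
    any_run = sum(any(r) for r in tasks) / len(tasks)
    return all_runs, any_run

runs = {
    "book_flight":  [True, True, True],
    "refund_order": [True, False, True],    # flaky: passes only sometimes
    "merge_pr":     [False, False, False],
}
strict, optimistic = consistency_at_k(runs)  # 1/3 vs 2/3
```

The gap between the two numbers is exactly the kind of reliability signal a single-metric benchmark hides.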
Proxy State Eval Scales LLM Agent Benchmarks
Proxy State-Based Evaluation introduces an LLM-driven simulation framework for benchmarking multi-turn tool-calling agents, avoiding costly deterministic backends. It uses scenarios to define goals and states, with LLM trackers inferring proxy states from interaction traces for verification. The method yields stable rankings, judge agreement above 90%, and transferable training data.
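The proxy-state idea can be illustrated with a deterministic stand-in for the LLM tracker: replay the tool calls in a trace into a lightweight state, then check the scenario's goal predicates against it. The tool names and state format here are assumptions:

```python
def infer_proxy_state(trace):
    # Stand-in for the LLM state tracker: replay tool calls from the
    # interaction trace into a lightweight proxy state (a plain dict).
    # A real tracker would *infer* this from free-form logs.
    state = {}
    for call in trace:
        if call["tool"] == "create_order":
            state.setdefault("orders", []).append(call["args"]["item"])
        elif call["tool"] == "cancel_order":
            state["orders"].remove(call["args"]["item"])
    return state

def verify(state, goal):
    # Check every goal predicate of the scenario against the proxy state.
    return all(check(state) for check in goal)

trace = [
    {"tool": "create_order", "args": {"item": "laptop"}},
    {"tool": "create_order", "args": {"item": "mouse"}},
    {"tool": "cancel_order", "args": {"item": "mouse"}},
]
goal = [lambda s: s.get("orders") == ["laptop"]]
passed = verify(infer_proxy_state(trace), goal)  # True
```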
PAHF: Personalized Agents from Human Feedback
PAHF is a framework for continual personalization of AI agents, learning online from live human interactions via explicit per-user memory. It uses a three-step loop: pre-action clarification, preference-grounded actions, and post-action feedback for memory updates. Evaluated on new benchmarks for manipulation and shopping, it outperforms baselines in initial learning and adaptation to preference shifts.
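The three-step loop can be sketched minimally; the class, method names, and memory format below are assumptions, not PAHF's implementation:

```python
class PAHFLoop:
    def __init__(self):
        self.memory = {}  # explicit per-user preference store

    def clarify(self, user, task):
        # Step 1: pre-action clarification when no preference is stored.
        if (user, task) not in self.memory:
            return f"How would you like '{task}' handled?"
        return None

    def act(self, user, task):
        # Step 2: ground the action in the stored preference.
        pref = self.memory.get((user, task), "default")
        return f"{task}:{pref}"

    def feedback(self, user, task, new_pref):
        # Step 3: post-action feedback updates the memory online,
        # which is how the agent tracks preference shifts.
        self.memory[(user, task)] = new_pref

agent = PAHFLoop()
question = agent.clarify("ana", "pack_groceries")      # asks: nothing stored yet
agent.feedback("ana", "pack_groceries", "paper_bags")  # user states a preference
action = agent.act("ana", "pack_groceries")            # "pack_groceries:paper_bags"
```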
Mirror Tops GPT-5 on Endo Board Exam
January Mirror, an evidence-grounded clinical reasoning system, scored 87.5% on a 120-question 2025 endocrinology board-style exam, outperforming human experts (62.3%) and frontier LLMs like GPT-5.2 (74.6%). It excelled on the hardest questions (76.7% accuracy) under closed-evidence constraints without web access. Outputs featured traceable citations from guidelines with 100% accuracy.
In-Context Inference Enables Multi-Agent Cooperation
Researchers demonstrate that sequence models' in-context learning induces cooperation in multi-agent RL without hardcoded co-player assumptions or timescale separation. Training against diverse co-players leads to best-response strategies on intra-episode timescales. Mutual shaping emerges naturally via extortion vulnerability, providing a scalable path to cooperative behaviors.
GPSBench Tests LLM GPS Reasoning
Researchers launch GPSBench, a 57,800-sample dataset spanning 17 tasks that probes LLMs' geospatial reasoning without tools. Across 14 SOTA LLMs, geographic knowledge proves reliable, but models struggle with geometric computations such as distance and bearing. The released dataset, code, and findings reveal trade-offs in finetuning and augmentation benefits.
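The geometric computations LLMs stumble on have compact closed forms. A spherical-Earth sketch of great-circle distance and initial bearing (standard haversine and forward-azimuth formulas, not GPSBench code):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    # Great-circle distance on a sphere of mean Earth radius 6371 km.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def initial_bearing_deg(lat1, lon1, lat2, lon2):
    # Forward azimuth from point 1 to point 2, degrees clockwise from north.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    y = math.sin(dl) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    return math.degrees(math.atan2(y, x)) % 360

# Paris (48.8566 N, 2.3522 E) to London (51.5074 N, 0.1278 W)
d = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)   # ~344 km
b = initial_bearing_deg(48.8566, 2.3522, 51.5074, -0.1278)  # ~330 deg (NNW)
```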
FoT: Dynamic LLM Reasoning Optimizer
FoT introduces a general-purpose framework for dynamic reasoning schemes in LLMs, overcoming static structures in Chain of Thought, Tree of Thoughts, and Graph of Thoughts. It features hyperparameter tuning, prompt optimization, parallel execution, and caching for better performance. The open-source codebase demonstrates faster execution, reduced costs, and improved task scores.
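Two of the listed optimizations, caching and parallel execution, can be sketched with a memoized expansion function fanned out over a thread pool. The `expand` stub stands in for an LLM call; everything here is illustrative, not FoT's codebase:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=None)
def expand(thought):
    # Stand-in for an LLM call that expands one reasoning step; the
    # cache avoids re-paying for thoughts the scheme revisits.
    return tuple(f"{thought}.{i}" for i in range(2))

def explore(frontier, depth):
    # Expand a frontier of thoughts level by level, in parallel.
    for _ in range(depth):
        with ThreadPoolExecutor(max_workers=4) as pool:
            children = pool.map(expand, frontier)
        frontier = [c for kids in children for c in kids]
    return frontier

leaves = explore(["root"], 2)  # 4 leaves: root.0.0 ... root.1.1
```

In a dynamic scheme the frontier would be pruned and re-ranked between levels; the cache matters precisely because revisits are common there.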
Corecraft RL Env Trains Generalizable Agents
Surge AI launches Corecraft, the first high-fidelity RL environment in EnterpriseGym, simulating enterprise customer support with 2,500+ entities and 23 tools. Training GLM 4.6 via GRPO improves task pass@1 from 25% to 37% on held-out tasks, with gains transferring to BFCL (+4.5%), τ²-Bench Retail (+7.4%), and Toolathlon (+6.8%). Results highlight task-centric design, expert rubrics, and realistic workflows as keys to generalization.
CaR Enables Efficient Neural Routing Constraints
Neural solvers excel in simple routing but falter on complex constraints. CaR introduces the first general framework using explicit learning-based feasibility refinement and joint training to generate diverse solutions for lightweight improvement. It outperforms SOTA solvers in feasibility, quality, and efficiency on hard constraints.
CAFE: Causal Multi-Agent AFE Breakthrough
CAFE reformulates automated feature engineering as a causally guided sequential decision process, using causal discovery for soft priors and multi-agent RL for feature construction. It outperforms baselines by up to 7% on 15 benchmarks and reduces performance drops by 4x under covariate shift. The framework produces compact, stable features with reliable attributions.
Boosting LLM Feedback-Driven In-Context Learning
Proposes a trainable framework for interactive in-context learning that uses multi-turn feedback under information asymmetry on verifiable tasks. Trained smaller models nearly match the performance of 10x larger models and generalize to coding, puzzles, and mazes. The approach enables self-improvement by internally modeling teacher critiques.
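The multi-turn feedback setup can be sketched generically: a student retries a verifiable task, appending the teacher's critique to its context each turn. `solve`, `verify`, and `critique` are caller-supplied stand-ins, and the number-guessing task is a toy example, not the paper's benchmark:

```python
def feedback_loop(solve, verify, critique, task, max_turns=4):
    # Student attempts the task; on failure, the teacher's critique is
    # appended to the context and the student tries again in-context.
    context = [task]
    for turn in range(max_turns):
        answer = solve(context)
        if verify(task, answer):
            return answer, turn + 1
        context.append(critique(task, answer))
    return None, max_turns

# Toy verifiable task with information asymmetry: the teacher knows a
# hidden integer; the student only sees higher/lower-style feedback.
hidden = 6
state = {"lo": 0, "hi": 10}
solve = lambda ctx: (state["lo"] + state["hi"]) // 2
def critique(task, ans):
    if ans < hidden:
        state["lo"] = ans + 1
    else:
        state["hi"] = ans - 1
    return f"{ans} is wrong"

answer, turns = feedback_loop(solve, lambda t, a: a == hidden, critique, "guess")
# answer == 6, found on turn 3
```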
AI Long-Term Memory: Store-First Paradigm
This arXiv paper proposes a 'store then on-demand extract' approach for AI memory to retain raw experiences and avoid information loss from the dominant 'extract then store' method. It also explores deriving deeper insights from probabilistic experiences and improving efficiency via shared storage. Simple experiments validate these ideas, while outlining challenges and research directions.
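The contrast with 'extract then store' can be made concrete with a minimal sketch; the class and the naive keyword retrieval are assumptions for illustration:

```python
class StoreFirstMemory:
    # 'Store then on-demand extract': keep raw experiences verbatim and
    # derive task-specific views only at query time, so nothing is
    # discarded up front.
    def __init__(self):
        self.raw = []  # append-only log of raw experiences

    def store(self, experience):
        self.raw.append(experience)

    def extract(self, query):
        # On-demand extraction: filter raw experiences for the query.
        # An 'extract then store' design would have summarized at write
        # time and could not recover detail it had already dropped.
        return [e for e in self.raw if query in e]

mem = StoreFirstMemory()
mem.store("user asked for a vegan recipe on Monday")
mem.store("user rejected the cilantro garnish")
hits = mem.extract("cilantro")  # ["user rejected the cilantro garnish"]
```

Shared storage across extraction strategies is what makes the efficiency argument work: many views, one raw log.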
Agentic AI Fails Paradoxically on Rare Symptoms
Autonomous agentic workflows exhibit optimization instability, where iterative self-improvement degrades classifier performance, especially for low-prevalence clinical symptoms like Long COVID brain fog (3% prevalence). Using the open-source Pythia framework, validation sensitivity oscillated wildly between 1.0 and 0.0. A selector agent that retrospectively picks the best iteration outperformed guiding agents and expert lexicons by 331% in F1 on brain fog.
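The selector-agent idea reduces to scoring every self-improvement iteration on held-out labels and keeping the argmax-F1 one, rather than trusting the last (possibly degraded) iteration. A sketch with illustrative names, not Pythia code:

```python
def select_best_iteration(iterations, labels):
    # `iterations` maps an iteration id to its boolean predictions on a
    # held-out validation set; pick the iteration with the highest F1.
    def f1(preds):
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return max(iterations, key=lambda k: f1(iterations[k]))

labels = [True, True, False, False, True]
iterations = {
    "iter_1": [True, False, False, False, True],    # decent
    "iter_2": [True, True, False, False, True],     # best (perfect F1)
    "iter_3": [False, False, False, False, False],  # collapsed: all-negative
}
best = select_best_iteration(iterations, labels)  # "iter_2"
```

On rare positives, the all-negative collapse of `iter_3` is exactly the failure mode that makes retrospective selection beat forward-guided iteration.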
Agent Skills Boost SLMs for Industry
An arXiv paper gives a mathematical definition of the Agent Skill framework and evaluates its benefits for small language models (SLMs) in industrial settings with data-security constraints. Moderate SLMs (12B-30B parameters) show substantial gains in accuracy and reduced hallucinations; 80B code-specialized variants match proprietary models with better GPU efficiency. The insights guide SLM deployments that avoid public APIs.
Chunwan Sparks China Humanoid Robot Boom
Morgan Stanley identifies 2026 as a pivotal inflection point for China's humanoid robot market, mirroring the 2019-2020 NEV surge. IDC predicts application scenarios will triple by 2026. The Spring Festival Gala has heightened visibility for robotics.
Tesla FSD Supervised Hits 8B Miles
Tesla announced FSD Supervised cumulative mileage exceeds 8 billion miles, up from 7 billion in December 2024. This data accelerates training for unsupervised Full Self-Driving. Elon Musk states 10 billion miles needed to handle complex long-tail scenarios.
Ant's Trillion-Param Open Model Excels in EQ & Agents
Ant Group launches a trillion-parameter open-source model with superior human understanding and task execution. It excels in emotional intelligence and agentic capability. Despite its scale, the model runs with the efficiency of a much lighter one.
AI & Robots Flock to One Spot After Chunwan
After China's Spring Festival Gala (Chunwan), a wave of AI systems and robots has converged on a single venue, one that has absorbed the gala's "stray" acts. The article teases this post-event surge in AI and robotics popularity, hinting at the destination with a doge meme.