All Updates
February 19, 2026
Musk Predicts AI Binary Coding by 2026
Elon Musk predicts in a recent video that by the end of 2026, AI will directly write binary code, greatly reducing human reliance on programming languages. This could lead to full automation in the programming industry, potentially eliminating traditional programmers.
Verifiable Semantics for Agent Communication
Proposes a certification protocol, based on a stimulus-meaning model, that verifies shared understanding of terms in multi-agent systems via tests on observable events. Core-guarded reasoning limits agents to certified terms, provably bounding disagreement. Simulations show 72-96% disagreement reduction; LLM validation achieves a 51% drop.
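The certification step can be sketched as follows; the agents, the noisy-threshold assent rule, and the 90% agreement cutoff are illustrative stand-ins, not the paper's actual protocol:

```python
import random

rng = random.Random(0)

def assent(noise, event):
    # Hypothetical agent: assents that a term applies to an observable
    # event when its (noisy) reading of the event crosses a threshold.
    return event + rng.gauss(0, noise) > 0.5

def certify(events, noise_a, noise_b, threshold=0.9):
    # Certify the term when the two agents' assent/dissent verdicts agree
    # on at least `threshold` of the probe events; core-guarded reasoning
    # would then restrict the agents to certified terms only.
    agree = sum(assent(noise_a, e) == assent(noise_b, e) for e in events)
    return agree / len(events) >= threshold

probes = [rng.random() for _ in range(500)]
low_noise_certified = certify(probes, 0.001, 0.001)  # near-identical readings
high_noise_certified = certify(probes, 1.0, 1.0)     # divergent readings
```

Agents whose readings of the same events rarely diverge clear the agreement bar; noisy, divergent agents do not, so the term stays uncertified.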
Science of AI Agent Reliability
AI agents excel on benchmarks but fail in practice due to single-metric evaluations ignoring consistency, robustness, predictability, and safety. This arXiv paper proposes 12 concrete metrics across these four dimensions, grounded in safety-critical engineering. Tests on 14 agentic models across two benchmarks reveal only marginal reliability gains despite capability advances.
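As a concrete example of a consistency-style metric, one can compare the strict "passes every repeated run" rate against the optimistic "passes any run" rate; the metric and all names below are illustrative, not the paper's twelve:

```python
def consistency_at_k(run_results):
    # `run_results[task]` is a list of booleans, one per repeated run.
    # Strict rate: tasks solved in *every* run (consistency).
    # Optimistic rate: tasks solved in *any* run (the usual headline number).
    tasks = list(run_results.values())
    all_runs = sum(all(r) for r in tasks) / len(tasks)
    any_run = sum(any(r) for r in tasks) / len(tasks)
    return all_runs, any_run

runs = {
    "book_flight":  [True, True, True],
    "refund_order": [True, False, True],    # flaky: passes only sometimes
    "merge_pr":     [False, False, False],
}
strict, optimistic = consistency_at_k(runs)  # 1/3 vs 2/3
```

The gap between the two numbers is exactly the kind of reliability signal a single-metric benchmark hides.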
Proxy State Eval Scales LLM Agent Benchmarks
Proxy State-Based Evaluation introduces an LLM-driven simulation framework for benchmarking multi-turn tool-calling agents, avoiding costly deterministic backends. It uses scenarios to define goals and states, with LLM trackers inferring proxy states from interaction traces for verification. The method yields stable rankings, judge agreement above 90%, and transferable training data.
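The proxy-state idea can be illustrated with a deterministic stand-in for the LLM tracker: replay the tool calls in a trace into a lightweight state, then check the scenario's goal predicates against it. The tool names and state format here are assumptions:

```python
def infer_proxy_state(trace):
    # Stand-in for the LLM state tracker: replay tool calls from the
    # interaction trace into a lightweight proxy state (a plain dict).
    # A real tracker would *infer* this from free-form logs.
    state = {}
    for call in trace:
        if call["tool"] == "create_order":
            state.setdefault("orders", []).append(call["args"]["item"])
        elif call["tool"] == "cancel_order":
            state["orders"].remove(call["args"]["item"])
    return state

def verify(state, goal):
    # Check every goal predicate of the scenario against the proxy state.
    return all(check(state) for check in goal)

trace = [
    {"tool": "create_order", "args": {"item": "laptop"}},
    {"tool": "create_order", "args": {"item": "mouse"}},
    {"tool": "cancel_order", "args": {"item": "mouse"}},
]
goal = [lambda s: s.get("orders") == ["laptop"]]
passed = verify(infer_proxy_state(trace), goal)  # True
```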
PAHF: Personalized Agents from Human Feedback
PAHF is a framework for continual personalization of AI agents, learning online from live human interactions via explicit per-user memory. It uses a three-step loop: pre-action clarification, preference-grounded actions, and post-action feedback for memory updates. Evaluated on new benchmarks for manipulation and shopping, it outperforms baselines in initial learning and adaptation to preference shifts.
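The three-step loop can be sketched minimally; the class, method names, and memory format below are assumptions, not PAHF's implementation:

```python
class PAHFLoop:
    def __init__(self):
        self.memory = {}  # explicit per-user preference store

    def clarify(self, user, task):
        # Step 1: pre-action clarification when no preference is stored.
        if (user, task) not in self.memory:
            return f"How would you like '{task}' handled?"
        return None

    def act(self, user, task):
        # Step 2: ground the action in the stored preference.
        pref = self.memory.get((user, task), "default")
        return f"{task}:{pref}"

    def feedback(self, user, task, new_pref):
        # Step 3: post-action feedback updates the memory online,
        # which is how the agent tracks preference shifts.
        self.memory[(user, task)] = new_pref

agent = PAHFLoop()
question = agent.clarify("ana", "pack_groceries")      # asks: nothing stored yet
agent.feedback("ana", "pack_groceries", "paper_bags")  # user states a preference
action = agent.act("ana", "pack_groceries")            # "pack_groceries:paper_bags"
```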
Mirror Tops GPT-5 on Endo Board Exam
January Mirror, an evidence-grounded clinical reasoning system, scored 87.5% on a 120-question 2025 endocrinology board-style exam, outperforming human experts (62.3%) and frontier LLMs like GPT-5.2 (74.6%). It excelled on the hardest questions (76.7% accuracy) under closed-evidence constraints without web access. Outputs featured traceable citations from guidelines with 100% accuracy.
In-Context Inference Enables Multi-Agent Cooperation
Researchers demonstrate that sequence models' in-context learning induces cooperation in multi-agent RL without hardcoded co-player assumptions or timescale separation. Training against diverse co-players leads to best-response strategies on intra-episode timescales. Mutual shaping emerges naturally via extortion vulnerability, providing a scalable path to cooperative behaviors.
GPSBench Tests LLM GPS Reasoning
Researchers launch GPSBench, a 57,800-sample dataset spanning 17 tasks that probes LLMs' geospatial reasoning without tools. Across 14 SOTA LLMs, geographic knowledge proves reliable, but models struggle with geometric computations such as distance and bearing. The released dataset, code, and findings reveal trade-offs in finetuning and augmentation benefits.
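The geometric computations LLMs stumble on have compact closed forms. A spherical-Earth sketch of great-circle distance and initial bearing (standard haversine and forward-azimuth formulas, not GPSBench code):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    # Great-circle distance on a sphere of mean Earth radius 6371 km.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def initial_bearing_deg(lat1, lon1, lat2, lon2):
    # Forward azimuth from point 1 to point 2, degrees clockwise from north.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    y = math.sin(dl) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    return math.degrees(math.atan2(y, x)) % 360

# Paris (48.8566 N, 2.3522 E) to London (51.5074 N, 0.1278 W)
d = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)   # ~344 km
b = initial_bearing_deg(48.8566, 2.3522, 51.5074, -0.1278)  # ~330 deg (NNW)
```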
FoT: Dynamic LLM Reasoning Optimizer
FoT introduces a general-purpose framework for dynamic reasoning schemes in LLMs, overcoming static structures in Chain of Thought, Tree of Thoughts, and Graph of Thoughts. It features hyperparameter tuning, prompt optimization, parallel execution, and caching for better performance. The open-source codebase demonstrates faster execution, reduced costs, and improved task scores.
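Two of the listed optimizations, caching and parallel execution, can be sketched with a memoized expansion function fanned out over a thread pool. The `expand` stub stands in for an LLM call; everything here is illustrative, not FoT's codebase:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=None)
def expand(thought):
    # Stand-in for an LLM call that expands one reasoning step; the
    # cache avoids re-paying for thoughts the scheme revisits.
    return tuple(f"{thought}.{i}" for i in range(2))

def explore(frontier, depth):
    # Expand a frontier of thoughts level by level, in parallel.
    for _ in range(depth):
        with ThreadPoolExecutor(max_workers=4) as pool:
            children = pool.map(expand, frontier)
        frontier = [c for kids in children for c in kids]
    return frontier

leaves = explore(["root"], 2)  # 4 leaves: root.0.0 ... root.1.1
```

In a dynamic scheme the frontier would be pruned and re-ranked between levels; the cache matters precisely because revisits are common there.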
Corecraft RL Env Trains Generalizable Agents
Surge AI launches Corecraft, the first high-fidelity RL environment in EnterpriseGym, simulating enterprise customer support with 2,500+ entities and 23 tools. Training GLM 4.6 via GRPO improves task pass@1 from 25% to 37% on held-out tasks, with gains transferring to BFCL (+4.5%), τ²-Bench Retail (+7.4%), and Toolathlon (+6.8%). Results highlight task-centric design, expert rubrics, and realistic workflows as keys to generalization.
CaR Enables Efficient Neural Routing Constraints
Neural solvers excel in simple routing but falter on complex constraints. CaR introduces the first general framework using explicit learning-based feasibility refinement and joint training to generate diverse solutions for lightweight improvement. It outperforms SOTA solvers in feasibility, quality, and efficiency on hard constraints.
CAFE: Causal Multi-Agent AFE Breakthrough
CAFE reformulates automated feature engineering as a causally guided sequential decision process, using causal discovery for soft priors and multi-agent RL for feature construction. It outperforms baselines by up to 7% on 15 benchmarks and reduces performance drops by 4x under covariate shift. The framework produces compact, stable features with reliable attributions.
Boosting LLM Feedback-Driven In-Context Learning
Proposes a trainable framework for interactive in-context learning that uses multi-turn feedback under information asymmetry on verifiable tasks. Trained smaller models nearly match the performance of 10x larger models and generalize to coding, puzzles, and mazes. The approach enables self-improvement by internally modeling teacher critiques.
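The multi-turn feedback setup can be sketched generically: a student retries a verifiable task, appending the teacher's critique to its context each turn. `solve`, `verify`, and `critique` are caller-supplied stand-ins, and the number-guessing task is a toy example, not the paper's benchmark:

```python
def feedback_loop(solve, verify, critique, task, max_turns=4):
    # Student attempts the task; on failure, the teacher's critique is
    # appended to the context and the student tries again in-context.
    context = [task]
    for turn in range(max_turns):
        answer = solve(context)
        if verify(task, answer):
            return answer, turn + 1
        context.append(critique(task, answer))
    return None, max_turns

# Toy verifiable task with information asymmetry: the teacher knows a
# hidden integer; the student only sees higher/lower-style feedback.
hidden = 6
state = {"lo": 0, "hi": 10}
solve = lambda ctx: (state["lo"] + state["hi"]) // 2
def critique(task, ans):
    if ans < hidden:
        state["lo"] = ans + 1
    else:
        state["hi"] = ans - 1
    return f"{ans} is wrong"

answer, turns = feedback_loop(solve, lambda t, a: a == hidden, critique, "guess")
# answer == 6, found on turn 3
```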
AI Long-Term Memory: Store-First Paradigm
This arXiv paper proposes a 'store then on-demand extract' approach for AI memory to retain raw experiences and avoid information loss from the dominant 'extract then store' method. It also explores deriving deeper insights from probabilistic experiences and improving efficiency via shared storage. Simple experiments validate these ideas, while outlining challenges and research directions.
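The contrast with 'extract then store' can be made concrete with a minimal sketch; the class and the naive keyword retrieval are assumptions for illustration:

```python
class StoreFirstMemory:
    # 'Store then on-demand extract': keep raw experiences verbatim and
    # derive task-specific views only at query time, so nothing is
    # discarded up front.
    def __init__(self):
        self.raw = []  # append-only log of raw experiences

    def store(self, experience):
        self.raw.append(experience)

    def extract(self, query):
        # On-demand extraction: filter raw experiences for the query.
        # An 'extract then store' design would have summarized at write
        # time and could not recover detail it had already dropped.
        return [e for e in self.raw if query in e]

mem = StoreFirstMemory()
mem.store("user asked for a vegan recipe on Monday")
mem.store("user rejected the cilantro garnish")
hits = mem.extract("cilantro")  # ["user rejected the cilantro garnish"]
```

Shared storage across extraction strategies is what makes the efficiency argument work: many views, one raw log.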
Agentic AI Fails Paradoxically on Rare Symptoms
Autonomous agentic workflows exhibit optimization instability, where iterative self-improvement degrades classifier performance, especially for low-prevalence clinical symptoms like Long COVID brain fog (3% prevalence). Using the open-source Pythia framework, validation sensitivity oscillated wildly between 1.0 and 0.0. A selector agent that retrospectively picks the best iteration outperformed guiding agents and expert lexicons by 331% in F1 on brain fog.
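The selector-agent idea reduces to scoring every self-improvement iteration on held-out labels and keeping the argmax-F1 one, rather than trusting the last (possibly degraded) iteration. A sketch with illustrative names, not Pythia code:

```python
def select_best_iteration(iterations, labels):
    # `iterations` maps an iteration id to its boolean predictions on a
    # held-out validation set; pick the iteration with the highest F1.
    def f1(preds):
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return max(iterations, key=lambda k: f1(iterations[k]))

labels = [True, True, False, False, True]
iterations = {
    "iter_1": [True, False, False, False, True],    # decent
    "iter_2": [True, True, False, False, True],     # best (perfect F1)
    "iter_3": [False, False, False, False, False],  # collapsed: all-negative
}
best = select_best_iteration(iterations, labels)  # "iter_2"
```

On rare positives, the all-negative collapse of `iter_3` is exactly the failure mode that makes retrospective selection beat forward-guided iteration.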
Agent Skills Boost SLMs for Industry
An arXiv paper gives a mathematical definition of the Agent Skill framework and evaluates its benefits for small language models (SLMs) in industrial settings with data-security constraints. Moderate SLMs (12B-30B parameters) show substantial gains in accuracy and reduced hallucinations; 80B code-specialized variants match proprietary models with better GPU efficiency. The insights guide SLM deployments that avoid public APIs.
Chunwan Sparks China Humanoid Robot Boom
Morgan Stanley identifies 2026 as a pivotal inflection point for China's humanoid robot market, mirroring the 2019-2020 NEV surge. IDC predicts application scenarios will triple by 2026. The Spring Festival Gala has heightened visibility for robotics.
Tesla FSD Supervised Hits 8B Miles
Tesla announced FSD Supervised cumulative mileage exceeds 8 billion miles, up from 7 billion in December 2024. This data accelerates training for unsupervised Full Self-Driving. Elon Musk states 10 billion miles needed to handle complex long-tail scenarios.
Ant's Trillion-Param Open Model Excels in EQ & Agents
Ant Group launches a trillion-parameter open-source model with superior human understanding and task execution. It excels in emotional intelligence and agentic capability. Despite its scale, the model runs with the efficiency of a much lighter one.
AI & Robots Flock to One Spot After Chunwan
After China's Spring Festival Gala (Chunwan), a wave of AI systems and robots has converged on a single venue, one that has absorbed the gala's "stray" acts. The article teases this post-event surge in AI and robotics popularity, hinting at the destination with a doge meme.