All Updates
February 13, 2026
INTENT: Budget Planning for Tool Agents
INTENT is an inference-time planner for budget-constrained LLM agents that use costly tools. It leverages a hierarchical world model for intention-aware cost anticipation, and it outperforms baselines on StableToolBench under tight budgets and price shifts.
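To make the budget-constrained setting concrete, here is a minimal greedy sketch of budget-aware tool selection: rank candidate tools by expected utility per unit cost and take them while the budget lasts. The tool names, utilities, and costs are invented for illustration; INTENT's actual planner uses a hierarchical world model rather than this greedy rule.

```python
def plan_under_budget(candidate_tools, budget):
    """Greedy budget-aware tool selection (illustrative sketch).

    candidate_tools: list of dicts with hypothetical 'name', 'utility',
    and 'cost' fields; budget: total spend allowed for the episode.
    """
    # Rank by utility per unit cost, best first.
    ranked = sorted(candidate_tools,
                    key=lambda t: t["utility"] / t["cost"],
                    reverse=True)
    plan, spent = [], 0.0
    for tool in ranked:
        if spent + tool["cost"] <= budget:
            plan.append(tool["name"])
            spent += tool["cost"]
    return plan, spent

tools = [{"name": "search", "utility": 0.9, "cost": 0.2},
         {"name": "code_exec", "utility": 0.7, "cost": 0.5},
         {"name": "premium_api", "utility": 0.95, "cost": 1.0}]
plan, spent = plan_under_budget(tools, budget=0.8)
print(plan)  # ['search', 'code_exec']
```

A learned planner improves on this greedy baseline by anticipating future costs (e.g., saving budget for a tool that only pays off later), which is the gap the paper's world model targets.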
Human-Inspired Learning for Adaptive Reasoning
Proposes a framework for the continuous learning of internal reasoning processes in AI, unifying reasoning, action, reflection, and verification. It treats thinking trajectories as learning material for evolving cognitive structures during execution. Experiments show a 23.9% runtime reduction on sensor tasks.
GHOST Prunes Mamba2 Hidden States Efficiently
GHOST applies structured pruning to Mamba2 using forward-pass controllability and observability metrics, avoiding backpropagation entirely. It achieves a 50% state reduction with only a ~1 PPL rise on WikiText-2 across 130M-2.7B models. Code is available anonymously.
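The core idea, scoring state dimensions from forward-pass statistics alone, can be sketched as follows. This toy proxy ranks each state dimension by its activation energy weighted by how strongly the output projection reads it, then keeps the top half; GHOST's actual controllability/observability metrics and the Mamba2 tensor shapes differ, so treat every name here as an assumption.

```python
import numpy as np

def prune_state_dims(hidden_states, out_proj, keep_ratio=0.5):
    """Rank state dimensions by a forward-pass observability proxy
    and keep the top `keep_ratio` fraction (no backprop needed).

    hidden_states: (T, D) states from a calibration forward pass.
    out_proj: (D, V) output projection; both are illustrative
    stand-ins for the real Mamba2 tensors.
    """
    state_energy = (hidden_states ** 2).mean(axis=0)   # (D,) activation energy
    read_strength = np.linalg.norm(out_proj, axis=1)   # (D,) output read norm
    scores = state_energy * read_strength
    k = max(1, int(keep_ratio * len(scores)))
    # Indices of the k highest-scoring dims, in ascending order.
    return np.sort(np.argsort(scores)[::-1][:k])

rng = np.random.default_rng(0)
H = rng.normal(size=(128, 16))
W = rng.normal(size=(16, 8))
kept = prune_state_dims(H, W, keep_ratio=0.5)
print(len(kept))  # 8
```

Because the scores come from a single forward pass over calibration data, the pruning decision costs far less than gradient-based saliency methods, which is the efficiency claim in the summary above.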
Exposing Ground Truth Illusion in Annotations
Literature review critiques 'ground truth' in ML data annotation as a positivistic fallacy ignoring human subjectivity. Analyzes 346 papers from top venues revealing biases like anchoring and geographic hegemony. Proposes roadmap for pluralistic infrastructures embracing disagreement.
ERM Fixes Causal Rung Collapse in LLMs
New research identifies 'rung collapse' in LLMs, where models confuse associations with causal interventions, leading to flawed reasoning under distributional shifts. It proposes Epistemic Regret Minimization (ERM), a belief revision method that penalizes causal errors independently of task success. Experiments across six frontier LLMs show ERM recovers 53-59% of entrenched errors.
DrIGM Enables Robust Multi-Agent RL
DrIGM introduces a distributionally robust Individual-Global-Max (IGM) principle for MARL, ensuring decentralized actions stay aligned with the joint policy under uncertainty via robust value factorization. It is compatible with VDN/QMIX/QTRAN without reward shaping and boosts OOD performance in SustainGym and StarCraft.
Decision-Valued Maps in DecisionDB
Formalizes decision-valued maps that track how representation choices affect downstream outcomes. DecisionDB logs, replays, and audits decisions using content-based IDs and write-once storage, and it partitions the representation space into persistence regions.
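The content-based-ID plus write-once storage pattern can be sketched in a few lines: each decision record is hashed to derive its ID, so identical content always maps to the same ID and an existing record is never overwritten. The class and method names below are illustrative, not DecisionDB's actual API.

```python
import hashlib
import json

class DecisionLog:
    """Minimal content-addressed, write-once decision log (sketch)."""

    def __init__(self):
        self._store = {}

    def log(self, representation, outcome):
        # Canonical serialization so equal content yields equal bytes.
        record = json.dumps({"repr": representation, "outcome": outcome},
                            sort_keys=True)
        rid = hashlib.sha256(record.encode()).hexdigest()[:16]
        # Write-once: setdefault never overwrites an existing record.
        self._store.setdefault(rid, record)
        return rid

    def replay(self, rid):
        """Return the logged decision for auditing or replay."""
        return json.loads(self._store[rid])

db = DecisionLog()
a = db.log({"feature_set": "v1"}, "approve")
b = db.log({"feature_set": "v1"}, "approve")  # same content, same ID
print(a == b)  # True
```

Content addressing makes audits tamper-evident: any change to a record changes its ID, so a replayed decision provably matches what was originally logged.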
DashAI No-Code XAI User Study
DashAI introduces a human-centered XAI module integrating PDP, PFI, and KernelSHAP for no-code ML users. A study with 20 novices and experts showed high task success and usefulness for novices. Explanations boosted trust, especially among beginners.
Crosscoders Unlock Cross-Architecture LLM Diffing
Researchers apply Crosscoders for the first time to compare LLMs across different architectures, introducing Dedicated Feature Crosscoders (DFCs) to isolate unique model features. The method detects, without supervision, behaviors such as Chinese Communist Party alignment in Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and copyright refusals in GPT-OSS-20B.
CausalAgent: Conversational Causal Inference
CausalAgent is a multi-agent system automating end-to-end causal inference via natural language. It integrates a multi-agent system (MAS), retrieval-augmented generation (RAG), and MCP, covering the pipeline from data cleaning to report generation. Interactive visualizations lower the barrier for non-experts.
C-JEPA Learns World Models via Object Masking
C-JEPA extends masked joint embedding prediction to object-centric representations with object-level masking, inducing latent interventions for interaction reasoning. It boosts counterfactual VQA by 20% and enables efficient agent planning using 1% of latent features. Code is on GitHub.
Bi-Level Optimization for Multimodal LLM Judges
Introduces BLPO to optimize prompts for multimodal LLM-as-a-judge evaluating AI images. Overcomes context limits by converting images to text representations. Outperforms baselines on four datasets with three LLM judges.
BHI Framework Audits LLM Benchmarks
Introduces Benchmark Health Index (BHI), a data-driven framework to audit LLM benchmarks amid reliability issues like score inflation. Evaluates along three axes: Capability Discrimination, Anti-Saturation, and Impact. Analyzes 106 benchmarks from 91 models in 2025.
Benchmarking LLM Agents Under Noise
AgentNoiseBench evaluates the robustness of tool-using LLM agents in noisy real-world environments. It categorizes noise into user-noise and tool-noise and injects controllable perturbations into existing benchmarks. Results reveal consistent performance drops across models once perturbations are introduced.
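The controllable tool-noise injection can be sketched as a wrapper that makes a tunable fraction of tool calls fail. The noise rate, error payload, and function names here are invented for illustration; AgentNoiseBench's actual perturbation taxonomy is richer than a single failure mode.

```python
import random

def make_noisy_tool(tool_fn, tool_noise_rate=0.3, seed=0):
    """Wrap a tool so a controllable fraction of calls return a
    perturbed response instead of the real result (tool-noise sketch)."""
    rng = random.Random(seed)  # seeded for reproducible noise schedules

    def noisy(*args, **kwargs):
        if rng.random() < tool_noise_rate:
            # Illustrative failure payload standing in for real tool noise.
            return {"status": "error", "body": "503 Service Unavailable"}
        return {"status": "ok", "body": tool_fn(*args, **kwargs)}

    return noisy

search = make_noisy_tool(lambda q: f"results for {q}", tool_noise_rate=0.5)
outcomes = [search("weather")["status"] for _ in range(10)]
print(outcomes.count("ok"), outcomes.count("error"))
```

Seeding the noise source is what makes the perturbations "controllable": the same noise schedule can be replayed against every model under test, so score drops are attributable to the model rather than to luck.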
Benchmark for LLM Replication in Sciences
ReplicatorBench tests LLM agents on replicating social/behavioral science claims end-to-end. Covers extraction, experiments, and interpretation with replicable/non-replicable cases. ReplicatorAgent baselines show strengths in execution but weaknesses in data retrieval.
Behavioral Optimization for Proactive Agents
BAO uses agentic RL to train proactive LLM agents balancing performance and user engagement. Combines behavior enhancement with regularization to align with user expectations. Outperforms baselines on UserRL benchmarks.
AT-RL Reinforces MLLM Anchors for Reasoning
AT-RL selectively reinforces high-connectivity cross-modal anchor tokens (15% of the total) in MLLM RLVR via attention-graph clustering. A 32B model hits 80.2% on MathVista, beating a 72B baseline with only 1.2% overhead, while training on low-connectivity tokens degrades performance.
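A toy version of anchor-token selection: treat the attention matrix as a graph and rank tokens by summed incoming plus outgoing attention mass, keeping the top 15%. This simple connectivity proxy stands in for the paper's attention-graph clustering, and all names and shapes below are assumptions.

```python
import numpy as np

def select_anchor_tokens(attn, anchor_ratio=0.15):
    """Pick high-connectivity tokens from a (T, T) row-stochastic
    attention matrix using summed in- plus out-attention mass as a
    connectivity proxy (illustrative, not the paper's clustering)."""
    connectivity = attn.sum(axis=0) + attn.sum(axis=1)  # in + out mass
    k = max(1, int(round(anchor_ratio * attn.shape[0])))
    # Indices of the k most connected tokens, in ascending order.
    return np.sort(np.argsort(connectivity)[::-1][:k])

rng = np.random.default_rng(1)
logits = rng.normal(size=(20, 20))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

anchors = select_anchor_tokens(attn, anchor_ratio=0.15)
print(len(anchors))  # 3
```

Restricting the RL loss to these ~15% of tokens is what keeps the reported overhead low: gradients flow only through the positions that carry most of the cross-modal signal.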
ARC Learns Dynamic Agent Configurations
ARC introduces a reinforcement learning policy to dynamically configure LLM-based agent systems per query, selecting optimal workflows, tools, and prompts. It outperforms fixed templates on reasoning and tool-augmented QA benchmarks. The approach boosts accuracy by up to 25% while cutting token and runtime costs.
AIR Boosts LLM Agent Safety
AIR is the first incident response framework for LLM agents, focusing on detecting, containing, recovering from, and eradicating incidents post-occurrence. It integrates a domain-specific language into the agent's execution loop for autonomous management. Evaluations across agent types show over 90% success rates in all phases.
AgentLeak: Multi-Agent Privacy Leak Benchmark
AgentLeak introduces the first full-stack benchmark for privacy leakage in multi-agent LLM systems, covering internal channels such as inter-agent messages. It spans 1,000 scenarios across healthcare, finance, legal, and corporate domains. Tests on top models show that internal channels account for 68.9% of total leakage, which output-only audits miss.