All Updates
February 13, 2026
INTENT: Budget Planning for Tool Agents
INTENT is an inference-time planner for budget-constrained LLM agents that use costly tools. It leverages a hierarchical world model for intention-aware cost anticipation, and it outperforms baselines on StableToolBench under tight budgets and price shifts.
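To make the budget-constrained setting concrete, here is a minimal greedy sketch of budget-aware tool selection: rank candidate tools by expected utility per unit cost and take them while the budget lasts. The tool names, utilities, and costs are invented for illustration; INTENT's actual planner uses a hierarchical world model rather than this greedy rule.

```python
def plan_under_budget(candidate_tools, budget):
    """Greedy budget-aware tool selection (illustrative sketch).

    candidate_tools: list of dicts with hypothetical 'name', 'utility',
    and 'cost' fields; budget: total spend allowed for the episode.
    """
    # Rank by utility per unit cost, best first.
    ranked = sorted(candidate_tools,
                    key=lambda t: t["utility"] / t["cost"],
                    reverse=True)
    plan, spent = [], 0.0
    for tool in ranked:
        if spent + tool["cost"] <= budget:
            plan.append(tool["name"])
            spent += tool["cost"]
    return plan, spent

tools = [{"name": "search", "utility": 0.9, "cost": 0.2},
         {"name": "code_exec", "utility": 0.7, "cost": 0.5},
         {"name": "premium_api", "utility": 0.95, "cost": 1.0}]
plan, spent = plan_under_budget(tools, budget=0.8)
print(plan)  # ['search', 'code_exec']
```

A learned planner improves on this greedy baseline by anticipating future costs (e.g., saving budget for a tool that only pays off later), which is the gap the paper's world model targets.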
Human-Inspired Learning for Adaptive Reasoning
Proposes a framework for the continuous learning of internal reasoning processes in AI, unifying reasoning, action, reflection, and verification. It treats thinking trajectories as learning material for evolving cognitive structures during execution. Experiments show a 23.9% runtime reduction on sensor tasks.
GHOST Prunes Mamba2 Hidden States Efficiently
GHOST applies structured pruning to Mamba2 using forward-pass controllability and observability metrics, avoiding backpropagation entirely. It achieves a 50% state reduction with only a ~1 PPL rise on WikiText-2 across 130M-2.7B models. Code is available anonymously.
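The core idea, scoring state dimensions from forward-pass statistics alone, can be sketched as follows. This toy proxy ranks each state dimension by its activation energy weighted by how strongly the output projection reads it, then keeps the top half; GHOST's actual controllability/observability metrics and the Mamba2 tensor shapes differ, so treat every name here as an assumption.

```python
import numpy as np

def prune_state_dims(hidden_states, out_proj, keep_ratio=0.5):
    """Rank state dimensions by a forward-pass observability proxy
    and keep the top `keep_ratio` fraction (no backprop needed).

    hidden_states: (T, D) states from a calibration forward pass.
    out_proj: (D, V) output projection; both are illustrative
    stand-ins for the real Mamba2 tensors.
    """
    state_energy = (hidden_states ** 2).mean(axis=0)   # (D,) activation energy
    read_strength = np.linalg.norm(out_proj, axis=1)   # (D,) output read norm
    scores = state_energy * read_strength
    k = max(1, int(keep_ratio * len(scores)))
    # Indices of the k highest-scoring dims, in ascending order.
    return np.sort(np.argsort(scores)[::-1][:k])

rng = np.random.default_rng(0)
H = rng.normal(size=(128, 16))
W = rng.normal(size=(16, 8))
kept = prune_state_dims(H, W, keep_ratio=0.5)
print(len(kept))  # 8
```

Because the scores come from a single forward pass over calibration data, the pruning decision costs far less than gradient-based saliency methods, which is the efficiency claim in the summary above.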
Exposing Ground Truth Illusion in Annotations
Literature review critiques 'ground truth' in ML data annotation as a positivistic fallacy ignoring human subjectivity. Analyzes 346 papers from top venues revealing biases like anchoring and geographic hegemony. Proposes roadmap for pluralistic infrastructures embracing disagreement.
ERM Fixes Causal Rung Collapse in LLMs
New research identifies 'rung collapse' in LLMs, where models confuse associations with causal interventions, leading to flawed reasoning under distributional shifts. It proposes Epistemic Regret Minimization (ERM), a belief revision method that penalizes causal errors independently of task success. Experiments across six frontier LLMs show ERM recovers 53-59% of entrenched errors.
DrIGM Enables Robust Multi-Agent RL
DrIGM introduces a distributionally robust Individual-Global-Max (IGM) principle for MARL, ensuring decentralized actions stay aligned with the joint policy under uncertainty via robust value factorization. It is compatible with VDN/QMIX/QTRAN without reward shaping and boosts OOD performance in SustainGym and StarCraft.
Decision-Valued Maps in DecisionDB
Formalizes decision-valued maps that track how representation choices affect downstream outcomes. DecisionDB logs, replays, and audits decisions using content-based IDs and write-once storage, and it partitions the representation space into persistence regions.
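The content-based-ID plus write-once storage pattern can be sketched in a few lines: each decision record is hashed to derive its ID, so identical content always maps to the same ID and an existing record is never overwritten. The class and method names below are illustrative, not DecisionDB's actual API.

```python
import hashlib
import json

class DecisionLog:
    """Minimal content-addressed, write-once decision log (sketch)."""

    def __init__(self):
        self._store = {}

    def log(self, representation, outcome):
        # Canonical serialization so equal content yields equal bytes.
        record = json.dumps({"repr": representation, "outcome": outcome},
                            sort_keys=True)
        rid = hashlib.sha256(record.encode()).hexdigest()[:16]
        # Write-once: setdefault never overwrites an existing record.
        self._store.setdefault(rid, record)
        return rid

    def replay(self, rid):
        """Return the logged decision for auditing or replay."""
        return json.loads(self._store[rid])

db = DecisionLog()
a = db.log({"feature_set": "v1"}, "approve")
b = db.log({"feature_set": "v1"}, "approve")  # same content, same ID
print(a == b)  # True
```

Content addressing makes audits tamper-evident: any change to a record changes its ID, so a replayed decision provably matches what was originally logged.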
DashAI No-Code XAI User Study
DashAI introduces a human-centered XAI module integrating PDP, PFI, and KernelSHAP for no-code ML users. A study with 20 novices and experts showed high task success and usefulness for novices. Explanations boosted trust, especially among beginners.
Crosscoders Unlock Cross-Architecture LLM Diffing
Researchers apply Crosscoders for the first time to compare LLMs across different architectures, introducing Dedicated Feature Crosscoders (DFCs) to isolate unique model features. The method detects, without supervision, behaviors such as Chinese Communist Party alignment in Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and copyright refusals in GPT-OSS-20B.
CausalAgent: Conversational Causal Inference
CausalAgent is a multi-agent system automating end-to-end causal inference via natural language. It integrates a multi-agent system (MAS), retrieval-augmented generation (RAG), and MCP, covering the pipeline from data cleaning to report generation. Interactive visualizations lower the barrier for non-experts.
C-JEPA Learns World Models via Object Masking
C-JEPA extends masked joint embedding prediction to object-centric representations with object-level masking, inducing latent interventions for interaction reasoning. It boosts counterfactual VQA by 20% and enables efficient agent planning using 1% of latent features. Code is on GitHub.
Bi-Level Optimization for Multimodal LLM Judges
Introduces BLPO to optimize prompts for multimodal LLM-as-a-judge evaluating AI images. Overcomes context limits by converting images to text representations. Outperforms baselines on four datasets with three LLM judges.
BHI Framework Audits LLM Benchmarks
Introduces Benchmark Health Index (BHI), a data-driven framework to audit LLM benchmarks amid reliability issues like score inflation. Evaluates along three axes: Capability Discrimination, Anti-Saturation, and Impact. Analyzes 106 benchmarks from 91 models in 2025.
Benchmarking LLM Agents Under Noise
AgentNoiseBench evaluates the robustness of tool-using LLM agents in noisy real-world environments. It categorizes noise into user-noise and tool-noise and injects controllable perturbations into existing benchmarks. Results reveal consistent performance drops across models once perturbations are introduced.
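The controllable tool-noise injection can be sketched as a wrapper that makes a tunable fraction of tool calls fail. The noise rate, error payload, and function names here are invented for illustration; AgentNoiseBench's actual perturbation taxonomy is richer than a single failure mode.

```python
import random

def make_noisy_tool(tool_fn, tool_noise_rate=0.3, seed=0):
    """Wrap a tool so a controllable fraction of calls return a
    perturbed response instead of the real result (tool-noise sketch)."""
    rng = random.Random(seed)  # seeded for reproducible noise schedules

    def noisy(*args, **kwargs):
        if rng.random() < tool_noise_rate:
            # Illustrative failure payload standing in for real tool noise.
            return {"status": "error", "body": "503 Service Unavailable"}
        return {"status": "ok", "body": tool_fn(*args, **kwargs)}

    return noisy

search = make_noisy_tool(lambda q: f"results for {q}", tool_noise_rate=0.5)
outcomes = [search("weather")["status"] for _ in range(10)]
print(outcomes.count("ok"), outcomes.count("error"))
```

Seeding the noise source is what makes the perturbations "controllable": the same noise schedule can be replayed against every model under test, so score drops are attributable to the model rather than to luck.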
Benchmark for LLM Replication in Sciences
ReplicatorBench tests LLM agents on replicating social/behavioral science claims end-to-end. Covers extraction, experiments, and interpretation with replicable/non-replicable cases. ReplicatorAgent baselines show strengths in execution but weaknesses in data retrieval.
Behavioral Optimization for Proactive Agents
BAO uses agentic RL to train proactive LLM agents balancing performance and user engagement. Combines behavior enhancement with regularization to align with user expectations. Outperforms baselines on UserRL benchmarks.
AT-RL Reinforces MLLM Anchors for Reasoning
AT-RL selectively reinforces high-connectivity cross-modal anchor tokens (15% of the total) in MLLM RLVR via attention-graph clustering. A 32B model hits 80.2% on MathVista, beating a 72B baseline with only 1.2% overhead, while training on low-connectivity tokens degrades performance.
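A toy version of anchor-token selection: treat the attention matrix as a graph and rank tokens by summed incoming plus outgoing attention mass, keeping the top 15%. This simple connectivity proxy stands in for the paper's attention-graph clustering, and all names and shapes below are assumptions.

```python
import numpy as np

def select_anchor_tokens(attn, anchor_ratio=0.15):
    """Pick high-connectivity tokens from a (T, T) row-stochastic
    attention matrix using summed in- plus out-attention mass as a
    connectivity proxy (illustrative, not the paper's clustering)."""
    connectivity = attn.sum(axis=0) + attn.sum(axis=1)  # in + out mass
    k = max(1, int(round(anchor_ratio * attn.shape[0])))
    # Indices of the k most connected tokens, in ascending order.
    return np.sort(np.argsort(connectivity)[::-1][:k])

rng = np.random.default_rng(1)
logits = rng.normal(size=(20, 20))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

anchors = select_anchor_tokens(attn, anchor_ratio=0.15)
print(len(anchors))  # 3
```

Restricting the RL loss to these ~15% of tokens is what keeps the reported overhead low: gradients flow only through the positions that carry most of the cross-modal signal.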
ARC Learns Dynamic Agent Configurations
ARC introduces a reinforcement learning policy to dynamically configure LLM-based agent systems per query, selecting optimal workflows, tools, and prompts. It outperforms fixed templates on reasoning and tool-augmented QA benchmarks. The approach boosts accuracy by up to 25% while cutting token and runtime costs.
AIR Boosts LLM Agent Safety
AIR is the first incident response framework for LLM agents, focusing on detecting, containing, recovering from, and eradicating incidents post-occurrence. It integrates a domain-specific language into the agent's execution loop for autonomous management. Evaluations across agent types show over 90% success rates in all phases.
AgentLeak: Multi-Agent Privacy Leak Benchmark
AgentLeak introduces the first full-stack benchmark for privacy leakage in multi-agent LLM systems, covering internal channels such as inter-agent messages. It spans 1,000 scenarios across healthcare, finance, legal, and corporate domains. Tests on top models show that internal channels account for 68.9% of total leakage, which output-only audits miss.