All Updates
February 13, 2026
Bi-Level Optimization for Multimodal LLM Judges
Introduces BLPO to optimize prompts for multimodal LLM-as-a-judge systems that evaluate AI-generated images. Overcomes context limits by converting images to text representations. Outperforms baselines on four datasets with three LLM judges.
BHI Framework Audits LLM Benchmarks
Introduces the Benchmark Health Index (BHI), a data-driven framework for auditing LLM benchmarks amid reliability issues such as score inflation. Evaluates along three axes: Capability Discrimination, Anti-Saturation, and Impact. Analyzes 106 benchmarks using 2025 results from 91 models.
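The summary names the three axes but not how they combine. A minimal sketch, assuming the index is a weighted combination of per-axis scores (the axis definitions, 0-1 scaling, and weights below are illustrative assumptions, not BHI's actual formula):

```python
from dataclasses import dataclass


@dataclass
class BenchmarkAxes:
    # All three scores are assumed to be normalized to [0, 1].
    discrimination: float   # how well scores separate strong from weak models
    anti_saturation: float  # headroom left before scores plateau at the ceiling
    impact: float           # adoption/citation signal in the community


def health_index(axes: BenchmarkAxes,
                 weights: tuple[float, float, float] = (0.4, 0.4, 0.2)) -> float:
    """Weighted average of the three axes; the weights are illustrative."""
    w_d, w_s, w_i = weights
    return w_d * axes.discrimination + w_s * axes.anti_saturation + w_i * axes.impact


# Example: a benchmark that discriminates well but is nearly saturated.
score = health_index(BenchmarkAxes(discrimination=0.9, anti_saturation=0.2, impact=0.7))
print(round(score, 2))  # 0.4*0.9 + 0.4*0.2 + 0.2*0.7 = 0.58
```

A single scalar like this makes benchmarks directly comparable, at the cost of hiding which axis is failing; reporting the per-axis scores alongside the index avoids that.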
Benchmarking LLM Agents Under Noise
AgentNoiseBench evaluates tool-using LLM agents' robustness in noisy real-world environments. Categorizes noise into user-noise and tool-noise; injects controllable perturbations into benchmarks. Reveals performance drops across models under perturbations.
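The two noise categories can be sketched as perturbation operators applied before the agent sees the input. The operators below (character typos for user-noise, field dropout for tool-noise) are hypothetical stand-ins, not the benchmark's actual perturbations:

```python
import random


def inject_user_noise(utterance: str, rate: float, rng: random.Random) -> str:
    """User-noise stand-in: replace letters with random ones at the given rate."""
    chars = list(utterance)
    for i in range(len(chars)):
        if chars[i].isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)


def inject_tool_noise(tool_result: dict, drop_rate: float, rng: random.Random) -> dict:
    """Tool-noise stand-in: randomly drop fields to mimic flaky/partial responses."""
    return {k: v for k, v in tool_result.items() if rng.random() >= drop_rate}


rng = random.Random(0)
print(inject_user_noise("book a flight to paris", rate=0.2, rng=rng))
print(inject_tool_noise({"price": 120, "currency": "EUR", "stops": 1}, 0.3, rng))
```

Making the perturbations controllable (a seed plus a rate) is what lets a benchmark report performance as a function of noise level rather than a single noisy score.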
Benchmark for LLM Replication in Sciences
ReplicatorBench tests LLM agents on replicating social/behavioral science claims end-to-end. Covers extraction, experiments, and interpretation with replicable/non-replicable cases. ReplicatorAgent baselines show strengths in execution but weaknesses in data retrieval.
Behavioral Optimization for Proactive Agents
BAO uses agentic RL to train proactive LLM agents balancing performance and user engagement. Combines behavior enhancement with regularization to align with user expectations. Outperforms baselines on UserRL benchmarks.
AT-RL Reinforces MLLM Anchors for Reasoning
AT-RL selectively reinforces high-connectivity cross-modal anchor tokens (15% of total) in MLLM RLVR via attention graph clustering. 32B model hits 80.2% on MathVista, beating 72B baseline with 1.2% overhead. Low-connectivity training degrades performance.
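A toy version of keeping only the top-15% most connected tokens, using total attention mass (given plus received) as a simple connectivity proxy; the paper's attention-graph clustering is not reproduced here, and this selection rule is an assumption:

```python
def select_anchor_tokens(attn, keep_frac=0.15):
    """attn: T x T attention matrix (list of lists).

    Returns indices of the high-connectivity tokens, most connected first.
    Connectivity proxy: attention given (row sum) + attention received (column sum).
    """
    T = len(attn)
    connectivity = [sum(attn[i]) + sum(row[i] for row in attn) for i in range(T)]
    k = max(1, round(keep_frac * T))
    return sorted(range(T), key=lambda i: connectivity[i], reverse=True)[:k]


# Toy 4-token example: token 0 receives the most attention overall.
attn = [
    [0.0, 0.6, 0.2, 0.2],
    [0.7, 0.0, 0.2, 0.1],
    [0.5, 0.3, 0.0, 0.2],
    [0.6, 0.2, 0.1, 0.1],
]
print(select_anchor_tokens(attn))  # [0] -- 15% of 4 tokens rounds to 1 anchor
```

Restricting the RL update to a small anchor set is what keeps the reported overhead low: the selection cost is amortized while the gradient touches only the chosen tokens.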
ARC Learns Dynamic Agent Configurations
ARC introduces a reinforcement learning policy to dynamically configure LLM-based agent systems per query, selecting optimal workflows, tools, and prompts. It outperforms fixed templates on reasoning and tool-augmented QA benchmarks. The approach boosts accuracy by up to 25% while cutting token and runtime costs.
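Per-query configuration can be pictured as a policy scoring (workflow, tool, prompt) combinations against query features. The toy value function below is hand-written for illustration; ARC's actual RL training loop and configuration space are not reproduced, and all names here are hypothetical:

```python
from itertools import product

# Illustrative configuration space (not ARC's actual options).
WORKFLOWS = ["direct", "plan-then-act"]
TOOLS = ["none", "search", "calculator"]
PROMPTS = ["concise", "chain-of-thought"]


def featurize(query: str) -> dict:
    """Cheap query features standing in for a learned state representation."""
    return {
        "needs_math": any(c.isdigit() for c in query),
        "is_long": len(query.split()) > 12,
    }


def score(config: tuple, feats: dict) -> float:
    """Hand-written stand-in for a learned value function."""
    workflow, tool, prompt = config
    s = 0.0
    if feats["needs_math"] and tool == "calculator":
        s += 1.0
    if feats["is_long"] and workflow == "plan-then-act":
        s += 0.5
    if prompt == "chain-of-thought" and (feats["needs_math"] or feats["is_long"]):
        s += 0.3
    return s


def select_config(query: str) -> tuple:
    feats = featurize(query)
    return max(product(WORKFLOWS, TOOLS, PROMPTS), key=lambda c: score(c, feats))


print(select_config("what is 17 * 243 minus 12?"))
```

The cost savings come from the same mechanism: a short query that needs no tools gets the cheap direct workflow instead of a fixed heavyweight template.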
AIR Boosts LLM Agent Safety
AIR is the first incident response framework for LLM agents, focusing on detecting, containing, recovering from, and eradicating incidents post-occurrence. It integrates a domain-specific language into the agent's execution loop for autonomous management. Evaluations across agent types show over 90% success rates in all phases.
AgentLeak: Multi-Agent Privacy Leak Benchmark
AgentLeak introduces the first full-stack benchmark for privacy leakage in multi-agent LLM systems, covering internal channels like inter-agent messages. It spans 1,000 scenarios across healthcare, finance, legal, and corporate domains. Tests on top models show internal channels cause 68.9% total leakage, missed by output audits.
Gemini 3 Deep Think Dominates Programming
Gemini 3 Deep Think receives a major upgrade, achieving state-of-the-art results across domains, especially programming, where only seven people globally outperform it. A Google VP describes it as a side project.
Ex-Researcher Warns on ChatGPT Ads
Former OpenAI researcher Zoë Hitzig warns that ads in ChatGPT risk manipulating users the way Facebook's did. She left after the company began testing ads, citing privacy concerns over the intimate thoughts users share. She is now at Harvard.
OpenAI Adds Ads to ChatGPT
OpenAI is launching ads on ChatGPT this week amid billions in funding needs. CEO Sam Altman previously opposed ads, calling them a last resort that could erode user trust. This shift aims to bolster revenue for the AI leader.
Gemini 3 Deep Think: Sketch to 3D
Google unveiled a major upgrade to Gemini 3 Deep Think, a reasoning model for science, research, and engineering. Google AI Ultra subscribers can access it now in the Gemini App. Early API access is open to select researchers, engineers, and enterprises.
Anthropic Targets OpenAI in Super Bowl Ads
AI companies with deep user privacy access are rushing to monetize via ads amid lax regulation. Anthropic's Super Bowl ads satirized OpenAI's vulnerabilities without naming them. This highlights reliance on corporate ethics to prevent privacy abuse.
AI Safety Leader Quits for Poetry
An AI safety leader warns of global peril and resigns to study poetry. This coincides with an OpenAI researcher quitting over ChatGPT ad testing plans.
Samsung First to Ship HBM4 Memory
Samsung claims to be the first to ship HBM4 memory, a day after Micron's announcement. HBM4 provides faster, denser RAM for next-gen AI hardware. The timing aligns with Nvidia's Vera Rubin GPU schedule.