All Updates
February 13, 2026
Bi-Level Optimization for Multimodal LLM Judges
Introduces BLPO to optimize prompts for multimodal LLM-as-a-judge systems that evaluate AI-generated images. Overcomes context limits by converting images to text representations. Outperforms baselines on four datasets with three LLM judges.
BHI Framework Audits LLM Benchmarks
Introduces the Benchmark Health Index (BHI), a data-driven framework for auditing LLM benchmarks amid reliability issues such as score inflation. Evaluates along three axes: Capability Discrimination, Anti-Saturation, and Impact. Analyzes 106 benchmarks using 2025 results from 91 models.
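The summary names the three axes but not how they combine. A minimal sketch, assuming the index is a weighted combination of per-axis scores (the axis definitions, 0-1 scaling, and weights below are illustrative assumptions, not BHI's actual formula):

```python
from dataclasses import dataclass


@dataclass
class BenchmarkAxes:
    # All three scores are assumed to be normalized to [0, 1].
    discrimination: float   # how well scores separate strong from weak models
    anti_saturation: float  # headroom left before scores plateau at the ceiling
    impact: float           # adoption/citation signal in the community


def health_index(axes: BenchmarkAxes,
                 weights: tuple[float, float, float] = (0.4, 0.4, 0.2)) -> float:
    """Weighted average of the three axes; the weights are illustrative."""
    w_d, w_s, w_i = weights
    return w_d * axes.discrimination + w_s * axes.anti_saturation + w_i * axes.impact


# Example: a benchmark that discriminates well but is nearly saturated.
score = health_index(BenchmarkAxes(discrimination=0.9, anti_saturation=0.2, impact=0.7))
print(round(score, 2))  # 0.4*0.9 + 0.4*0.2 + 0.2*0.7 = 0.58
```

A single scalar like this makes benchmarks directly comparable, at the cost of hiding which axis is failing; reporting the per-axis scores alongside the index avoids that.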
Benchmarking LLM Agents Under Noise
AgentNoiseBench evaluates tool-using LLM agents' robustness in noisy real-world environments. Categorizes noise into user-noise and tool-noise; injects controllable perturbations into benchmarks. Reveals performance drops across models under perturbations.
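The two noise categories can be sketched as perturbation operators applied before the agent sees the input. The operators below (character typos for user-noise, field dropout for tool-noise) are hypothetical stand-ins, not the benchmark's actual perturbations:

```python
import random


def inject_user_noise(utterance: str, rate: float, rng: random.Random) -> str:
    """User-noise stand-in: replace letters with random ones at the given rate."""
    chars = list(utterance)
    for i in range(len(chars)):
        if chars[i].isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)


def inject_tool_noise(tool_result: dict, drop_rate: float, rng: random.Random) -> dict:
    """Tool-noise stand-in: randomly drop fields to mimic flaky/partial responses."""
    return {k: v for k, v in tool_result.items() if rng.random() >= drop_rate}


rng = random.Random(0)
print(inject_user_noise("book a flight to paris", rate=0.2, rng=rng))
print(inject_tool_noise({"price": 120, "currency": "EUR", "stops": 1}, 0.3, rng))
```

Making the perturbations controllable (a seed plus a rate) is what lets a benchmark report performance as a function of noise level rather than a single noisy score.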
Benchmark for LLM Replication in Sciences
ReplicatorBench tests LLM agents on replicating social/behavioral science claims end-to-end. Covers extraction, experiments, and interpretation with replicable/non-replicable cases. ReplicatorAgent baselines show strengths in execution but weaknesses in data retrieval.
Behavioral Optimization for Proactive Agents
BAO uses agentic RL to train proactive LLM agents balancing performance and user engagement. Combines behavior enhancement with regularization to align with user expectations. Outperforms baselines on UserRL benchmarks.
AT-RL Reinforces MLLM Anchors for Reasoning
AT-RL selectively reinforces high-connectivity cross-modal anchor tokens (15% of total) in MLLM RLVR via attention graph clustering. 32B model hits 80.2% on MathVista, beating 72B baseline with 1.2% overhead. Low-connectivity training degrades performance.
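A toy version of keeping only the top-15% most connected tokens, using total attention mass (given plus received) as a simple connectivity proxy; the paper's attention-graph clustering is not reproduced here, and this selection rule is an assumption:

```python
def select_anchor_tokens(attn, keep_frac=0.15):
    """attn: T x T attention matrix (list of lists).

    Returns indices of the high-connectivity tokens, most connected first.
    Connectivity proxy: attention given (row sum) + attention received (column sum).
    """
    T = len(attn)
    connectivity = [sum(attn[i]) + sum(row[i] for row in attn) for i in range(T)]
    k = max(1, round(keep_frac * T))
    return sorted(range(T), key=lambda i: connectivity[i], reverse=True)[:k]


# Toy 4-token example: token 0 receives the most attention overall.
attn = [
    [0.0, 0.6, 0.2, 0.2],
    [0.7, 0.0, 0.2, 0.1],
    [0.5, 0.3, 0.0, 0.2],
    [0.6, 0.2, 0.1, 0.1],
]
print(select_anchor_tokens(attn))  # [0] -- 15% of 4 tokens rounds to 1 anchor
```

Restricting the RL update to a small anchor set is what keeps the reported overhead low: the selection cost is amortized while the gradient touches only the chosen tokens.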
ARC Learns Dynamic Agent Configurations
ARC introduces a reinforcement learning policy to dynamically configure LLM-based agent systems per query, selecting optimal workflows, tools, and prompts. It outperforms fixed templates on reasoning and tool-augmented QA benchmarks. The approach boosts accuracy by up to 25% while cutting token and runtime costs.
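Per-query configuration can be pictured as a policy scoring (workflow, tool, prompt) combinations against query features. The toy value function below is hand-written for illustration; ARC's actual RL training loop and configuration space are not reproduced, and all names here are hypothetical:

```python
from itertools import product

# Illustrative configuration space (not ARC's actual options).
WORKFLOWS = ["direct", "plan-then-act"]
TOOLS = ["none", "search", "calculator"]
PROMPTS = ["concise", "chain-of-thought"]


def featurize(query: str) -> dict:
    """Cheap query features standing in for a learned state representation."""
    return {
        "needs_math": any(c.isdigit() for c in query),
        "is_long": len(query.split()) > 12,
    }


def score(config: tuple, feats: dict) -> float:
    """Hand-written stand-in for a learned value function."""
    workflow, tool, prompt = config
    s = 0.0
    if feats["needs_math"] and tool == "calculator":
        s += 1.0
    if feats["is_long"] and workflow == "plan-then-act":
        s += 0.5
    if prompt == "chain-of-thought" and (feats["needs_math"] or feats["is_long"]):
        s += 0.3
    return s


def select_config(query: str) -> tuple:
    feats = featurize(query)
    return max(product(WORKFLOWS, TOOLS, PROMPTS), key=lambda c: score(c, feats))


print(select_config("what is 17 * 243 minus 12?"))
```

The cost savings come from the same mechanism: a short query that needs no tools gets the cheap direct workflow instead of a fixed heavyweight template.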
AIR Boosts LLM Agent Safety
AIR is the first incident response framework for LLM agents, focusing on detecting, containing, recovering from, and eradicating incidents post-occurrence. It integrates a domain-specific language into the agent's execution loop for autonomous management. Evaluations across agent types show over 90% success rates in all phases.
AgentLeak: Multi-Agent Privacy Leak Benchmark
AgentLeak introduces the first full-stack benchmark for privacy leakage in multi-agent LLM systems, covering internal channels like inter-agent messages. It spans 1,000 scenarios across healthcare, finance, legal, and corporate domains. Tests on top models show internal channels cause 68.9% total leakage, missed by output audits.
Gemini 3 Deep Think Dominates Programming
Gemini 3 Deep Think receives a major upgrade, achieving state-of-the-art results across domains, especially programming, where only seven people globally outperform it. A Google VP describes it as a side project.
Ex-Researcher Warns on ChatGPT Ads
Former OpenAI researcher Zoë Hitzig warns that ads in ChatGPT risk manipulating users the way Facebook's did. She left after the company began testing ads, citing privacy concerns over the intimate thoughts users share. She is now at Harvard.
OpenAI Adds Ads to ChatGPT
OpenAI is launching ads on ChatGPT this week amid billions in funding needs. CEO Sam Altman previously opposed ads, calling them a last resort that could erode user trust. This shift aims to bolster revenue for the AI leader.
Gemini 3 Deep Think: Sketch to 3D
Google unveiled a major upgrade to Gemini 3 Deep Think, a reasoning model for science, research, and engineering. Google AI Ultra subscribers can access it now in the Gemini App. Early API access is open to select researchers, engineers, and enterprises.
Anthropic Targets OpenAI in Super Bowl Ads
AI companies with deep user privacy access are rushing to monetize via ads amid lax regulation. Anthropic's Super Bowl ads satirized OpenAI's vulnerabilities without naming them. This highlights reliance on corporate ethics to prevent privacy abuse.
AI Safety Leader Quits for Poetry
An AI safety leader warns of global peril and resigns to study poetry. This coincides with an OpenAI researcher quitting over ChatGPT ad testing plans.
Samsung First to Ship HBM4 Memory
Samsung claims to be the first to ship HBM4 memory, a day after Micron's announcement. HBM4 provides faster, denser RAM for next-gen AI hardware. The timing aligns with Nvidia's Vera Rubin GPU schedule.