All Updates

Page 594 of 612

February 13, 2026

📄
ArXiv AI•51d ago

Bi-Level Optimization for Multimodal LLM Judges

Introduces BLPO to optimize prompts for multimodal LLM-as-a-judge evaluating AI images. Overcomes context limits by converting images to text representations. Outperforms baselines on four datasets with three LLM judges.

#research#blpo#multimodal
📄
ArXiv AI•51d ago

BHI Framework Audits LLM Benchmarks

Introduces Benchmark Health Index (BHI), a data-driven framework to audit LLM benchmarks amid reliability issues like score inflation. Evaluates along three axes: Capability Discrimination, Anti-Saturation, and Impact. Analyzes 106 benchmarks from 91 models in 2025.

#research#bhi#v1
📄
ArXiv AI•51d ago

Benchmarking LLM Agents Under Noise

AgentNoiseBench evaluates tool-using LLM agents' robustness in noisy real-world environments. Categorizes noise into user-noise and tool-noise; injects controllable perturbations into benchmarks. Reveals performance drops across models under perturbations.

#research#arxiv#agentnoisebench
📄
ArXiv AI•51d ago

Benchmark for LLM Replication in Sciences

ReplicatorBench tests LLM agents on replicating social/behavioral science claims end-to-end. Covers extraction, experiments, and interpretation with replicable/non-replicable cases. ReplicatorAgent baselines show strengths in execution but weaknesses in data retrieval.

#research#replicatorbench#llm-agents
📄
ArXiv AI•51d ago

Behavioral Optimization for Proactive Agents

BAO uses agentic RL to train proactive LLM agents balancing performance and user engagement. Combines behavior enhancement with regularization to align with user expectations. Outperforms baselines on UserRL benchmarks.

#research#bao#llm-agents
📄
ArXiv AI•51d ago

AT-RL Reinforces MLLM Anchors for Reasoning

AT-RL selectively reinforces high-connectivity cross-modal anchor tokens (15% of total) in MLLM RLVR via attention graph clustering. 32B model hits 80.2% on MathVista, beating 72B baseline with 1.2% overhead. Low-connectivity training degrades performance.

#research#mllm#at-rl
📄
ArXiv AI•51d ago

ARC Learns Dynamic Agent Configurations

ARC introduces a reinforcement learning policy to dynamically configure LLM-based agent systems per query, selecting optimal workflows, tools, and prompts. It outperforms fixed templates on reasoning and tool-augmented QA benchmarks. The approach boosts accuracy by up to 25% while cutting token and runtime costs.

#research#arc#llm-agents
📄
ArXiv AI•51d ago

AIR Boosts LLM Agent Safety

AIR is the first incident response framework for LLM agents, focusing on detecting, containing, recovering from, and eradicating incidents post-occurrence. It integrates a domain-specific language into the agent's execution loop for autonomous management. Evaluations across agent types show over 90% success rates in all phases.

#research#air#llm-agents
📄
ArXiv AI•51d ago

AgentLeak: Multi-Agent Privacy Leak Benchmark

AgentLeak introduces the first full-stack benchmark for privacy leakage in multi-agent LLM systems, covering internal channels like inter-agent messages. It spans 1,000 scenarios across healthcare, finance, legal, and corporate domains. Tests on top models show internal channels account for 68.9% of total leakage, which output audits miss.

#research#agentleak#multi-agent
🇨🇳
cnBeta (Full RSS)•51d ago

Gemini 3 Deep Think Dominates Programming

Gemini 3 Deep Think receives a major upgrade, achieving state-of-the-art results across domains, especially programming. Only 7 people globally outperform it. Google VP shares it as a side project.

#update#google-gemini#gemini-3
🇨🇳
cnBeta (Full RSS)•51d ago

Gemini 3 Deep Think Dominates Coding

Gemini 3 Deep Think upgrade achieves SOTA across domains, especially programming where only 7 people worldwide outperform it. This Google VP side project marks a new era in AI reasoning. It showcases unprecedented inference capabilities.

#update#gemini#gemini-3
🇨🇳
cnBeta (Full RSS)•51d ago

Ex-Researcher Warns on ChatGPT Ads

Former OpenAI researcher Zoë Hitzig warns that ads in ChatGPT risk manipulating users, as happened at Facebook. She left after ad testing began, citing privacy concerns over the intimate thoughts users share. She is now at Harvard.

#research#openai-chatgpt#advertising
🇨🇳
cnBeta (Full RSS)•51d ago

Ex-Researcher Warns ChatGPT Ads Risks

Former OpenAI researcher Zoë Hitzig quit after testing ChatGPT ads, warning of manipulation risks from users' private data. She compares it to Facebook's pitfalls. Now at Harvard, she urges caution on ad systems.

#research#chatgpt#advertising
🇨🇳
cnBeta (Full RSS)•51d ago

OpenAI Adds Ads to ChatGPT

OpenAI is launching ads on ChatGPT this week amid billions in funding needs. CEO Sam Altman previously opposed ads, calling them a last resort that could erode user trust. This shift aims to bolster revenue for the AI leader.

#update#openai#chatgpt
🇨🇳
cnBeta (Full RSS)•51d ago

Gemini 3 Deep Think: Sketch to 3D

Google unveiled a major upgrade to Gemini 3 Deep Think, a reasoning model for science, research, and engineering. Google AI Ultra subscribers can access it now in the Gemini App. Early API access is open to select researchers, engineers, and enterprises.

#update#google#gemini-3
🇨🇳
cnBeta (Full RSS)•51d ago

Gemini 3 Deep Think: Sketch-to-3D Upgrade

Google announced a major upgrade to Gemini 3 Deep Think, a reasoning model for science, research, and engineering. Google AI Ultra subscribers can access it via the Gemini App. Early API access is open to select researchers, engineers, and enterprises.

#update#google-gemini#3-deep-think
🇨🇳
cnBeta (Full RSS)•51d ago

Anthropic Targets OpenAI in Super Bowl Ads

AI companies with deep access to users' private data are rushing to monetize via ads amid lax regulation. Anthropic's Super Bowl ads satirized OpenAI's vulnerabilities without naming the company. This highlights reliance on corporate ethics to prevent privacy abuse.

#anthropic#advertising#privacy
🇬🇧
BBC Technology•51d ago

AI Safety Leader Quits for Poetry

An AI safety leader warns of global peril and resigns to study poetry. This coincides with an OpenAI researcher quitting over ChatGPT ad testing plans.

#resignation#openai#ai-safety
🇬🇧
BBC Technology•51d ago

AI Safety Chief Quits, Cites Global Peril

A prominent AI safety leader resigned, warning the world is in peril, to study poetry. This follows an OpenAI researcher's exit over plans to test ChatGPT ads. The moves highlight tensions in AI development and commercialization.

#chatgpt#ai-safety#advertising
🇬🇧
The Register - AI/ML•51d ago

Samsung First Ships HBM4 Memory

Samsung claims to be the first to ship HBM4 memory, a day after Micron's announcement. HBM4 provides faster, denser RAM for next-gen AI hardware. This aligns with Nvidia's Vera Rubin GPU timeline.

#launch#samsung#hbm4