All Updates

Page 708 of 751

February 17, 2026

📄
ArXiv AI • 61d ago

PlotChain Benchmark for MLLM Plot Reading

PlotChain introduces a deterministic benchmark for evaluating multimodal LLMs on extracting quantitative values from engineering plots like Bode and FFT. It features 450 plots across 15 families with ground truth and checkpoint diagnostics for failure analysis. Top models score ~80% (Gemini 2.5 Pro leads), but frequency tasks remain weak.

#multimodal-llm #engineering-plots
📄
ArXiv AI • 61d ago

NL2LOGIC: 99% Accurate NL-to-FOL Translation

NL2LOGIC is a new framework using abstract syntax trees (AST) to translate natural language into first-order logic via large language models. It combines a recursive LLM semantic parser with an AST-guided generator for high syntactic accuracy and semantic faithfulness. Benchmarks show 99% syntactic accuracy, up to 30% semantic gains, and 31% reasoning improvement when integrated with Logic-LM.
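
To illustrate why generating from an AST tends to give high syntactic accuracy, here is a minimal sketch in which a formula is built as a tree and rendered to text only at the end. The node types (`Pred`, `Implies`, `ForAll`) and the `render` function are hypothetical names chosen for this example, not NL2LOGIC's actual data structures.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Pred:
    """Atomic predicate such as Human(x). Hypothetical node type."""
    name: str
    args: Tuple[str, ...]


@dataclass(frozen=True)
class Implies:
    """Implication between two sub-formulas."""
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class ForAll:
    """Universal quantifier binding one variable."""
    var: str
    body: "Formula"


Formula = Union[Pred, Implies, ForAll]


def render(node: Formula) -> str:
    """Render an AST to a first-order logic string.

    Every string comes from a well-formed tree, so parentheses and
    quantifier syntax cannot be left unbalanced by the generator.
    """
    if isinstance(node, Pred):
        return f"{node.name}({', '.join(node.args)})"
    if isinstance(node, Implies):
        return f"({render(node.left)} → {render(node.right)})"
    if isinstance(node, ForAll):
        return f"∀{node.var} {render(node.body)}"
    raise TypeError(f"unknown node type: {type(node).__name__}")


# "All humans are mortal."
formula = ForAll("x", Implies(Pred("Human", ("x",)), Pred("Mortal", ("x",))))
text = render(formula)
```

In this sketch the language model would only have to choose node types and arguments; the surface syntax is produced mechanically by `render`.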

#abstract-syntax-tree #first-order-logic #semantic-parsing
📄
ArXiv AI • 61d ago

MAPLE: Sub-Agent Design for AI Personalization

MAPLE addresses LLM agent limitations by separating memory, learning, and personalization into dedicated sub-agents. Memory manages storage and retrieval, Learning extracts insights asynchronously, and Personalization applies them in real time. It boosts personalization scores by 14.6% and raises trait incorporation from 45% to 75% on the MAPLE-Personas benchmark.

#sub-agents #agentic-ai #personalization
📄
ArXiv AI • 61d ago

Lang2Act Boosts VLM Visual Reasoning with Emergent Tools

Lang2Act enhances Vision-Language Models (VLMs) with self-emergent linguistic toolchains for fine-grained visual perception in visual retrieval-augmented generation (VRAG), avoiding rigid external tools and the information loss caused by image operations. It uses a two-stage RL framework: the first stage builds a reusable action toolbox, and the second exploits it for reasoning. It reports performance gains of more than 4%; the code is available on GitHub.

#visual-reasoning
📄
ArXiv AI • 61d ago

Geometric Taxonomy of LLM Hallucinations

Researchers propose a geometric taxonomy classifying LLM hallucinations into three types: unfaithfulness, confabulation, and factual error. Benchmark-generated hallucinations are detected well within a domain but not across domains, whereas human-crafted confabulations admit a single global detection direction. Factual errors remain undetectable via embeddings due to the limits of distributional encoding.

#research #llms #hallucinations
📄
ArXiv AI • 61d ago

Dual-Cycle Framework for Safe Role-Playing LLMs

A training-free Dual-Cycle Adversarial Self-Evolution framework addresses jailbreak vulnerabilities in LLM role-playing agents. It couples a Persona-Targeted Attacker cycle for stronger jailbreaks with a Role-Playing Defender cycle that distills failures into a hierarchical safety knowledge base. At inference, it retrieves structured knowledge to ensure in-character yet safe responses, outperforming baselines in fidelity and resistance.

#jailbreak-resistance #role-playing #self-evolution
📄
ArXiv AI • 61d ago

DPBench Reveals LLM Coordination Failures

DPBench introduces a benchmark for LLM multi-agent coordination using the Dining Philosophers problem across eight conditions varying timing, group size, and communication. Tests on GPT-5.2, Claude Opus 4.5, and Grok 4.1 show strong sequential performance but >95% deadlock in simultaneous settings due to convergent reasoning. Communication fails to help and may worsen deadlocks; open-sourced on GitHub.
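
The deadlock mechanism can be shown with a toy model, which is not DPBench's harness: if every agent reasons its way to the same first move in a simultaneous round, each ends up holding one fork and waiting for its neighbour's. The function names, the two policies, and the tie-breaking rule (lower-indexed philosopher wins a contested fork) are assumptions made for this sketch.

```python
def simultaneous_round(n, policy):
    """All n philosophers reach for one fork at the same time.

    policy(i, n) returns the fork philosopher i grabs first.
    Philosopher i sits between fork i (left) and fork (i + 1) % n (right).
    Returns a dict mapping each held fork to the philosopher holding it.
    """
    held = {}
    for i in range(n):
        fork = policy(i, n)
        # Assumed tie-break: a contested fork goes to the lower index.
        held.setdefault(fork, i)
    return held


def is_deadlocked(n, policy):
    """True when every fork is held and each philosopher holds exactly one.

    In that state everyone waits for a second fork that a neighbour holds.
    """
    held = simultaneous_round(n, policy)
    return len(held) == n and len(set(held.values())) == n


def left_first(i, n):
    """Convergent policy: everyone grabs the left fork."""
    return i


def lowest_first(i, n):
    """Asymmetric policy: grab the lower-numbered of the two forks."""
    return min(i, (i + 1) % n)


convergent = is_deadlocked(5, left_first)
asymmetric = is_deadlocked(5, lowest_first)
```

With `left_first`, all five forks are taken by five different philosophers, which is the classic deadlock. With `lowest_first`, the last philosopher contests fork 0 and loses, leaving one fork free so a neighbour can proceed.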

#multi-agent #dining-philosophers #deadlock
📄
ArXiv AI • 61d ago

BotzoneBench: Scalable LLM Game Eval

BotzoneBench offers a scalable framework for evaluating LLMs' strategic reasoning in games using fixed hierarchies of skill-calibrated game AIs. It avoids quadratic costs and instability of LLM-vs-LLM tournaments by providing absolute, linear-time measurements. Across eight diverse games, it assessed 177,047 state-action pairs from five flagship models, highlighting performance gaps.
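
The cost argument can be made concrete with a short calculation. This sketch only counts pairings; the anchor count of 4 is a hypothetical value, not a figure from the paper.

```python
def round_robin_pairings(num_models: int) -> int:
    """Pairings needed when every model plays every other model once."""
    return num_models * (num_models - 1) // 2


def anchored_pairings(num_models: int, num_anchors: int) -> int:
    """Pairings needed when each model plays only a fixed set of anchor AIs."""
    return num_models * num_anchors


NUM_ANCHORS = 4  # hypothetical ladder size, chosen for illustration

for n in (5, 20, 80):
    print(n, round_robin_pairings(n), anchored_pairings(n, NUM_ANCHORS))
```

For a handful of models a round-robin can actually be cheaper; the anchored design pays off as the pool grows, and adding one more model costs a fixed number of pairings instead of one per existing model.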

#research #botzonebench #llm
📄
ArXiv AI • 61d ago

BotzoneBench: Scalable LLM Game Eval Benchmark

BotzoneBench introduces a scalable framework for evaluating LLMs' strategic reasoning in interactive games using fixed hierarchies of skill-calibrated game AIs. It assesses five flagship models across eight diverse games via 177,047 state-action pairs, revealing performance gaps and behaviors comparable to mid-tier game AIs. This enables linear-time absolute measurements with stable interpretability, unlike volatile LLM-vs-LLM rankings.

#research #botzonebench #llm
📄
ArXiv AI • 61d ago

AST-PAC Enhances Code MIA with AST Guidance

Researchers introduce AST-PAC, a syntax-aware adaptation of PAC for membership inference attacks on code LLMs. It uses AST-based perturbations to create valid calibration samples, outperforming baselines on larger files but facing limits on small or alphanumeric-rich code. The work calls for syntax-adaptive methods to audit code model training data.
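
As an illustration of what a syntax-preserving perturbation can look like, the sketch below renames one identifier through Python's standard `ast` module, so the perturbed sample still parses. The summary does not say which perturbations AST-PAC uses, so identifier renaming is an assumed example, and `RenameVariable` and `perturb` are hypothetical names. `ast.unparse` requires Python 3.9 or later.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename every occurrence of one identifier in the tree."""

    def __init__(self, old: str, new: str):
        self.old = old
        self.new = new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node


def perturb(source: str, old: str, new: str) -> str:
    """Return source with one identifier renamed, regenerated from the AST."""
    tree = ast.parse(source)
    tree = RenameVariable(old, new).visit(tree)
    return ast.unparse(tree)


original = "total = 0\nfor x in items:\n    total += x\n"
perturbed = perturb(original, "total", "acc")

# The perturbed sample is still valid Python; this raises SyntaxError otherwise.
ast.parse(perturbed)
```

A token-level perturbation, by contrast, could swap or drop characters and yield code that no longer parses, which is the problem a syntax-aware variant is meant to avoid.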

#membership-inference #ast-perturbations #code-provenance
📄
ArXiv AI • 61d ago

AMOR: Entropy-Gated SSM-Attention Hybrid

AMOR is a hybrid model inspired by dual-process theories of cognition that activates sparse attention only when the SSM's predictions have high entropy. It projects Ghost KV from SSM states for O(n) efficiency, outperforming SSM-only and Transformer baselines on retrieval tasks with perfect accuracy while using attention at just 22% of positions. Prediction entropy reliably detects retrieval needs with a 1.09 nats gap.
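
The gating idea can be sketched in a few lines. This is a rough illustration, not the paper's implementation: it computes the Shannon entropy of each predicted distribution in nats and flags the positions that exceed a threshold. The function names and the 1.0-nat threshold are assumptions chosen for the example.

```python
import math


def entropy_nats(probs):
    """Shannon entropy of a probability distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attention_gate(distributions, threshold=1.0):
    """Return the positions whose prediction entropy exceeds the threshold.

    The threshold is a hypothetical value, not taken from the paper.
    """
    return [
        i for i, probs in enumerate(distributions)
        if entropy_nats(probs) > threshold
    ]


# Toy predictions: two confident positions and one near-uniform position.
predictions = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
]
gated = attention_gate(predictions)
```

Only the uniform position is flagged here, since its entropy is ln 4 (about 1.39 nats); in a hybrid model, positions like that would be the ones routed to attention.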

#research #amor #architecture
📄
ArXiv AI • 61d ago

Agentic AI Improves Insurance Underwriting Safety

A new agentic AI system for commercial insurance underwriting uses adversarial self-critique to challenge decisions and enhance reliability. It reduces hallucinations from 11.3% to 3.8% and boosts accuracy to 96% on 500 expert cases. Human oversight ensures accountability in regulated environments.

#research #agentic-ai #self-critique
📄
ArXiv AI • 61d ago

Adversarial Self-Critique for Safer AI Underwriting

A new agentic AI system for commercial insurance underwriting uses adversarial self-critique, in which a critic agent challenges primary decisions before human review. It reduces hallucinations from 11.3% to 3.8% and boosts accuracy from 92% to 96% on 500 expert cases. The human-in-the-loop design ensures oversight in regulated environments.

#research #agentic-ai #self-critique
🗾
ITmedia AI+ (Japan) • 61d ago

Google's Dev Knowledge API for GenAI Docs

Google announced a public preview of its Developer Knowledge API and MCP Server. It enables generative AI to retrieve and reference official documentation for Google Cloud, Android, and Firebase. It supports the Model Context Protocol for integration with AI tools.

#rag #developer-tools #protocol
🗾
ITmedia AI+ (Japan) • 61d ago

Fujitsu Takane Automates Dev 100x Faster

Fujitsu launched its AI-Driven Software Development Platform, built on its proprietary LLM, Takane. It automates the entire software development process. A proof of concept showed 100x productivity gains in some cases.

#automation #productivity #japan-llm
🔥
36Kr • 61d ago

Doubao AI Logs 1.9B Interactions on New Year's Eve

ByteDance's Doubao AI assistant, a partner of the CCTV Spring Festival Gala on Feb 16, logged 1.9 billion interactions. The 'Doubao New Year' campaign generated over 50 million festive avatars and 100 million blessing messages.

#spring-gala #user-engagement #avatar-gen
💰
TMTPost • 61d ago

AI to Replace White-Collar Jobs Soon

Microsoft AI head warns most white-collar jobs will be fully replaced by AI in 12-18 months. Globally, 55,000 positions were lost to AI by 2025. Traditional office social contracts face collapse.

#microsoft #ai #employment
🗾
ITmedia AI+ (Japan) • 61d ago

AI in Design/Analysis: Survey on Reality & Challenges

The MONOist/TechFactory editorial team surveyed AI utilization in design and analysis tasks from October 7 to November 3, 2025, receiving 406 valid responses. The resulting report details the current realities and challenges of AI use in these engineering workflows.

#engineering-survey #ai-adoption #design-challenges
🔥
36Kr • 61d ago

130M Users Debut Qwen AI Shopping at Spring Fest

Qwen data reveals over 130M first-time AI shopping users during Spring Festival, with 5B 'Qwen help me' queries. Ticket orders rose 22x, travel orders 7x, and movie tickets 372x, reaching 782x in tier 3-4 cities. Half of AI orders came from county-level areas, and nearly 400K users over 60 participated.

#ai-shopping #voice-assistant #user-growth
💰
TMTPost • 61d ago

30 Years of Robot Evolution at the Spring Gala

Over 30 years, the Chinese Spring Festival Gala has gone from comedian Cai Ming impersonating a robot to real robots performing alongside her. This progression is likened to a multi-decade Turing test. It reflects capital's shift from virtual hype to tangible robotics development.

#spring-gala #robotics #evolution