All Updates
February 17, 2026
PlotChain Benchmark for MLLM Plot Reading
PlotChain introduces a deterministic benchmark for evaluating multimodal LLMs on extracting quantitative values from engineering plots like Bode and FFT. It features 450 plots across 15 families with ground truth and checkpoint diagnostics for failure analysis. Top models score ~80% (Gemini 2.5 Pro leads), but frequency tasks remain weak.
NL2LOGIC: 99% Accurate NL-to-FOL Translation
NL2LOGIC is a new framework using abstract syntax trees (AST) to translate natural language into first-order logic via large language models. It combines a recursive LLM semantic parser with an AST-guided generator for high syntactic accuracy and semantic faithfulness. Benchmarks show 99% syntactic accuracy, up to 30% semantic gains, and 31% reasoning improvement when integrated with Logic-LM.
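To illustrate why generating from an AST helps syntactic accuracy, here is a minimal sketch, not NL2LOGIC's actual code. The node classes (`Pred`, `Not`, `BinOp`, `Quant`) and the `render` function are hypothetical names chosen for this example. Because the formula string is produced only by walking a typed tree, parentheses and quantifier scopes are well-formed by construction.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Pred:
    name: str
    args: Tuple[str, ...]


@dataclass(frozen=True)
class Not:
    body: "Formula"


@dataclass(frozen=True)
class BinOp:
    op: str  # one of "∧", "∨", "→"
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Quant:
    quantifier: str  # "∀" or "∃"
    var: str
    body: "Formula"


Formula = Union[Pred, Not, BinOp, Quant]


def render(f: Formula) -> str:
    """Serialize a first-order logic AST to a string (hypothetical helper)."""
    if isinstance(f, Pred):
        return f"{f.name}({', '.join(f.args)})"
    if isinstance(f, Not):
        return f"¬{render(f.body)}"
    if isinstance(f, BinOp):
        # Binary connectives are always parenthesized, so nesting is unambiguous.
        return f"({render(f.left)} {f.op} {render(f.right)})"
    if isinstance(f, Quant):
        return f"{f.quantifier}{f.var} {render(f.body)}"
    raise TypeError(f"not a formula node: {f!r}")


# "Every dog barks", built by hand here; in the framework described above,
# an LLM semantic parser would be responsible for producing the tree.
every_dog_barks = Quant(
    "∀", "x", BinOp("→", Pred("Dog", ("x",)), Pred("Barks", ("x",)))
)
print(render(every_dog_barks))  # ∀x (Dog(x) → Barks(x))
```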
MAPLE: Sub-Agent Design for AI Personalization
MAPLE addresses LLM agent limitations by separating memory, learning, and personalization into dedicated sub-agents. The Memory sub-agent manages storage and retrieval, Learning extracts insights asynchronously, and Personalization applies them in real time. It boosts personalization scores by 14.6% and raises trait incorporation from 45% to 75% on the MAPLE-Personas benchmark.
Lang2Act Boosts VLM Visual Reasoning with Emergent Tools
Lang2Act enhances Vision-Language Models (VLMs) via self-emergent linguistic toolchains for fine-grained visual perception in VRAG, avoiding rigid external tools and the information loss caused by image operations. It employs a two-stage RL framework: the first stage builds a reusable action toolbox, and the second exploits it for reasoning. It achieves performance gains of more than 4%; the code is available on GitHub.
Geometric Taxonomy of LLM Hallucinations
Researchers propose a geometric taxonomy classifying LLM hallucinations into three types: unfaithfulness, confabulation, and factual error. Detection of benchmark hallucinations is strong within a domain but fails across domains, while human-crafted confabulations admit a single global detection direction. Factual errors remain undetectable via embeddings due to distributional encoding limits.
Dual-Cycle Framework for Safe Role-Playing LLMs
A training-free Dual-Cycle Adversarial Self-Evolution framework addresses jailbreak vulnerabilities in LLM role-playing agents. It couples a Persona-Targeted Attacker cycle, which generates stronger jailbreaks, with a Role-Playing Defender cycle that distills failures into a hierarchical safety knowledge base. At inference, it retrieves structured knowledge to keep responses in character yet safe, outperforming baselines in role fidelity and jailbreak resistance.
DPBench Reveals LLM Coordination Failures
DPBench introduces a benchmark for LLM multi-agent coordination using the Dining Philosophers problem across eight conditions varying timing, group size, and communication. Tests on GPT-5.2, Claude Opus 4.5, and Grok 4.1 show strong sequential performance but >95% deadlock in simultaneous settings due to convergent reasoning. Communication fails to help and may worsen deadlocks; open-sourced on GitHub.
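The deadlock mechanism can be illustrated with a minimal sketch of one simultaneous round of Dining Philosophers. This is not DPBench's code: the function names, the tie-breaking rule, and the fork numbering are assumptions made for the example. It shows why agents that converge on the same reasoning (everyone grabs the same side first) deadlock, while a single deviating agent breaks the cycle.

```python
def simultaneous_round(n, choices):
    """Resolve one simultaneous fork grab.

    choices[i] is "left" or "right": the fork philosopher i reaches for first.
    Assumed layout: fork i lies to the left of philosopher i and
    fork (i + 1) % n lies to the right.
    Returns a dict mapping each grabbed fork to the philosopher holding it.
    """
    contenders = {}
    for i, choice in enumerate(choices):
        fork = i if choice == "left" else (i + 1) % n
        contenders.setdefault(fork, []).append(i)
    # Assumed tie-break: a contested fork goes to the lowest-indexed philosopher.
    return {fork: min(ps) for fork, ps in contenders.items()}


def is_deadlocked(n, choices):
    """True if every fork is held and each philosopher holds exactly one."""
    held = simultaneous_round(n, choices)
    return len(held) == n and len(set(held.values())) == n


# Identical strategies: every philosopher holds one fork and waits forever.
print(is_deadlocked(5, ["left"] * 5))              # True
# One philosopher deviates: a fork stays free, so someone can eat.
print(is_deadlocked(5, ["left"] * 4 + ["right"]))  # False
```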
BotzoneBench: Scalable LLM Game Eval Benchmark
BotzoneBench introduces a scalable framework for evaluating LLMs' strategic reasoning in interactive games using fixed hierarchies of skill-calibrated game AIs. Anchoring to these fixed opponents yields absolute, linear-time measurements with stable interpretability, avoiding the quadratic cost and volatile rankings of LLM-vs-LLM tournaments. Across eight diverse games, it assesses five flagship models on 177,047 state-action pairs, revealing performance gaps and behaviors comparable to mid-tier game AIs.
AST-PAC Enhances Code MIA with AST Guidance
Researchers introduce AST-PAC, a syntax-aware adaptation of PAC for membership inference attacks on code LLMs. It uses AST-based perturbations to create valid calibration samples, outperforming baselines on larger files but facing limits on small or alphanumeric-rich code. The work calls for syntax-adaptive methods to audit code model training data.
AMOR: Entropy-Gated SSM-Attention Hybrid
AMOR is a hybrid model inspired by dual-process theories of cognition that activates sparse attention only when the SSM's predictions have high entropy, that is, when the model is uncertain. It projects Ghost KV from SSM states for O(n) efficiency and outperforms SSM-only and Transformer baselines on retrieval tasks, reaching perfect accuracy while attending at just 22% of positions. Prediction entropy reliably signals when retrieval is needed, with a gap of 1.09 nats.
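The entropy gate can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AMOR's implementation: the function names, the example logits, and the threshold of 1.0 nats are all hypothetical. It computes the Shannon entropy (in nats) of each position's predictive distribution and flags positions above the threshold as needing attention.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_nats(probs):
    """Shannon entropy using the natural log, so the unit is nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attention_gate(logits_per_position, threshold):
    """Return one boolean per position: True where attention should fire."""
    return [entropy_nats(softmax(l)) > threshold for l in logits_per_position]


# Hypothetical next-token logits from the SSM at two positions:
# position 0 is confident, position 1 is close to uniform.
logits = [
    [8.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.1, 0.0],
]
gate = attention_gate(logits, threshold=1.0)  # threshold is an assumed value
print(gate)  # [False, True]
```

Only the second position would be routed to attention here, which is the sparse, uncertainty-driven behavior the summary describes.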
Adversarial Self-Critique for Safer AI Underwriting
A new agentic AI system for commercial insurance underwriting uses adversarial self-critique, in which a critic agent challenges the primary agent's decisions before human review. It reduces hallucinations from 11.3% to 3.8% and boosts accuracy from 92% to 96% on 500 expert cases. The human-in-the-loop design ensures oversight and accountability in regulated environments.
Google's Dev Knowledge API for GenAI Docs
Google announced the public preview of its Developer Knowledge API and MCP Server. They let generative AI tools retrieve and reference official documentation for Google Cloud, Android, and Firebase. The server supports the Model Context Protocol for integration with AI tools.
Fujitsu Takane Automates Dev 100x Faster
Fujitsu launched its AI-Driven Software Development Platform, built on its proprietary LLM Takane. It automates the entire software development process. A proof of concept showed 100x productivity gains in some cases.
Doubao AI Logs 1.9B Interactions on New Year's Eve
ByteDance's Doubao AI assistant, partnering with CCTV Spring Gala on Feb 16, hit 1.9 billion interactions. The 'Doubao New Year' campaign generated over 50 million festive avatars and 100 million blessing messages.
AI to Replace White-Collar Jobs Soon
Microsoft AI head warns most white-collar jobs will be fully replaced by AI in 12-18 months. Globally, 55,000 positions were lost to AI by 2025. Traditional office social contracts face collapse.
AI in Design/Analysis: Survey on Reality & Challenges
The MONOist/TechFactory editorial department conducted a 2025 survey on AI utilization in design and analysis tasks from October 7 to November 3, receiving 406 valid responses. The report details the current realities and challenges faced in these engineering workflows. Full results are presented in a comprehensive report format.
130M Users Debut Qwen AI Shopping at Spring Fest
Qwen data reveals over 130M first-time AI shopping users during Spring Festival, along with 5B 'Qwen help me' queries. Ticket orders rose 22x, travel 7x, and movie tickets 372x to 782x from tier 3-4 cities. Half of AI orders came from counties, and nearly 400K users over 60 participated.
30 Years of Robot Evolution at the Spring Gala
The Chinese Spring Festival Gala has evolved from comedian Cai Ming impersonating a robot to actual robots accompanying her over 30 years. This progression is likened to a multi-decade Turing test. It reflects capital's shift from virtual hype to tangible robotics development.