All Updates
February 20, 2026
IndicJR: Judge-Free Indic Jailbreak Benchmark
IndicJR introduces a judge-free benchmark evaluating jailbreak robustness in 12 South Asian languages with 45,216 prompts across JSON and Free tracks. It finds that contracts boost refusals but fail against jailbreaks, that English attacks transfer effectively to Indic languages, and that orthographic variation such as romanization weakens defenses. The benchmark provides a reproducible multilingual stress test for LLM safety.
GUI-Owl-1.5 Tops 20+ GUI Benchmarks
GUI-Owl-1.5 introduces native GUI agent models in multiple sizes (2B-235B) that support desktop, mobile, and browser platforms for cloud-edge collaboration. It sets SOTA on 20+ benchmarks, including 56.5 on OSWorld, 71.6 on AndroidWorld, and 80.3 on ScreenSpotPro. The models are open-sourced and include innovations in the data flywheel, agent reasoning, and multi-platform RL.
GAP: Text Safety Fails for LLM Agent Tools
Researchers introduce the GAP benchmark to evaluate divergence between text-level and tool-call safety in LLM agents. Testing six frontier models across six domains reveals text refusals do not prevent harmful tool calls, with 219 persistent cases even under safety prompts. The study urges dedicated tool-call safety measures beyond text evaluations.
Contextuality Inevitable in Single-State AI
Adaptive systems reuse fixed internal states across contexts due to resource limits, leading to inevitable contextuality in classical probabilistic models. The paper proves an irreducible information-theoretic cost for reproducing contextual statistics. Nonclassical frameworks avoid this without quantum mechanics by lacking a global joint probability space.
AIdentifyAGE Ontology Standardizes Forensic Dental AI
The AIdentifyAGE ontology provides a standardized framework for forensic dental age assessment, supporting manual and AI-assisted workflows. It integrates clinical, forensic, and legal data with radiographic imaging and ML methods for interoperability and transparency. Developed with experts, it builds on biomedical ontologies and adheres to FAIR principles.
AI Improves 50-Year Hypercube Slicing Bounds
Researchers prove S(n) ≤ ⌈4n/5⌉ for hypercube edge slicing, beating 1971's ⌈5n/6⌉ bound. They used CPro1, an LLM-powered tool, to construct 8 hyperplanes slicing Q_{10}. New lower bounds on edges sliced by k<n hyperplanes are also established.
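To make the size of the improvement concrete, the short sketch below evaluates both upper bounds for a few dimensions. It only computes the two formulas quoted above; the function names are illustrative and not taken from the paper.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b without floating-point rounding."""
    return -(-a // b)


def bound_1971(n: int) -> int:
    """Older upper bound on S(n): ceil(5n/6)."""
    return ceil_div(5 * n, 6)


def bound_new(n: int) -> int:
    """Newly reported upper bound on S(n): ceil(4n/5)."""
    return ceil_div(4 * n, 5)


# Compare the two bounds for a few hypercube dimensions.
for n in (10, 30, 60, 100):
    print(f"n={n}: old bound {bound_1971(n)}, new bound {bound_new(n)}")
```

For n = 10 the new formula gives 8, matching the 8 hyperplanes reported for Q_{10}, while the 1971 formula gives 9.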
Study: AI Benchmarks Saturate Quickly
A systematic arXiv study analyzes saturation across 60 LLM benchmarks from major developers. Nearly half show saturation, which worsens with benchmark age, and hiding test data offers no protection. Expert-curated benchmarks resist saturation better than crowdsourced ones.
AgentLAB Benchmarks LLM Agents on Long-Horizon Attacks
AgentLAB is the first benchmark evaluating LLM agents' vulnerability to adaptive long-horizon attacks via multi-turn interactions. It features five attack types—intent hijacking, tool chaining, task injection, objective drifting, memory poisoning—across 28 environments and 644 test cases. Evaluations reveal high susceptibility in agents, with single-turn defenses failing to mitigate threats.
Product Sense Beats Coding in Vibe Coding Era
In the Vibe Coding era powered by tools like Claude Code, product sense outweighs traditional coding skills as non-programmers build full AI agents via conversation. Demos externalize ideas, build trust, and lower the barrier from concept to product. The six core techniques include building on existing GitHub projects, problem-driven AI queries, and modular, progressive development.
Altman: Superintelligence to Top CEOs by 2028
OpenAI CEO Sam Altman predicts an early version of true superintelligence within just a few years. By the end of 2028, he says, more of the world's intelligence resources will sit inside data centers than outside them. He expects superintelligence to outperform top company CEOs, including himself, and leading scientists.
Memory Giants Ramp Factories for AI Demand
Micron, Samsung, and SK Hynix are massively expanding fabs to meet AI-driven memory needs. Micron's $200B plan features a huge Boise campus with a capacity of 150,000-200,000 wafers per month (WPM). Priority for HBM and AI modules means ongoing consumer shortages.
China Bans 'Whole Net Lowest Price' Claims
China's SAMR released anti-monopoly guidelines for internet platforms, targeting practices like 'whole network lowest price' (lowest price across the entire internet) as monopoly risks. The guidelines prohibit algorithm-driven price coordination, big-data 'kill-the-familiar' pricing (price discrimination against existing customers), blocking competitors, and predatory below-cost sales. This shifts antitrust enforcement from case-by-case action to comprehensive rules covering all platform scenarios.
Blue-Collar Stock Tops Nvidia 5-Year Gains
An unnamed blue-collar stock has outperformed Nvidia over the past five years and is said to retain significant upside potential. 'Hard hat' operations, including data centers, are profiting from the AI boom.
Alibaba Qianwen Hits 130M Orders in Festival
Jefferies reports that Alibaba's AI app Qianwen generated over 130 million orders during Spring Festival promotions, with user trust rising. About half came from county-level markets for items like milk tea, movie tickets, and daily goods; 4 million users aged 60+ used AI for transactions for the first time. Tencent's Yuanbao reached 50M daily active users and 3.6B lottery draws.
EvoMap Launches AI Agent DNA Protocol
EvoMap is an open A2A gateway protocol that enables AI agents to inherit, share, and evolve capabilities like genes via standardized Capsules. Originating from OpenClaw plugin issues and acquisition concerns, it allows easy integration of agents from platforms like OpenClaw for skill publishing and task delegation. The GEP protocol encapsulates successful strategies as verifiable assets for network-wide evolution.
China AI Startups' Shares Surge Post-Holiday
Shares of China’s generative AI startups Zhipu and MiniMax soared in Hong Kong after the Lunar New Year holiday. Investors rotated into pure AI plays from traditional internet giants as the market reopened.
Don't Be Fooled by China's 'Hundred Models War'
The article cautions against hype surrounding China's 'hundred models war' in AI. It argues that competition is evolving from a single dimension into two parallel development paths. While outcomes are undecided, the strategic direction has become clear.
Top Linux Distros for Home Lab Servers
The author recommends four favorite Linux server distros for home labs, suited to bare-metal servers or virtual machines, with an emphasis on rock-solid reliability for stable setups.
Nvidia Nears $30B OpenAI Investment
Nvidia is reportedly finalizing a $30 billion investment in OpenAI, replacing a $100 billion long-term commitment from last year between the two companies. This investment forms part of OpenAI's latest funding round.
Meta Slashes Employee Equity 5% for AI
Mark Zuckerberg is cutting costs to free up funds for massive AI expenditures, resulting in a 5% reduction in equity awards for most Meta employees. The move prioritizes AI investment amid rising compute demands.