All Updates

Page 270 of 933

April 10, 2026

⚛️
量子位 (QbitAI) • 25d ago

Claude Bug: Self-Instructs, Blames User

Anthropic's Claude AI has a severe bug where it generates its own instructions and falsely blames the user. This 'god bug' has exploded on Hacker News, with users calling it the worst bug they've seen. It raises alarms about LLM safety and reliability.

#bug #safety #prompt-injection
🐼
Pandaily • 25d ago

Denza D9 OTA Adds End-to-End AI Driving

BYD rolled out a major OTA update for its Denza D9 MPV. The update introduces an end-to-end AI driving model. It also enhances smart cockpit features for better user experience.

#autonomous-driving #ota-update #smart-cockpit
🐼
Pandaily • 25d ago

Zhiyuan Launches GO-2 Embodied AI Model

Zhiyuan Robotics unveiled the GO-2 embodied AI model. It improves execution stability in robotics tasks. The model combines structured action planning with real-time adaptive control.

#embodied-ai #robotics #action-planning
🐼
Pandaily • 25d ago

EngineAI Raises $200M Series B at $1.4B Valuation

Chinese robotics startup EngineAI raised $200 million in Series B funding. The round pushes its valuation above $1.4 billion. Funds will accelerate humanoid robot deployment across industries.

#funding #series-b #humanoid-robots
🇬🇧
The Guardian Technology • 25d ago

Reform Voters See Fewest Friend Posts

An IPPR study finds that, for Reform UK voters, only 13% of the content they see on social media comes from friends and family, versus 23% for Green voters. They see more brand and news content because of algorithms. The research, spanning Instagram, Facebook, X, Bluesky, and TikTok, shows that algorithms fuel isolation.

#social-media #user-engagement
🐼
Pandaily • 25d ago

HappyHorse-1.0 Tops AI Video Arena at 1383 Elo

Video generation model HappyHorse-1.0 ranks No. 1 on Artificial Analysis' AI Video Arena. It achieved an Elo score of 1383. Developer and technical details are undisclosed.
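For readers unfamiliar with arena ratings, here is a minimal sketch of how an Elo gap is usually read as a head-to-head win probability. It uses the standard Elo expected-score formula; whether the AI Video Arena computes its scores exactly this way is an assumption, and the rival rating of 1283 is hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (win probability, draws counted as half)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Hypothetical comparison: the reported 1383 rating against an assumed rival at 1283.
win_probability = elo_expected_score(1383, 1283)
print(f"Expected preference rate: {win_probability:.3f}")
```

Under this formula, only the rating difference matters, so a 100-point lead corresponds to being preferred in roughly 64% of pairwise comparisons.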

#video-gen #benchmark #leaderboard
📄
ArXiv AI • 25d ago

UILoop Paradigm for GUI Reasoning

The paper proposes the UI-in-the-Loop (UILoop) paradigm, which treats GUI reasoning as a cyclic Screen-UI-Action process and uses MLLMs to improve UI element understanding. It introduces a challenging UI Comprehension task with three metrics and a 26K-sample benchmark. The approach achieves SOTA results on UI understanding and GUI reasoning tasks.

#gui-reasoning #ui-benchmark #multimodal-agents
📄
ArXiv AI • 25d ago

TurboAgent Automates Turbomachinery Design

TurboAgent is an LLM-driven autonomous multi-agent framework that streamlines turbomachinery aerodynamic design from geometry generation to high-fidelity validation. Its performance predictions closely match CFD simulations (R² > 0.91, RMSE < 8%), and it improves efficiency by 1.61% and pressure ratio by 3.02%. The full workflow completes in about 30 minutes using parallel computing.

#multi-agent #turbomachinery #aerodynamic-design
📄
ArXiv AI • 25d ago

StepFlow Fixes LRM Reasoning Flows

Researchers introduce Step-Saliency to map attention-gradient scores along reasoning trajectories in large reasoning models (LRMs). It uncovers Shallow Lock-in and Deep Decay failures. StepFlow, a test-time intervention, boosts accuracy on math, science, and coding tasks without retraining.

#reasoning-models #saliency-maps #information-flow
📄
ArXiv AI • 25d ago

Steering Multimodal AI Hallucination Verifiability

Researchers built a dataset from 4,470 human responses to categorize MLLM hallucinations as obvious or elusive based on verifiability. They developed activation-space interventions using separate probes for each type, enabling precise control over hallucination detectability. Results show effective tuning for diverse security and usability needs.

#hallucinations #verifiability
📄
ArXiv AI • 25d ago

Riemann-Bench: AI Research Math Benchmark

Riemann-Bench introduces 25 expert-curated, private problems for evaluating AI on research-level mathematics beyond IMO olympiad tasks. Frontier models score below 10% even with tools and open reasoning. The benchmark uses double-blind verification and programmatic checks to ensure authenticity.

#math-benchmark #ai-reasoning #olympiad-math
📄
ArXiv AI • 25d ago

LLM Judges Misalign with Human Disinfo Views

A study audits eight frontier LLM judges against 2,043 human ratings of 290 disinformation articles. The LLMs are harsher than the human raters, only weakly recover human rankings, and prioritize logical rigor over emotional intensity. High inter-judge agreement fails as a proxy for human alignment.

#disinformation #alignment #evaluation
📄
ArXiv AI • 25d ago

ILASP Approximates NNs for Explainable Preferences

This paper proposes using ILASP, an Inductive Logic Programming tool, to approximate black-box neural networks in user preference learning over recipes. A new dataset is introduced for training NNs, with PCA preprocessing to handle high-dimensional features. Experiments evaluate ILASP as both global and local approximators, balancing fidelity and computation time.

#explainable-ai #preference-learning
🗾
ITmedia AI+ (Japan) • 25d ago

Google Launches Free AI Agent Guides

Google released five free guides on AI agents, covering everything from the basics to production deployment. Based on a training program run jointly with Kaggle, they offer practical knowledge for developers. The content ties directly to real-world implementation.

#ai-agents #tutorials #education
📄
ArXiv AI • 25d ago

FVD: Inference-Time Diffusion Alignment

FVD is a new inference-time alignment method for diffusion models that uses Fleming-Viot resampling to fix diversity collapse in SMC samplers. It employs a birth-death mechanism with reward-based survival and stochastic rebirth, avoiding value functions or rollouts. It improves ImageReward by 7% on DrawBench and FID by 14-20% on class-conditional tasks, and runs up to 66x faster.
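The birth-death idea can be illustrated with a toy Fleming-Viot-style resampling step. This is only a sketch under stated assumptions, not the paper's algorithm: the function name `fleming_viot_step`, the exponential survival rule, and the `temperature` parameter are all hypothetical, and the particles are plain labels rather than diffusion states.

```python
import math
import random


def fleming_viot_step(particles, rewards, temperature=1.0, rng=None):
    """One toy birth-death step: low-reward particles may die and are reborn as copies of survivors."""
    rng = rng or random.Random()
    best = max(rewards)
    # Assumed survival rule: probability exp((r - best) / temperature), so the
    # best particle always survives and the survivor pool is never empty.
    alive = [rng.random() < math.exp((r - best) / temperature) for r in rewards]
    survivors = [p for p, ok in zip(particles, alive) if ok]
    # Stochastic rebirth: each dead particle jumps to a uniformly chosen survivor.
    return [p if ok else rng.choice(survivors) for p, ok in zip(particles, alive)]


particles = ["a", "b", "c", "d"]
rewards = [0.1, 2.0, 0.3, 1.5]
resampled = fleming_viot_step(particles, rewards, temperature=0.5, rng=random.Random(0))
print(resampled)
```

Because each particle is killed independently instead of the whole population being redrawn from normalized weights, moderately rewarded particles can persist, which is the intuition behind preserving diversity. The step needs only per-particle rewards, with no value function or rollout.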

#diffusion-models #inference-alignment #fleming-viot
📄
ArXiv AI • 25d ago

EmoMAS: Emotion-Aware Edge Negotiation Framework

EmoMAS is a Bayesian multi-agent framework that enables small language models (SLMs) to handle high-stakes, edge-deployable negotiation by strategically managing emotions. It coordinates game-theoretic, RL, and psychological agents via a Bayesian orchestrator that fuses their insights and updates reliability in real time. The system outperforms baselines on new benchmarks in debt, healthcare, emergency response, and education domains.

#multi-agent #negotiation #bayesian
📄
ArXiv AI • 25d ago

CAFP: Fairness via Counterfactual Averaging

CAFP is a model-agnostic post-processing framework that ensures group fairness by averaging predictions from factual inputs and counterfactuals with flipped sensitive attributes. It requires no retraining or protected attribute access during training. Theoretical analysis shows it eliminates direct sensitive attribute dependence, reduces mutual information, and bounds prediction distortion.
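The averaging step described above can be sketched in a few lines. This is a minimal illustration, assuming a single binary sensitive attribute and the simplest possible counterfactual (flipping only that attribute); the names `counterfactual_average` and `toy_model` are hypothetical and not taken from the paper.

```python
from typing import Callable, Sequence


def counterfactual_average(
    predict: Callable[[Sequence[float]], float],
    x: Sequence[float],
    sensitive_index: int,
) -> float:
    """Average the score on x and on a copy of x with the binary sensitive attribute flipped."""
    counterfactual = list(x)
    counterfactual[sensitive_index] = 1 - counterfactual[sensitive_index]
    return 0.5 * (predict(x) + predict(counterfactual))


def toy_model(x: Sequence[float]) -> float:
    # Hypothetical biased scorer: feature 0 is legitimate, feature 1 is the sensitive attribute.
    return 0.5 * x[0] + 0.25 * x[1]


score_group_0 = counterfactual_average(toy_model, [0.5, 0], sensitive_index=1)
score_group_1 = counterfactual_average(toy_model, [0.5, 1], sensitive_index=1)
print(score_group_0, score_group_1)
```

Two inputs that differ only in the sensitive attribute receive the same averaged score, which is the sense in which direct dependence on that attribute is removed. The wrapped model is untouched, so no retraining is involved.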

#fairness #counterfactual #post-processing
📄
ArXiv AI • 25d ago

ATANT: AI Continuity Evaluation Framework

ATANT is an open framework for evaluating AI continuity, defining it via 7 properties and using a 10-checkpoint methodology that does not rely on LLMs. It features a 250-story corpus with 1,835 verification questions across 6 life domains. Evaluations reach 100% accuracy in cumulative modes across the 250 stories, and the framework is available on GitHub.

#evaluation-framework #ai-memory #continuity
🗾
ITmedia AI+ (Japan) • 25d ago

Anthropic Boosts Claude Skill Testing

Anthropic added evaluation and benchmark functions to its Claude agent skill-creator tool. Skill creators can now verify functionality and measure quality without writing code. This aims to prevent quality degradation in AI agent skills.

#ai-agents #testing-tools #benchmarks
📄
ArXiv AI • 25d ago

AgentGate: Lightweight Agent Routing Engine

AgentGate is a lightweight structured routing engine for the emerging Internet of Agents, tackling efficient dispatch under latency, privacy, and cost constraints. It decomposes routing into action decision (single-agent, multi-agent, etc.) and structural grounding stages. Compact 3B-7B models, fine-tuned with candidate-aware supervision, deliver competitive performance on a new benchmark.

#multi-agent #routing #fine-tuning