All Updates
February 13, 2026
OpenAI Launches Speedy Codex-Spark Model
OpenAI released GPT-5.3-Codex-Spark, a lightweight version of its Codex agentic coding tool. The slimmed-down model prioritizes extreme inference speed for rapid-iteration scenarios. It follows the full Codex model released earlier this month.
Voxtral Realtime Streaming ASR
Voxtral Realtime achieves Whisper-quality transcription at 480 ms latency via end-to-end streaming training. It features a causal audio encoder and Ada RMS-Norm. Pretrained on 13 languages; model weights released under Apache 2.0.
V2G Transforms CAD into Auditable Graphs
V2G pipeline converts CAD diagrams into property graphs, capturing component topology and connectivity overlooked by pixel-based MLLMs. It delivers major accuracy gains on electrical schematic compliance benchmarks where top MLLMs fail. Benchmark and code released on GitHub for further research.
TSR Boosts Multi-Turn RL for LLM Agents
TSR introduces trajectory-search rollouts to enhance multi-turn reinforcement learning for LLM agents. It uses lightweight tree-style search for high-quality trajectories, improving rollout generation and stabilizing training. Achieves up to 15% performance gains on tasks like Sokoban and WebShop.
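The paper's exact expansion and pruning rules aren't given in this digest; a minimal sketch of what a lightweight tree-style rollout can look like, with a toy environment and a hypothetical `score_fn` standing in for TSR's trajectory scoring:

```python
import random

def tree_search_rollout(step_fn, score_fn, init_state, k=3, depth=4, seed=0):
    """Toy tree-style rollout: at each step, sample k candidate actions
    and keep only the highest-scoring branch (beam width 1).
    step_fn/score_fn are illustrative assumptions, not TSR's actual API."""
    rng = random.Random(seed)
    state, trajectory = init_state, []
    for _ in range(depth):
        candidates = [step_fn(state, rng) for _ in range(k)]
        action, next_state = max(candidates, key=lambda c: score_fn(c[1]))
        trajectory.append(action)
        state = next_state
    return trajectory, state

# Toy environment: actions add random increments; score is the state value.
def step(state, rng):
    a = rng.choice([1, 2, 3])
    return a, state + a

best_traj, final = tree_search_rollout(step, lambda s: s, init_state=0)
```

Keeping only high-scoring branches is what lets the rollout buffer fill with higher-quality trajectories than naive independent sampling.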
TRACER Aggregates Risks in Agent Trajectories
TRACER is a trajectory-level uncertainty metric for tool-using agents, combining surprisal, repetition, and coherence signals with tail-focused aggregation. Improves AUROC by 37% and AUARC by 55% on tau^2-bench for failure prediction. Code and benchmark on GitHub.
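TRACER's precise estimator isn't reproduced here; a minimal sketch of tail-focused aggregation (CVaR-style), with the tail fraction and the per-step surprisal values as illustrative assumptions:

```python
import math

def tail_aggregate(step_scores, tail_frac=0.25):
    """Tail-focused aggregation: average only the worst (highest-risk)
    fraction of per-step scores instead of the plain mean, so a few
    bad steps dominate the trajectory-level signal."""
    n = max(1, math.ceil(len(step_scores) * tail_frac))
    worst = sorted(step_scores, reverse=True)[:n]
    return sum(worst) / n

# Per-step surprisal for a mostly-fine trajectory with one risky step:
scores = [0.2, 0.3, 0.25, 2.5, 0.3, 0.2, 0.35, 0.3]
mean_score = sum(scores) / len(scores)   # 0.55 — the spike is diluted
tail_score = tail_aggregate(scores)      # 1.425 — the spike dominates
```

Averaging over all steps would wash out the single risky step; the tail aggregate keeps it visible, which is the intuition behind failure prediction at the trajectory level.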
ThinkRouter Boosts Reasoning Efficiency
ThinkRouter introduces confidence-aware routing between latent and discrete spaces for efficient AI reasoning. It switches to discrete tokens during low-confidence steps to reduce noise from latent embeddings. Experiments show major accuracy gains on STEM and coding tasks while shortening outputs.
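The routing rule itself can be sketched in a few lines; the threshold and the source of the confidence score are illustrative assumptions, not the paper's exact criterion:

```python
def route_step(latent_confidence, threshold=0.7):
    """Confidence-aware routing sketch: keep reasoning in latent space
    while confidence is high; fall back to discrete tokens on
    low-confidence steps to avoid compounding noise from latent
    embeddings."""
    return "latent" if latent_confidence >= threshold else "discrete"

confidences = [0.95, 0.88, 0.41, 0.92]
modes = [route_step(c) for c in confidences]
# the low-confidence third step is routed to discrete tokens
```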
Text2GQL-Bench: New Graph Query Benchmark
Text2GQL-Bench introduces a unified benchmark for Text-to-Graph-Query-Language systems with 178,184 question-query pairs across 13 domains and multiple GQLs. It features a scalable dataset generation framework and a multi-metric evaluation including grammatical validity, similarity, semantic alignment, and execution accuracy. Evaluations show LLMs struggle with ISO-GQL, achieving only 4% zero-shot execution accuracy, improving to 50% with 3-shot prompting and 45.1% with fine-tuning.
Surveying Multi-Agent Communication Paradigms
This survey frames multi-agent communication via the Five Ws, tracing evolution from MARL's hand-designed protocols to emergent language and LLM-based systems. It highlights trade-offs in interpretability, scalability, and generalization across paradigms. Practical design patterns and open challenges are distilled for hybrid systems.
SemaPop: Semantic Population Synthesis
SemaPop uses LLMs for semantically conditioned population synthesis, deriving personas from survey data. It integrates with WGAN-GP for statistical alignment and behavioral realism, achieving closer marginal and joint distribution matches while preserving diversity.
scPilot Automates Single-Cell Analysis
scPilot enables LLMs to reason over single-cell RNA-seq data using natural language and on-demand tools for annotation, trajectories, and TF targeting. Paired with the scBench benchmark, it shows gains such as an 11% accuracy lift from iterative reasoning. Transparent reasoning traces explain the biological insights.
SCF-RKL Advances Model Merging
SCF-RKL introduces sparse, distribution-aware model merging using reverse KL divergence to minimize interference. It selectively fuses complementary parameters, preserving stable representations and integrating new capabilities. Evaluations on 24 benchmarks show superior performance in reasoning, instruction following, and safety.
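To make the choice of divergence concrete, a toy comparison of forward vs. reverse KL on discrete distributions (the distributions are made-up illustrations, not the paper's setup):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Reverse KL, KL(q || p), heavily penalizes the candidate q placing
# mass where the base p has little — the mode-seeking behavior that
# motivates using it to minimize interference with stable representations.
p = [0.70, 0.25, 0.05]   # base model's output distribution (toy)
q = [0.40, 0.40, 0.20]   # candidate merged distribution (toy)
forward = kl(p, q)       # KL(p || q) ≈ 0.205
reverse = kl(q, p)       # KL(q || p) ≈ 0.241 — the 0.20 mass on p's
                         # low-probability mode costs the most here
```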
Quark Medical Alignment Paradigm Launched
Quark Medical Alignment introduces a holistic multi-dimensional paradigm for aligning large language models in high-stakes medical question answering. It decomposes objectives into four categories with closed-loop optimization using observable metrics, diagnosis, and rewards. A unified mechanism with Reference-Frozen Normalization and Tri-Factor Adaptive Dynamic Weighting resolves scale mismatches and optimization conflicts.
PhyNiKCE Boosts Autonomous CFD Reliability
PhyNiKCE introduces a neurosymbolic agentic framework to overcome LLM limitations in Computational Fluid Dynamics (CFD) simulations. It decouples neural planning from symbolic validation using a Constraint Satisfaction Problem approach to enforce physical laws. Validated on OpenFOAM tasks, it achieves 96% improvement over baselines while cutting self-correction loops by 59% and token use by 17%.
PBSAI Multi-Agent AI Governance
PBSAI provides a reference architecture for securing enterprise AI estates with multi-agent systems. It organizes 12 domains via agent families, context envelopes, and output contracts, and aligns with the NIST AI RMF for SOC and hyperscale defense.
NMIPS: Neuro-Symbolic PDE Solver
NMIPS introduces a unified neuro-symbolic framework for solving PDE families with shared structures but varying parameters. It discovers interpretable analytical solutions via multifactorial optimization and affine transfer for efficiency. Experiments show up to 35.7% accuracy gains over baselines.
Measuring LLM Agent Behavioral Consistency
The study finds LLM agents (Llama, GPT, Claude) produce 2-4 unique action paths per 10 runs on HotpotQA, with inconsistency predicting failure. Consistent runs reach 80-92% accuracy versus 25-60% for inconsistent ones. Variance traces to early decisions such as the first search query.
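The consistency measurement itself is simple to reproduce; a sketch that counts distinct action paths across repeated runs (the run data and helper name are illustrative, not the study's protocol):

```python
from collections import Counter

def path_consistency(runs):
    """Treat each run as its sequence of actions; count distinct paths
    and the share of runs following the modal (most common) path."""
    counts = Counter(tuple(run) for run in runs)
    modal_share = counts.most_common(1)[0][1] / len(runs)
    return len(counts), modal_share

runs = [
    ["search A", "read", "answer"],
    ["search A", "read", "answer"],
    ["search B", "read", "answer"],  # diverges at the first search query
    ["search A", "read", "answer"],
]
unique_paths, modal_share = path_consistency(runs)  # 2 paths, 75% modal
```

A low modal share flags exactly the early-decision variance the study ties to failure.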
MaxExp Optimizes Multispecies Predictions
MaxExp is a decision-driven framework for binarizing probabilistic species distribution models into presence-absence maps by maximizing evaluation metrics. It requires no calibration data and outperforms thresholding methods, especially under class imbalance. SSE provides a simpler alternative using expected species richness.
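How a threshold can be chosen without calibration data: treat each predicted probability as the chance of true presence and sweep thresholds maximizing the expected metric. This is a generic expected-F1 sketch, not MaxExp's exact estimator:

```python
def expected_f1(probs, threshold):
    """Expected F1 from predicted probabilities alone (no labels):
    each probability is the chance the site is truly a presence."""
    pred = [p for p in probs if p >= threshold]   # predicted presence
    rest = [p for p in probs if p < threshold]    # predicted absence
    tp = sum(pred)                 # expected true positives
    fp = sum(1 - p for p in pred)  # expected false positives
    fn = sum(rest)                 # expected false negatives
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs):
    """Decision-driven binarization sketch: pick the threshold that
    maximizes the expected metric over the observed probabilities."""
    return max(sorted(set(probs)), key=lambda t: expected_f1(probs, t))

probs = [0.9, 0.8, 0.6, 0.3, 0.1, 0.05]  # toy SDM outputs for six sites
t = best_threshold(probs)                # 0.6 maximizes expected F1 here
```

Because the objective is the metric itself, the chosen cutoff adapts to class imbalance instead of defaulting to 0.5.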
MathSpatial Exposes MLLMs' Spatial Reasoning Gap
MLLMs excel at perception but fail at mathematical spatial reasoning, scoring under 60% on tasks humans solve at 95% accuracy. MathSpatial introduces a framework with MathSpatial-Bench (2K problems), MathSpatial-Corpus (8K training examples), and MathSpatial-SRT for structured reasoning. Fine-tuning Qwen2.5-VL-7B achieves strong results with 25% fewer output tokens.