AI Updates Aggregator

🤖Reddit r/MachineLearning•Apr 10, 2026Stalecollected in 10h

cuBLAS 60% MatMul Bug on RTX 5090

Post LinkedIn

🤖Read original on Reddit r/MachineLearning

#matmul #gpu-kernel #tma #performance-bugcublascublas rtx-5090 cuda nvidia

💡cuBLAS bug tanks RTX 5090 MatMul by 60%—custom kernel fix beats it 170%!

⚡ 30-Second TL;DR

What Changed

Inefficient kernel affects batched FP32 from 256×256 to 8192×8192×16 on RTX non-Pro GPUs

Why It Matters

Severely impacts ML training/inference perf on consumer RTX GPUs, affecting many workloads. Workaround kernel provides immediate speedup until NVIDIA fixes.

What To Do Next

Implement the shared TMA double-buffer kernel for batched GEMM on RTX 5090.

Who should care:Developers & AI Engineers

Key Points

•Inefficient kernel affects batched FP32 from 256×256 to 8192×8192×16 on RTX non-Pro GPUs
•Custom kernel beats cuBLAS by 120-170% on RTX 5090
•Pro 6000 reaches 73% FMA, H200 82%; RTX gets suboptimal dispatch
•TMA double-buffer kernel achieves 80-120% of proper kernels

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

•The issue stems from a heuristic failure in the cuBLAS dispatch engine that incorrectly prioritizes legacy register-file-based tiling over the Blackwell architecture's Tensor Memory Accelerator (TMA) pipeline for specific batched FP32 workloads.
•NVIDIA's internal engineering response suggests the bug is limited to the 'consumer' firmware branch of the RTX 5090, as the same cuBLAS library version correctly dispatches optimized TMA kernels on workstation-class Blackwell silicon.
•Community-developed workarounds involve forcing a custom CUDA graph that bypasses the cuBLAS heuristic layer, effectively 'pinning' the execution to the high-throughput TMA path that the driver currently ignores.

🛠️ Technical Deep Dive

•The bottleneck is identified as a 'stall-on-issue' state in the SM90 scheduler, where the warp scheduler fails to hide latency because the cuBLAS kernel does not utilize the asynchronous copy (cp.async) instructions required for the Blackwell TMA unit.
•The custom kernel implementation utilizes a double-buffering strategy in Shared Memory (SRAM), allowing the TMA to pre-fetch the next tile of the matrix while the current tile is being processed by the Tensor Cores.
•NCU (NVIDIA Nsight Compute) profiling reveals that the cuBLAS-dispatched kernel exhibits a high 'L1/TEX Cache Hit Rate' but suffers from 'Instruction Issue Stall: Barrier' due to the lack of overlap between memory movement and compute cycles.

🔮 Future ImplicationsAI analysis grounded in cited sources

NVIDIA will release a driver-level patch for the Blackwell consumer stack within Q2 2026.

The performance gap is significant enough to impact the RTX 5090's market positioning for local LLM fine-tuning, necessitating a firmware or driver-side heuristic update.

Third-party BLAS libraries will gain market share in the enthusiast AI space.

The inability of the official cuBLAS library to saturate Blackwell hardware for common FP32 operations creates an opening for optimized, community-maintained alternatives.

⏳ Timeline

2025-01

NVIDIA Blackwell architecture officially launches with the RTX 5090.

2026-03

Initial community reports emerge on developer forums regarding sub-optimal MatMul performance.

2026-04

Detailed NCU profiling and custom kernel benchmarks are published on Reddit, confirming the cuBLAS dispatch bug.

🤖Read original article on Reddit r/MachineLearning

📰

Weekly AI Recap

Read this week's curated digest of top AI events →

👉Related Updates

Same topic

Explore #matmul

Same product

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning ↗

⚡ 30-Second TL;DR

Key Points

🧠 Deep Insight

🔑 Enhanced Key Takeaways

🛠️ Technical Deep Dive

🔮 Future ImplicationsAI analysis grounded in cited sources

⏳ Timeline

👉Related Updates

Schema harness achieves 99% on ARC-AGI-3 benchmark

New Recurrent Architecture DABSN Seeks Scaling Collaborators

Call for Papers: RTCA Workshop at NeurIPS 2026

Rethinking AI Memory: Beyond Fact Storage to Pattern Inference