๐คReddit r/MachineLearningโขFreshcollected in 10h
cuBLAS 60% MatMul Bug on RTX 5090
๐กcuBLAS bug tanks RTX 5090 MatMul by 60%โcustom kernel fix beats it 170%!
โก 30-Second TL;DR
What Changed
Inefficient kernel affects batched FP32 from 256ร256 to 8192ร8192ร16 on RTX non-Pro GPUs
Why It Matters
Severely impacts ML training/inference perf on consumer RTX GPUs, affecting many workloads. Workaround kernel provides immediate speedup until NVIDIA fixes.
What To Do Next
Implement the shared TMA double-buffer kernel for batched GEMM on RTX 5090.
Who should care:Developers & AI Engineers
๐ง Deep Insight
AI-generated analysis for this event.
๐ Enhanced Key Takeaways
- โขThe issue stems from a heuristic failure in the cuBLAS dispatch engine that incorrectly prioritizes legacy register-file-based tiling over the Blackwell architecture's Tensor Memory Accelerator (TMA) pipeline for specific batched FP32 workloads.
- โขNVIDIA's internal engineering response suggests the bug is limited to the 'consumer' firmware branch of the RTX 5090, as the same cuBLAS library version correctly dispatches optimized TMA kernels on workstation-class Blackwell silicon.
- โขCommunity-developed workarounds involve forcing a custom CUDA graph that bypasses the cuBLAS heuristic layer, effectively 'pinning' the execution to the high-throughput TMA path that the driver currently ignores.
๐ ๏ธ Technical Deep Dive
- โขThe bottleneck is identified as a 'stall-on-issue' state in the SM90 scheduler, where the warp scheduler fails to hide latency because the cuBLAS kernel does not utilize the asynchronous copy (cp.async) instructions required for the Blackwell TMA unit.
- โขThe custom kernel implementation utilizes a double-buffering strategy in Shared Memory (SRAM), allowing the TMA to pre-fetch the next tile of the matrix while the current tile is being processed by the Tensor Cores.
- โขNCU (NVIDIA Nsight Compute) profiling reveals that the cuBLAS-dispatched kernel exhibits a high 'L1/TEX Cache Hit Rate' but suffers from 'Instruction Issue Stall: Barrier' due to the lack of overlap between memory movement and compute cycles.
๐ฎ Future ImplicationsAI analysis grounded in cited sources
NVIDIA will release a driver-level patch for the Blackwell consumer stack within Q2 2026.
The performance gap is significant enough to impact the RTX 5090's market positioning for local LLM fine-tuning, necessitating a firmware or driver-side heuristic update.
Third-party BLAS libraries will gain market share in the enthusiast AI space.
The inability of the official cuBLAS library to saturate Blackwell hardware for common FP32 operations creates an opening for optimized, community-maintained alternatives.
โณ Timeline
2025-01
NVIDIA Blackwell architecture officially launches with the RTX 5090.
2026-03
Initial community reports emerge on developer forums regarding sub-optimal MatMul performance.
2026-04
Detailed NCU profiling and custom kernel benchmarks are published on Reddit, confirming the cuBLAS dispatch bug.
๐ฐ
Weekly AI Recap
Read this week's curated digest of top AI events โ
๐Related Updates
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning โ