๐Ÿค–Freshcollected in 10h

cuBLAS 60% MatMul Bug on RTX 5090

PostLinkedIn
๐Ÿค–Read original on Reddit r/MachineLearning

๐Ÿ’กcuBLAS bug tanks RTX 5090 MatMul by 60%โ€”custom kernel fix beats it 170%!

โšก 30-Second TL;DR

What Changed

Inefficient kernel affects batched FP32 from 256ร—256 to 8192ร—8192ร—16 on RTX non-Pro GPUs

Why It Matters

Severely impacts ML training/inference perf on consumer RTX GPUs, affecting many workloads. Workaround kernel provides immediate speedup until NVIDIA fixes.

What To Do Next

Implement the shared TMA double-buffer kernel for batched GEMM on RTX 5090.

Who should care:Developers & AI Engineers

๐Ÿง  Deep Insight

AI-generated analysis for this event.

๐Ÿ”‘ Enhanced Key Takeaways

  • โ€ขThe issue stems from a heuristic failure in the cuBLAS dispatch engine that incorrectly prioritizes legacy register-file-based tiling over the Blackwell architecture's Tensor Memory Accelerator (TMA) pipeline for specific batched FP32 workloads.
  • โ€ขNVIDIA's internal engineering response suggests the bug is limited to the 'consumer' firmware branch of the RTX 5090, as the same cuBLAS library version correctly dispatches optimized TMA kernels on workstation-class Blackwell silicon.
  • โ€ขCommunity-developed workarounds involve forcing a custom CUDA graph that bypasses the cuBLAS heuristic layer, effectively 'pinning' the execution to the high-throughput TMA path that the driver currently ignores.

๐Ÿ› ๏ธ Technical Deep Dive

  • โ€ขThe bottleneck is identified as a 'stall-on-issue' state in the SM90 scheduler, where the warp scheduler fails to hide latency because the cuBLAS kernel does not utilize the asynchronous copy (cp.async) instructions required for the Blackwell TMA unit.
  • โ€ขThe custom kernel implementation utilizes a double-buffering strategy in Shared Memory (SRAM), allowing the TMA to pre-fetch the next tile of the matrix while the current tile is being processed by the Tensor Cores.
  • โ€ขNCU (NVIDIA Nsight Compute) profiling reveals that the cuBLAS-dispatched kernel exhibits a high 'L1/TEX Cache Hit Rate' but suffers from 'Instruction Issue Stall: Barrier' due to the lack of overlap between memory movement and compute cycles.

๐Ÿ”ฎ Future ImplicationsAI analysis grounded in cited sources

NVIDIA will release a driver-level patch for the Blackwell consumer stack within Q2 2026.
The performance gap is significant enough to impact the RTX 5090's market positioning for local LLM fine-tuning, necessitating a firmware or driver-side heuristic update.
Third-party BLAS libraries will gain market share in the enthusiast AI space.
The inability of the official cuBLAS library to saturate Blackwell hardware for common FP32 operations creates an opening for optimized, community-maintained alternatives.

โณ Timeline

2025-01
NVIDIA Blackwell architecture officially launches with the RTX 5090.
2026-03
Initial community reports emerge on developer forums regarding sub-optimal MatMul performance.
2026-04
Detailed NCU profiling and custom kernel benchmarks are published on Reddit, confirming the cuBLAS dispatch bug.
๐Ÿ“ฐ

Weekly AI Recap

Read this week's curated digest of top AI events โ†’

๐Ÿ‘‰Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning โ†—