๐Ÿฆ™Stalecollected in 3h

FlashAttention-4: 1613 TFLOPs/s on Blackwell

PostLinkedIn
๐Ÿฆ™Read original on Reddit r/LocalLLaMA

๐Ÿ’กAttention kernels now match matmul speed on Blackwellโ€”huge for fast inference

โšก 30-Second TL;DR

What Changed

1613 TFLOPs/s BF16 forward on B200 (71% utilization)

Why It Matters

Dramatically boosts inference speed on new NVIDIA GPUs, making attention as fast as matmul. Enables faster local LLM serving on B200/H100. Python kernel unlocks rapid iteration for developers.

What To Do Next

Update to vLLM 0.17.0 and test on B200 for automatic FlashAttention-4 gains.

Who should care:Developers & AI Engineers

๐Ÿง  Deep Insight

AI-generated analysis for this event.

๐Ÿ”‘ Enhanced Key Takeaways

  • โ€ขFlashAttention-4 utilizes a novel 'persistent kernel' architecture that minimizes global memory round-trips by keeping intermediate attention states in the B200's increased on-chip SRAM, a departure from the tiling strategies used in FlashAttention-2.
  • โ€ขThe CuTe-DSL implementation allows for just-in-time (JIT) specialization of the attention kernel based on specific sequence lengths and head dimensions, reducing the overhead typically associated with static kernel compilation.
  • โ€ขIntegration with PyTorch FlexAttention enables dynamic switching between FlashAttention-4 and standard kernels based on hardware detection, allowing for seamless deployment across mixed-GPU clusters containing both Hopper and Blackwell architectures.
๐Ÿ“Š Competitor Analysisโ–ธ Show
FeatureFlashAttention-4Triton (Standard)cuDNN 9.13
ArchitectureCuTe-DSL (Python)Triton-DSLC++/CUDA
B200 Performance1613 TFLOPs/s~600 TFLOPs/s~1240 TFLOPs/s
Compilation Time2.5sVariable (JIT)Pre-compiled
FlexibilityHigh (JIT specialized)HighLow (Fixed kernels)

๐Ÿ› ๏ธ Technical Deep Dive

  • Kernel Architecture: Employs a persistent thread-block design that maintains KV-cache blocks in registers and L1 cache across multiple iterations, significantly reducing HBM bandwidth pressure.
  • CuTe-DSL Utilization: Leverages the CuTe library's layout algebra to automate the mapping of attention tensors to the Blackwell Tensor Core memory hierarchy, optimizing for the B200's specific warp-level matrix operations.
  • Memory Management: Implements a custom asynchronous copy pipeline that overlaps data movement from HBM to SRAM with the compute-heavy GEMM operations, effectively hiding memory latency.
  • Precision Support: Optimized specifically for BF16 and FP8 (E4M3) accumulation, utilizing the Blackwell-specific hardware support for high-throughput FP8 matrix multiplication.

๐Ÿ”ฎ Future ImplicationsAI analysis grounded in cited sources

FlashAttention-4 will become the default inference kernel for all vLLM-based Blackwell deployments by Q3 2026.
The significant performance lead over cuDNN and Triton provides a clear incentive for the vLLM maintainers to prioritize this kernel for production-grade inference.
The CuTe-DSL approach will lead to a decline in hand-written CUDA kernel development for attention mechanisms.
The ability to achieve near-peak hardware utilization using Python-based DSLs reduces the barrier to entry for high-performance kernel optimization.

โณ Timeline

2022-05
FlashAttention-1 introduced, focusing on IO-awareness to speed up Transformers.
2023-07
FlashAttention-2 released, improving parallelism and work partitioning for better GPU utilization.
2024-09
FlashAttention-3 announced, optimized for the Hopper architecture and FP8 precision.
2026-02
FlashAttention-4 development reaches stable release, targeting Blackwell B200 hardware.
๐Ÿ“ฐ

Weekly AI Recap

Read this week's curated digest of top AI events โ†’

๐Ÿ‘‰Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA โ†—