DeepSeek AI Releases DeepGEMM: An Optimized FP8 GEMM Library for Dense and MoE Computation
Matrix multiplication lies at the heart of deep learning and high-performance computing. As AI models grow in complexity, efficient General Matrix Multiplication (GEMM) operations become critical for scaling up training and inference workloads. While FP8 (8-bit floating point) arithmetic has gained traction for accelerating computations with lower precision requirements, optimizing GEMM operations for FP8 remains a challenge due to numerical precision issues, hardware bottlenecks, and inefficient software implementations.
To address these challenges, DeepSeek AI has introduced DeepGEMM, a CUDA-based FP8 GEMM library designed to optimize performance for both dense and Mixture-of-Experts (MoE) GEMMs. DeepGEMM serves as a critical building block in DeepSeek V3/R1 model training and inference, ensuring efficient and scalable matrix computations.
Understanding DeepGEMM and Its Importance
Traditional GEMM implementations, such as NVIDIA’s CUTLASS and other CUDA-based solutions, optimize matrix multiplications but often require complex template-based implementations, making them difficult to integrate and customize. Additionally, standard approaches struggle to efficiently handle MoE-based models, where only a subset of experts is activated per computation step.
DeepGEMM is specifically designed to overcome these challenges by:
- Supporting FP8 arithmetic with fine-grained scaling
- Enhancing GEMM performance for dense and MoE architectures
- Utilizing Just-In-Time (JIT) compilation for runtime kernel optimization
- Ensuring compatibility with NVIDIA Hopper Tensor Cores
By integrating FP8 GEMM optimizations directly into AI pipelines, DeepGEMM aims to accelerate computations without compromising accuracy, particularly for large-scale language models and vision models.
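As a rough sketch of what such an integration can look like, the snippet below calls a dense FP8 GEMM through DeepGEMM's Python interface. The function name `gemm_fp8_fp8_bf16_nt` and the convention of passing each FP8 operand together with its per-block scaling factors are based on the public repository, but treat the exact signature and scale layout as assumptions and verify them against the project's README before use.

```python
# Hedged sketch: calling DeepGEMM's dense FP8 GEMM from PyTorch.
# The function name and argument layout are assumptions; check the DeepGEMM
# README for the authoritative interface.
import torch
import deep_gemm  # assumed import name of the library's Python module

M, K, N = 4096, 7168, 4096

# FP8 (E4M3) operands plus FP32 scaling factors (fine-grained, per-128 block).
x_fp8 = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
x_scales = torch.ones(M, K // 128, device="cuda", dtype=torch.float32)
w_fp8 = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn)
w_scales = torch.ones(N // 128, K // 128, device="cuda", dtype=torch.float32)

# Output is accumulated and written in BF16.
out = torch.empty(M, N, device="cuda", dtype=torch.bfloat16)

# "nt": lhs non-transposed, rhs transposed; each operand is a (tensor, scales) pair.
deep_gemm.gemm_fp8_fp8_bf16_nt((x_fp8, x_scales), (w_fp8, w_scales), out)
```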
Technical Innovations in DeepGEMM
DeepGEMM introduces several key innovations that enhance both its usability and performance:
1. FP8 Arithmetic with Fine-Grained Scaling
FP8-based computations offer significant speed improvements over traditional FP16 or FP32 operations but suffer from reduced numerical precision. To mitigate these issues, DeepGEMM employs fine-grained scaling strategies, allowing it to dynamically adjust precision while maintaining computational efficiency.
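A minimal way to picture fine-grained scaling is per-block quantization: instead of one scale for an entire tensor, each small group of elements (128 along the K dimension in DeepSeek's scheme) keeps its own scaling factor, so a local outlier cannot wreck the precision of the rest of the matrix. The helper below is an illustrative PyTorch sketch of that idea, not DeepGEMM's internal kernel code.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def quantize_per_block_fp8(x: torch.Tensor, block: int = 128):
    """Quantize a 2-D tensor to FP8 with one scale per 1 x `block` group along K.

    Illustrative sketch of fine-grained scaling, not DeepGEMM's implementation.
    """
    m, k = x.shape
    assert k % block == 0
    blocks = x.view(m, k // block, block)
    # One scale per (row, block): map each block's max magnitude onto the FP8 range.
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)
    scales = amax / FP8_E4M3_MAX
    x_fp8 = (blocks / scales).to(torch.float8_e4m3fn).view(m, k)
    return x_fp8, scales.squeeze(-1)  # scales: shape (m, k // block)

def dequantize_per_block_fp8(x_fp8: torch.Tensor, scales: torch.Tensor, block: int = 128):
    m, k = x_fp8.shape
    blocks = x_fp8.to(torch.float32).view(m, k // block, block)
    return (blocks * scales.unsqueeze(-1)).view(m, k)
```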
2. Two-Level Accumulation Strategy
One of the key challenges in FP8 tensor computations is the accumulation of imprecise values, leading to accuracy degradation. DeepGEMM implements a two-level accumulation strategy, leveraging CUDA cores to reduce precision loss while preserving computational speed.
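The intuition behind two-level accumulation is that low-precision partial sums are only allowed to run for a limited stretch of the K dimension before being promoted into higher-precision FP32 registers on the CUDA cores, so rounding error cannot build up across the entire reduction. The snippet below emulates that promotion pattern in plain PyTorch purely as an illustration; the real mechanism lives inside DeepGEMM's Hopper kernels.

```python
import torch

def two_level_accumulate_gemm(a: torch.Tensor, b: torch.Tensor, chunk: int = 128):
    """Emulate two-level accumulation: low-precision partial products over short
    K chunks (level 1), promoted into an FP32 accumulator between chunks (level 2).

    Conceptual illustration only; DeepGEMM performs this inside its CUDA kernels.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and k % chunk == 0
    acc = torch.zeros(m, n, dtype=torch.float32)           # level 2: FP32 accumulator
    for start in range(0, k, chunk):
        a_blk = a[:, start:start + chunk].to(torch.bfloat16)
        b_blk = b[start:start + chunk, :].to(torch.bfloat16)
        partial = a_blk @ b_blk                             # level 1: low-precision product
        acc += partial.to(torch.float32)                    # promote before error accumulates
    return acc

# Usage: compare against a full-precision reference.
a = torch.randn(256, 1024)
b = torch.randn(1024, 128)
print((two_level_accumulate_gemm(a, b) - a @ b).abs().max())
```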
3. JIT Compilation for Optimized Kernel Execution
Instead of relying on precompiled kernels, DeepGEMM dynamically compiles optimized kernels at runtime using a lightweight Just-In-Time (JIT) module. This approach eliminates unnecessary precompilation steps, enabling faster deployment and real-time kernel adjustments.
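Conceptually, the JIT path behaves like a small cache: the first time a particular GEMM shape and tiling configuration is seen, a kernel specialized to those constants is compiled and stored, and later calls with the same configuration reuse the compiled binary. The sketch below shows that caching pattern in Python with a placeholder compile step; DeepGEMM's actual JIT module generates and builds CUDA code at runtime.

```python
from dataclasses import dataclass
from functools import lru_cache

@dataclass(frozen=True)
class KernelConfig:
    """Shape and tiling constants baked into a specialized kernel (illustrative)."""
    m: int
    n: int
    k: int
    block_m: int = 128
    block_n: int = 128

@lru_cache(maxsize=None)
def get_or_build_kernel(cfg: KernelConfig):
    # Placeholder for the real JIT step: generate CUDA source with cfg's values
    # as compile-time constants, compile it at runtime, and load the binary.
    print(f"compiling kernel for {cfg} ...")
    return f"kernel<{cfg.m}x{cfg.n}x{cfg.k}>"

# First call for a shape triggers compilation; repeated calls hit the cache.
get_or_build_kernel(KernelConfig(4096, 4096, 7168))
get_or_build_kernel(KernelConfig(4096, 4096, 7168))  # cached, no recompilation
```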
4. Support for Both Dense and MoE GEMMs
DeepGEMM is designed to handle both:
- Standard GEMMs for dense matrix multiplications
- Grouped GEMMs for MoE architectures, which require more flexible computation strategies to accommodate dynamic expert selection
The library introduces two MoE layouts, contiguous and masked, ensuring compatibility with models that allocate variable token counts per expert.
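To make the two layouts concrete, the sketch below emulates them in plain PyTorch. In the contiguous layout, the tokens routed to each expert are concatenated along the M dimension and each segment is multiplied by its expert's weight matrix; in the masked layout, each expert owns a fixed-size slot and a per-expert count marks how many rows in that slot are valid. This is a reference emulation of the layouts only, not DeepGEMM's grouped kernels.

```python
import torch

num_experts, d_model, d_ff = 4, 512, 1024
expert_w = torch.randn(num_experts, d_model, d_ff)   # one weight matrix per expert

# --- Contiguous layout: tokens for each expert concatenated along M ---
tokens_per_expert = [96, 0, 160, 64]                 # variable counts, one expert idle
x_contig = torch.randn(sum(tokens_per_expert), d_model)
out_contig = torch.empty(sum(tokens_per_expert), d_ff)
start = 0
for e, count in enumerate(tokens_per_expert):
    out_contig[start:start + count] = x_contig[start:start + count] @ expert_w[e]
    start += count

# --- Masked layout: fixed-size slot per expert, a count marks the valid rows ---
slot = 192                                           # maximum tokens per expert slot
x_masked = torch.randn(num_experts, slot, d_model)
valid = torch.tensor([96, 0, 160, 64])               # rows actually occupied per slot
out_masked = torch.zeros(num_experts, slot, d_ff)
for e in range(num_experts):
    n = int(valid[e])
    out_masked[e, :n] = x_masked[e, :n] @ expert_w[e]
```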
5. Optimized for NVIDIA Hopper GPUs
DeepGEMM fully utilizes the NVIDIA Hopper Tensor Memory Accelerator (TMA), optimizing data movement and reducing memory bandwidth bottlenecks. This results in improved efficiency, especially for long-sequence processing in AI workloads.
Performance Benchmarks and Comparisons

The performance of DeepGEMM has been rigorously tested across various GEMM configurations on NVIDIA H800 GPUs with NVCC 12.8. The results indicate significant improvements:
| GEMM Type | Speedup (vs. CUTLASS) |
|---|---|
| Standard GEMM | 1.4x – 2.7x |
| MoE GEMM (Contiguous Layout) | 1.1x – 1.2x |
| MoE GEMM (Masked Layout) | 1.1x – 1.3x |
These speedups highlight the efficiency gains achieved through JIT compilation, TMA integration, and optimized FP8 arithmetic.
Why DeepGEMM Matters for AI Model Training and Inference
DeepGEMM is not just an incremental improvement but a game-changer for AI infrastructure optimization. It allows researchers and developers to:
- Accelerate matrix operations in large AI models
- Optimize memory utilization in dense and MoE architectures
- Improve efficiency in low-precision AI model training
- Seamlessly integrate with NVIDIA Hopper Tensor Cores
DeepGEMM’s clean and lightweight architecture makes it ideal for both research and enterprise-level AI deployments.
Conclusion: A Step Forward in FP8 GEMM Optimization
With DeepGEMM, DeepSeek AI has introduced a highly efficient and flexible GEMM solution tailored for modern AI workloads. By combining FP8 arithmetic, JIT-compiled runtime optimizations, and MoE-aware processing, DeepGEMM sets a new standard for AI model training and inference.
As AI models continue to grow in size and complexity, tools like DeepGEMM will play a crucial role in enabling efficient, large-scale deep learning systems.
Would you like to explore DeepGEMM for your AI workflows? Check out the GitHub repository and dive into the future of FP8-optimized AI computing!