DeepSeek AI Releases DeepGEMM: An Optimized FP8 GEMM Library for Dense and MoE Computation
Matrix multiplication lies at the heart of deep learning and high-performance computing. As AI models grow in complexity, efficient General Matrix Multiplication (GEMM) operations become critical for scaling up training and inference workloads. While FP8 (8-bit floating point) arithmetic has gained traction for accelerating computations with lower precision requirements, optimizing GEMM operations for FP8 remains a challenge due to numerical precision issues, hardware bottlenecks, and inefficient software implementations.
To address these challenges, DeepSeek AI has introduced DeepGEMM, a CUDA-based FP8 GEMM library designed to optimize performance for both dense and Mixture-of-Experts (MoE) GEMMs. DeepGEMM serves as a critical building block in DeepSeek V3/R1 model training and inference, ensuring efficient and scalable matrix computations.
Understanding DeepGEMM and Its Importance
Traditional GEMM implementations, such as NVIDIA’s CUTLASS and other CUDA-based solutions, optimize matrix multiplications but often require complex template-based implementations, making them difficult to integrate and customize. Additionally, standard approaches struggle to efficiently handle MoE-based models, where only a subset of experts is activated per computation step.
DeepGEMM is specifically designed to overcome these challenges by:
- Supporting FP8 arithmetic with fine-grained scaling
- Enhancing GEMM performance for dense and MoE architectures
- Utilizing Just-In-Time (JIT) compilation for runtime kernel optimization
- Ensuring compatibility with NVIDIA Hopper Tensor Cores
By integrating FP8 GEMM optimizations directly into AI pipelines, DeepGEMM aims to accelerate computations without compromising accuracy, particularly for large-scale language models and vision models.
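As a rough sketch of what such an integration can look like, the snippet below calls a dense FP8 GEMM through DeepGEMM's Python interface. The function name `gemm_fp8_fp8_bf16_nt` and the convention of passing each FP8 operand together with its per-block scaling factors are based on the public repository, but treat the exact signature and scale layout as assumptions and verify them against the project's README before use.

```python
# Hedged sketch: calling DeepGEMM's dense FP8 GEMM from PyTorch.
# The function name and argument layout are assumptions; check the DeepGEMM
# README for the authoritative interface.
import torch
import deep_gemm  # assumed import name of the library's Python module

M, K, N = 4096, 7168, 4096

# FP8 (E4M3) operands plus FP32 scaling factors (fine-grained, per-128 block).
x_fp8 = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
x_scales = torch.ones(M, K // 128, device="cuda", dtype=torch.float32)
w_fp8 = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn)
w_scales = torch.ones(N // 128, K // 128, device="cuda", dtype=torch.float32)

# Output is accumulated and written in BF16.
out = torch.empty(M, N, device="cuda", dtype=torch.bfloat16)

# "nt": lhs non-transposed, rhs transposed; each operand is a (tensor, scales) pair.
deep_gemm.gemm_fp8_fp8_bf16_nt((x_fp8, x_scales), (w_fp8, w_scales), out)
```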
Technical Innovations in DeepGEMM
DeepGEMM introduces several key innovations that enhance both its usability and performance:
1. FP8 Arithmetic with Fine-Grained Scaling
FP8-based computations offer significant speed improvements over traditional FP16 or FP32 operations but suffer from reduced numerical precision. To mitigate these issues, DeepGEMM employs fine-grained scaling strategies, allowing it to dynamically adjust precision while maintaining computational efficiency.
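A minimal way to picture fine-grained scaling is per-block quantization: instead of one scale for an entire tensor, each small group of elements (128 along the K dimension in DeepSeek's scheme) keeps its own scaling factor, so a local outlier cannot wreck the precision of the rest of the matrix. The helper below is an illustrative PyTorch sketch of that idea, not DeepGEMM's internal kernel code.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def quantize_per_block_fp8(x: torch.Tensor, block: int = 128):
    """Quantize a 2-D tensor to FP8 with one scale per 1 x `block` group along K.

    Illustrative sketch of fine-grained scaling, not DeepGEMM's implementation.
    """
    m, k = x.shape
    assert k % block == 0
    blocks = x.view(m, k // block, block)
    # One scale per (row, block): map each block's max magnitude onto the FP8 range.
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)
    scales = amax / FP8_E4M3_MAX
    x_fp8 = (blocks / scales).to(torch.float8_e4m3fn).view(m, k)
    return x_fp8, scales.squeeze(-1)  # scales: shape (m, k // block)

def dequantize_per_block_fp8(x_fp8: torch.Tensor, scales: torch.Tensor, block: int = 128):
    m, k = x_fp8.shape
    blocks = x_fp8.to(torch.float32).view(m, k // block, block)
    return (blocks * scales.unsqueeze(-1)).view(m, k)
```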
2. Two-Level Accumulation Strategy
One of the key challenges in FP8 tensor computations is the accumulation of imprecise values, leading to accuracy degradation. DeepGEMM implements a two-level accumulation strategy, leveraging CUDA cores to reduce precision loss while preserving computational speed.
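The intuition behind two-level accumulation is that low-precision partial sums are only allowed to run for a limited stretch of the K dimension before being promoted into higher-precision FP32 registers on the CUDA cores, so rounding error cannot build up across the entire reduction. The snippet below emulates that promotion pattern in plain PyTorch purely as an illustration; the real mechanism lives inside DeepGEMM's Hopper kernels.

```python
import torch

def two_level_accumulate_gemm(a: torch.Tensor, b: torch.Tensor, chunk: int = 128):
    """Emulate two-level accumulation: low-precision partial products over short
    K chunks (level 1), promoted into an FP32 accumulator between chunks (level 2).

    Conceptual illustration only; DeepGEMM performs this inside its CUDA kernels.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and k % chunk == 0
    acc = torch.zeros(m, n, dtype=torch.float32)           # level 2: FP32 accumulator
    for start in range(0, k, chunk):
        a_blk = a[:, start:start + chunk].to(torch.bfloat16)
        b_blk = b[start:start + chunk, :].to(torch.bfloat16)
        partial = a_blk @ b_blk                             # level 1: low-precision product
        acc += partial.to(torch.float32)                    # promote before error accumulates
    return acc

# Usage: compare against a full-precision reference.
a = torch.randn(256, 1024)
b = torch.randn(1024, 128)
print((two_level_accumulate_gemm(a, b) - a @ b).abs().max())
```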
3. JIT Compilation for Optimized Kernel Execution
Instead of relying on precompiled kernels, DeepGEMM dynamically compiles optimized kernels at runtime using a lightweight Just-In-Time (JIT) module. This approach eliminates unnecessary precompilation steps, enabling faster deployment and real-time kernel adjustments.
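Conceptually, the JIT path behaves like a small cache: the first time a particular GEMM shape and tiling configuration is seen, a kernel specialized to those constants is compiled and stored, and later calls with the same configuration reuse the compiled binary. The sketch below shows that caching pattern in Python with a placeholder compile step; DeepGEMM's actual JIT module generates and builds CUDA code at runtime.

```python
from dataclasses import dataclass
from functools import lru_cache

@dataclass(frozen=True)
class KernelConfig:
    """Shape and tiling constants baked into a specialized kernel (illustrative)."""
    m: int
    n: int
    k: int
    block_m: int = 128
    block_n: int = 128

@lru_cache(maxsize=None)
def get_or_build_kernel(cfg: KernelConfig):
    # Placeholder for the real JIT step: generate CUDA source with cfg's values
    # as compile-time constants, compile it at runtime, and load the binary.
    print(f"compiling kernel for {cfg} ...")
    return f"kernel<{cfg.m}x{cfg.n}x{cfg.k}>"

# First call for a shape triggers compilation; repeated calls hit the cache.
get_or_build_kernel(KernelConfig(4096, 4096, 7168))
get_or_build_kernel(KernelConfig(4096, 4096, 7168))  # cached, no recompilation
```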
4. Support for Both Dense and MoE GEMMs
DeepGEMM is designed to handle both:
- Standard GEMMs for dense matrix multiplications
- Grouped GEMMs for MoE architectures, which require more flexible computation strategies to accommodate dynamic expert selection
The library introduces two MoE layouts, contiguous and masked, ensuring compatibility with models that allocate variable token counts per expert.
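To make the two layouts concrete, the sketch below emulates them in plain PyTorch. In the contiguous layout, the tokens routed to each expert are concatenated along the M dimension and each segment is multiplied by its expert's weight matrix; in the masked layout, each expert owns a fixed-size slot and a per-expert count marks how many rows in that slot are valid. This is a reference emulation of the layouts only, not DeepGEMM's grouped kernels.

```python
import torch

num_experts, d_model, d_ff = 4, 512, 1024
expert_w = torch.randn(num_experts, d_model, d_ff)   # one weight matrix per expert

# --- Contiguous layout: tokens for each expert concatenated along M ---
tokens_per_expert = [96, 0, 160, 64]                 # variable counts, one expert idle
x_contig = torch.randn(sum(tokens_per_expert), d_model)
out_contig = torch.empty(sum(tokens_per_expert), d_ff)
start = 0
for e, count in enumerate(tokens_per_expert):
    out_contig[start:start + count] = x_contig[start:start + count] @ expert_w[e]
    start += count

# --- Masked layout: fixed-size slot per expert, a count marks the valid rows ---
slot = 192                                           # maximum tokens per expert slot
x_masked = torch.randn(num_experts, slot, d_model)
valid = torch.tensor([96, 0, 160, 64])               # rows actually occupied per slot
out_masked = torch.zeros(num_experts, slot, d_ff)
for e in range(num_experts):
    n = int(valid[e])
    out_masked[e, :n] = x_masked[e, :n] @ expert_w[e]
```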
5. Optimized for NVIDIA Hopper GPUs
DeepGEMM fully utilizes the NVIDIA Hopper Tensor Memory Accelerator (TMA), optimizing data movement and reducing memory bandwidth bottlenecks. This results in improved efficiency, especially for long-sequence processing in AI workloads.
Performance Benchmarks and Comparisons

The performance of DeepGEMM has been rigorously tested across various GEMM configurations on NVIDIA H800 GPUs with NVCC 12.8. The results indicate significant improvements:
| GEMM Type | Speedup (vs. CUTLASS) |
|---|---|
| Standard GEMM | 1.4x – 2.7x |
| MoE GEMM (Contiguous Layout) | 1.1x – 1.2x |
| MoE GEMM (Masked Layout) | 1.1x – 1.3x |
These speedups highlight the efficiency gains achieved through JIT compilation, TMA integration, and optimized FP8 arithmetic.
Why DeepGEMM Matters for AI Model Training and Inference
DeepGEMM is not just an incremental improvement but a game-changer for AI infrastructure optimization. It allows researchers and developers to:
- Accelerate matrix operations in large AI models
- Optimize memory utilization in dense and MoE architectures
- Improve efficiency in low-precision AI model training
- Seamlessly integrate with NVIDIA Hopper Tensor Cores
DeepGEMM’s clean and lightweight architecture makes it ideal for both research and enterprise-level AI deployments.
Conclusion: A Step Forward in FP8 GEMM Optimization
With DeepGEMM, DeepSeek AI has introduced a highly efficient and flexible GEMM solution tailored for modern AI workloads. By combining FP8 arithmetic, JIT-compiled runtime optimizations, and MoE-aware processing, DeepGEMM sets a new standard for AI model training and inference.
As AI models continue to grow in size and complexity, tools like DeepGEMM will play a crucial role in enabling efficient, large-scale deep learning systems.
Would you like to explore DeepGEMM for your AI workflows? Check out the GitHub repository and dive into the future of FP8-optimized AI computing!