DeepSeek’s Breakthrough: Model Innovations That Slash Training Costs
A summary of the DeepSeek innovations I found most interesting.
In Isaac Asimov’s classic science fiction epic Foundation, Hari Seldon moves his team of psychohistorians to a desolate planet called Terminus on the outskirts of the Galactic Empire. One of the reasons he does this is to ensure that his group of scientists are forced to work with limited resources, making the technology they develop highly efficient.
I have a theory that this is exactly what happened with the team behind DeepSeek. Facing restrictions on chip imports due to sanctions, they were forced to push the boundaries of efficiency. But these innovations are not just relevant for Chinese companies. Throughout this iteration of the AI hype cycle, the dominant narrative has been that better models require more compute power—a level of compute that is out of reach for most businesses and hobbyists. What DeepSeek has done is shift this narrative. They’ve shown that participation in the AI boom isn’t limited to giants like Meta, Google, Microsoft, or OpenAI.
DeepSeek has introduced two major categories of innovation: Infrastructure and Model Training.
On the infrastructure side, they’ve made two key advancements:
Improved Communication Between Training Nodes – DeepSeek developed their own networking layer to optimize how GPUs communicate within the training cluster, using PTX (Nvidia's low-level, assembly-like instruction set for its GPUs).
FP8 Mixed-Precision Training – They introduced a framework to balance model precision and compute efficiency, reducing computational overhead.
I won’t delve deeper into these aspects since infrastructure isn’t my specialty. But where DeepSeek really excites me is in model training, where I see four major innovations that could have a lasting impact on the industry:
1. Multi-Head Latent Attention (MLA)
For an LLM to process input text, it first converts it into a numerical representation called an embedding (see my previous post: Teaching a Machine to Read: How LLMs Comprehend Text). Once we have this embedding, the model uses an attention mechanism to determine which parts of the input are most important.
Consider the sentence: The farmer has a horse. The model might decide that “farmer” and “horse” are the key words, assigning attention weights like [0, 0.5, 0, 0, 0.5]. In modern LLMs, this process isn’t done once but across multiple attention heads, each capturing different aspects of importance. These are then combined into a single attention representation.
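To make that concrete, here is a tiny NumPy sketch of the idea. The two "heads" and their raw scores are made up for illustration: each head scores the five tokens, a softmax turns the scores into weights, and the per-head views are then combined (here simply averaged, which is a simplification of how real models merge heads).

```python
import numpy as np

# Toy illustration: 5 tokens ("The farmer has a horse"), 2 attention heads.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical raw relevance scores per head (not from a real model).
scores_head_1 = np.array([0.0, 5.0, 0.0, 0.0, 5.0])   # focuses on "farmer" and "horse"
scores_head_2 = np.array([0.0, 0.0, 4.0, 0.0, 4.0])   # focuses on "has" and "horse"

weights_head_1 = softmax(scores_head_1)   # roughly [0, 0.5, 0, 0, 0.5]
weights_head_2 = softmax(scores_head_2)

# Combine the per-head views into one representation (averaged here for simplicity).
combined = (weights_head_1 + weights_head_2) / 2
print(combined.round(2))
```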
DeepSeek makes this mechanism more efficient by applying a low-rank approximation: the embedding vectors are compressed into a smaller latent representation before the attention calculations are performed. Normally, an embedding size of 1024 with 8 attention heads would mean storing 8,192 values per token. If DeepSeek compresses each vector by a factor of 4, so that each head works with only 256 values, memory usage during training and inference drops by 75%, without sacrificing model performance.
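Below is a minimal sketch of that memory arithmetic, using the illustrative sizes from the paragraph above. The projection weights and shapes are made up for the example, and the real MLA design shares a single compressed latent across heads rather than compressing each head separately, but the 75% saving falls out the same way.

```python
import numpy as np

# Minimal sketch of the low-rank compression idea using the numbers from the text.
d_model, n_heads = 1024, 8
d_compressed = d_model // 4              # 256 values per head after compression

rng = np.random.default_rng(0)
token_embedding = rng.normal(size=d_model)

# One hypothetical down-projection per head: 1024 -> 256.
W_down = rng.normal(size=(n_heads, d_model, d_compressed)) * 0.02
per_head_latents = np.einsum("d,hdc->hc", token_embedding, W_down)

full_cache = n_heads * d_model             # 8 * 1024 = 8192 values per token
compressed_cache = per_head_latents.size   # 8 * 256  = 2048 values per token
print(f"memory saved: {1 - compressed_cache / full_cache:.0%}")  # -> 75%
```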
2. DeepSeek-MOE (Mixture of Experts) Architecture
DeepSeek uses a Mixture of Experts (MoE) approach to optimize computational efficiency. DeepSeek-V3 is a massive model with 671 billion parameters, but instead of being one monolithic network, it is built from many smaller expert sub-networks. For each token, only a small subset of these experts is activated, about 37 billion parameters in total, so the full model is never in use at once, significantly reducing training and inference costs.
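A toy router makes this concrete. The expert count and top-k value below are invented for illustration and are far smaller than DeepSeek-V3's actual configuration; the point is simply that only a few experts do work for any single token.

```python
import numpy as np

# Illustrative top-k router: only a small fraction of all experts is active per token.
n_experts, top_k, d = 16, 2, 8
rng = np.random.default_rng(0)

token = rng.normal(size=d)
router_weights = rng.normal(size=(d, n_experts)) * 0.1

scores = token @ router_weights               # one affinity score per expert
chosen = np.argsort(scores)[-top_k:]          # indices of the k best experts
gate = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()

print("experts activated for this token:", chosen, "with gates", gate.round(2))
# All other experts stay idle, which is why a 671B-parameter model can run
# while touching only ~37B parameters per token.
```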
Balancing expert usage is a critical challenge in MoE models. A common failure mode, routing collapse, occurs when certain experts are overused while others remain under-trained. Traditionally, this is mitigated with an auxiliary loss that penalizes uneven expert selection, but leaning too heavily on such a loss can interfere with the main training objective and degrade model quality.
DeepSeek introduces a more efficient method by dynamically adjusting bias parameters between training steps. If an expert is overused, its selection bias is decreased, making it less likely to be chosen. Conversely, underused experts have their bias increased, improving their selection likelihood. This keeps expert usage balanced without an extra loss term pulling against the main objective.
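Here is a rough sketch of that bias-adjustment loop. The update rule, step size, and expert counts are simplified assumptions rather than DeepSeek's exact implementation, but they show the mechanism: the bias only influences which experts get selected, and it drifts between steps to counteract imbalanced usage.

```python
import numpy as np

n_experts, top_k, bias_step = 8, 2, 0.01
bias = np.zeros(n_experts)    # per-expert routing bias (hypothetical)
load = np.zeros(n_experts)    # how often each expert was picked this step

def route(scores):
    # The bias is added only when choosing experts; it does not change their outputs.
    chosen = np.argsort(scores + bias)[-top_k:]
    load[chosen] += 1
    return chosen

def update_bias():
    # After each training step: nudge overused experts down, underused experts up.
    global bias, load
    overused = load > load.mean()
    bias = bias - bias_step * overused + bias_step * ~overused
    load = np.zeros(n_experts)

rng = np.random.default_rng(0)
for _ in range(100):                   # simulate routing a batch of tokens
    route(rng.normal(size=n_experts))
update_bias()
print(bias.round(3))
```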
3. Multi-Token Prediction
Most LLMs predict one token at a time. DeepSeek changes this by predicting multiple tokens simultaneously, which offers two major advantages:
Faster Learning: Instead of getting feedback on a single token per step, the model learns from multiple tokens, accelerating training.
Improved Text Understanding: Predicting multiple tokens forces the model to plan ahead, leading to a better grasp of long-range dependencies in text.
Additionally, multi-token prediction allows the model to share resources between predictions. For instance, embeddings are computed once and then reused across multiple prediction layers, improving efficiency.
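A minimal sketch of that sharing, with made-up sizes and weights: the trunk's hidden state is computed once, and several small heads reuse it, one per future token. This is only an illustration of the resource-sharing idea, not DeepSeek's exact multi-token prediction module.

```python
import numpy as np

d_model, vocab_size, n_future = 64, 1000, 2

rng = np.random.default_rng(0)
hidden = rng.normal(size=d_model)      # trunk output for the current position, computed once

# Each extra head reuses the same hidden state instead of recomputing it.
heads = [rng.normal(size=(d_model, vocab_size)) * 0.02 for _ in range(n_future)]

predictions = [int(np.argmax(hidden @ W)) for W in heads]
print("predicted upcoming tokens (ids):", predictions)
# Training gets a loss signal from every head at once, so each step
# teaches the model about several upcoming tokens instead of just one.
```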
4. Group Relative Policy Optimization (GRPO)
Traditionally, reinforcement learning for training LLMs follows an Actor-Critic model: an Actor model generates text, and a separate Critic model provides feedback on the quality of the output. This means training two models—one for generation and another for evaluation.
DeepSeek introduces Group Relative Policy Optimization (GRPO) to simplify this process. Instead of training a separate critic model, GRPO evaluates the quality of multiple predictions within a group:
Suppose the model generates a group of 12 responses to the same prompt, and each response receives a reward score, for example based on whether its answer is correct.
Responses that score above the group's average get a positive advantage and are reinforced, while those below the average are discouraged; the group's own statistics take over the role of the critic's value estimate.
By eliminating the need for a dedicated critic model, DeepSeek can focus computational power entirely on improving the actor model, reducing training costs.
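Here is a small sketch of the group-relative scoring at the heart of GRPO. It is simplified: the full objective also involves a clipped policy ratio and a KL penalty, and the reward values below are hypothetical.

```python
import numpy as np

# One prompt, a group of sampled responses, one scalar reward per response.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])   # hypothetical scores

# Each response is judged against its own group: above-average responses get a
# positive advantage (reinforced), below-average ones a negative advantage.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages.round(2))
# No separate critic network is needed; the group average plays that role.
```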
Conclusion
DeepSeek’s innovations demonstrate that model training and inference can be made more approachable. Don’t get me wrong, it is still really expensive, just not as expensive as commonly assumed.
This is particularly exciting because it lowers the barrier to entry. The AI arms race has largely been dominated by companies with massive GPU clusters, but DeepSeek’s approach suggests a different future—one where AI advancements are accessible to more developers, businesses, and even hobbyists.
As we move forward, we should ask ourselves: What other inefficiencies exist in AI today? If DeepSeek has shown us anything, it’s that necessity breeds innovation, and efficiency may just be the key to the next breakthrough.