Notes on DeepSeek Models

Kevin 吴嘉文, about a 15-minute read

DeepSeek-MoE

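Training: The model was trained on 2T tokens of Chinese and English text and performs about as well as DeepSeek 7B and LLaMA 2 7B.
Performance: At inference time DeepSeekMoE 16B activates only about 2.8B parameters, and its total FLOPs are 39.6% of LLaMA 2 7B's, so it runs faster while remaining competitive in quality.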

Architecture: The main ideas in DeepSeekMoE 16B are fine-grained expert segmentation and shared expert isolation.

Fine-grained Expert Segmentation

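As shown in panel B of the figure above, DeepSeek-MoE shrinks the FFN intermediate hidden dimension of each expert while increasing the number of activated experts, so that the total number of activated expert parameters stays the same. The DeepSeekMoE paper argues that the larger number of possible expert combinations helps the gate select experts more accurately.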

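For example, with 16 experts and top-2 routing there are $\binom{16}{2}=120$ possible combinations of activated experts. If each expert is made 4 times smaller, the number of experts grows to 64, and top-8 routing is used, the number of combinations becomes $\binom{64}{8}=4{,}426{,}165{,}368$.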
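The two counts can be checked with a few lines of Python. This is only a sanity check of the arithmetic; the helper name `expert_combinations` is made up for this note.

```python
import math

def expert_combinations(num_experts: int, top_k: int) -> int:
    """Number of distinct sets of top_k experts out of num_experts."""
    return math.comb(num_experts, top_k)

coarse = expert_combinations(16, 2)   # 16 experts, top-2 routing
fine = expert_combinations(64, 8)     # 64 smaller experts, top-8 routing
print(coarse, fine)
```

The activated parameter count is the same in both settings (2 full-size experts versus 8 quarter-size experts), while the number of possible combinations grows by several orders of magnitude.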

Shared Expert Isolation

As shown in panel C of the figure above, some experts are designated as shared experts and are always activated, regardless of routing.

DeepSeek-V2

Links: GitHub, paper, model weights, Hugging Face model code.

The DeepSeek-V2 paper releases two versions, one small and one large: DeepSeek-V2-Lite and DeepSeek-V2.

Model size: DeepSeek-V2 has 236B parameters in total, of which 21B are activated per token. The detailed architecture hyperparameters are in the Hugging Face DeepSeek V2 config.
Model performance:

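Inference speed:

DeepSeek-V2 quantizes the KV cache and converts the parameters to FP8. Deployed on a single node with 8 H800 GPUs, DeepSeek-V2 reaches a throughput of about 50K tokens per second.

Quality: the model is stronger in Chinese, and in English it is comparable to Mixtral 8x22B: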

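Architecture highlights: The model replaces MHA with MLA, and its MoE layers use the fine-grained expert segmentation and shared expert isolation of DeepSeekMoE. The overall DeepSeek layer architecture is as follows:

The corresponding Hugging Face implementation:

Hugging Face code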

class DeepseekV2DecoderLayer(nn.Module):
    def __init__(self, config: DeepseekV2Config, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size

        self.self_attn = ATTENTION_CLASSES[config._attn_implementation](
            config=config, layer_idx=layer_idx
        )

        self.mlp = (
            DeepseekV2MoE(config)
            if (
                config.n_routed_experts is not None
                and layer_idx >= config.first_k_dense_replace
                and layer_idx % config.moe_layer_freq == 0
            )
            else DeepseekV2MLP(config)
        )
        self.input_layernorm = DeepseekV2RMSNorm(
            config.hidden_size, eps=config.rms_norm_eps
        )
        self.post_attention_layernorm = DeepseekV2RMSNorm(
            config.hidden_size, eps=config.rms_norm_eps
        )

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        **kwargs,
    ) -> Tuple[
        torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]
    ]:
        if "padding_mask" in kwargs:
            warnings.warn(
                "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`"
            )
        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
            **kwargs,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs

RMS norm

class DeepseekV2RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        DeepseekV2RMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)

DeepSeekMoE

As the figure above shows, DeepSeek-V2 also uses the DeepSeekMoE design, with 2 shared experts and 160 routed experts, of which only 6 are activated per token.

Multi-Head Latent Attention

The DeepSeek-V2 paper puts particular emphasis on this optimization.

Low-Rank Key-Value Joint Compression

To reduce the KV cache, MLA changes how k and v are computed:

\begin{aligned} \bold c_t^{KV} &= W^{DKV}\bold h_t\\ \bold k_t^C &= W^{UK}\bold c^{KV}_t\\ \bold v_t^C &= W^{UV}\bold c^{KV}_t\\ \end{aligned}

Here $c_t^{KV} \in \mathbb{R}^{d_c}$, $W^{DKV} \in \mathbb R^{d_c\times d}$, and $W^{UK} \in \mathbb{R}^{d_h n_h \times d_c}$, with compressed dimension $d_c\ll d_h n_h$. $d_c$ is kv_lora_rank and $d_h$ is head_dim (covering both q_head_dim and v_head_dim).

The computation of q becomes:

\begin{aligned} \bold c_t^{Q} &= W^{DQ}\bold h_t\\ \bold q_t^C &= W^{UQ}\bold c^{Q}_t\\ \end{aligned}

Here $c_t^Q \in \mathbb{R}^{d_c'}$, $W^{DQ} \in \mathbb R^{d_c'\times d}$, and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$. $d_c'$ is q_lora_rank, $d$ is hidden_size, $d_h$ is q_head_dim, and $n_h$ is num_head.

So both k and v are derived from $\bold c_t^{KV}$. At inference time, standard MHA has to cache $k, v$, but with the change above only $\bold c_t^{KV}$ needs to be cached. The $q, k$ dot product then becomes:

\begin{aligned} q^Tk & =(W^{UQ}W^{DQ}\bold h_t)^T(W^{UK}\bold c^{KV}_t)\\ &=\bold h_t^T((W^{UQ}W^{DQ})^TW^{UK})\bold c^{KV}_t \end{aligned}

At inference time $(W^{UQ}W^{DQ})^TW^{UK}$ can be merged into a single matrix, which shrinks the cache without adding much computation.
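The identity behind this merge can be checked numerically. The following is a minimal single-head sketch in plain Python with toy dimensions (all sizes and helper names are assumptions for illustration, and the RMSNorm layers of the real implementation are left out); it compares the naive score with the score computed from the cached latent and the merged matrix.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, d_c, d_cq, d_h = 8, 3, 4, 5        # toy sizes: hidden, kv rank, q rank, head dim

W_DQ = rand_matrix(d_cq, d, rng)      # down-projection for q
W_UQ = rand_matrix(d_h, d_cq, rng)    # up-projection for q
W_DKV = rand_matrix(d_c, d, rng)      # down-projection shared by k and v
W_UK = rand_matrix(d_h, d_c, rng)     # up-projection for k

h_q = rand_matrix(d, 1, rng)          # hidden state of the query token (column vector)
h_k = rand_matrix(d, 1, rng)          # hidden state of an earlier token

c_kv = matmul(W_DKV, h_k)             # the only tensor that is cached for that token

# Naive path: materialize q and k, then take the dot product.
q = matmul(W_UQ, matmul(W_DQ, h_q))
k = matmul(W_UK, c_kv)
naive_score = matmul(transpose(q), k)[0][0]

# Merged path: precompute (W_UQ W_DQ)^T W_UK once and apply it to the cached latent.
W_merged = matmul(transpose(matmul(W_UQ, W_DQ)), W_UK)    # shape d x d_c
merged_score = matmul(matmul(transpose(h_q), W_merged), c_kv)[0][0]

print(naive_score, merged_score)
```

Both paths give the same score up to floating-point error, so k never has to be reconstructed from the cache at decoding time.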

Decoupled Rotary Position Embedding

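One problem with the scheme above is that it is incompatible with RoPE. With RoPE, $q$ is no longer simply $W\bold h$; it also has to be multiplied by the position-dependent rotation matrix $R$, so $(W^{UQ}W^{DQ})^TW^{UK}$ can no longer be merged so simply.

MLA adopts the following decoupled RoPE scheme: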

\begin{aligned} \left[ q_{t,1}^R ; q_{t,2}^R ; \cdots ; q_{t,n_h}^R \right] &= q_t^R = \text{RoPE}(W^{QR} \bold c_t^Q), \quad \quad \quad \\ \bold k_t^R &= \text{RoPE}(W^{KR} \bold h_t), \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \\ \bold q_{t,i} &= \left[\bold q_{t,i}^C ; \bold q_{t,i}^R \right], \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \\ \bold k_{t,i} &= \left[\bold k_{t,i}^C ; \bold k_{t}^R \right], \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \\ \bold o_{t,i} &= \sum_{j=1}^t \text{Softmax}_j \left( \frac{\bold q_{t,i}^T \bold k_{j,i}}{\sqrt{d_h + d_h^R}} \right) \bold v_{j,i}^C, \quad \quad \\ \bold u_t &= W^O \left[ o_{t,1} ; o_{t,2} ; \cdots ; o_{t,n_h} \right], \quad \quad \quad \quad \quad \end{aligned}

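The rough idea is to append a few extra dimensions to the original q and k and use them to inject RoPE position information. Notably, the extra key dimensions $\bold k_{t}^R$ are shared across all heads. Here $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$ and $W^{KR} \in \mathbb{R}^{d_h^R \times d}$, so the per-head dimension of q and k grows to $(d_h + d_h^R)$.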
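To get a feel for the cache saving, the sketch below counts cached elements per token per layer. With decoupled RoPE, MLA caches the latent $\bold c_t^{KV}$ plus the shared $\bold k_t^R$, while MHA caches full keys and values for every head. The sizes are the illustrative ones written in the code comments of this note (32 heads, head dimension 128, kv_lora_rank 512, qk_rope_head_dim 64); they are assumptions for the example, not a statement about the released config.

```python
def kv_cache_elems_per_token(n_heads, head_dim, kv_lora_rank, rope_dim):
    """Cached elements per token per layer for MHA and for MLA."""
    mha = 2 * n_heads * head_dim      # keys and values for every head
    mla = kv_lora_rank + rope_dim     # compressed latent plus shared RoPE key
    return mha, mla

mha, mla = kv_cache_elems_per_token(
    n_heads=32, head_dim=128, kv_lora_rank=512, rope_dim=64
)
ratio = mha / mla
print(mha, mla, round(ratio, 1))
```

With these numbers MLA stores 576 elements per token per layer against 8192 for MHA, roughly a 14x reduction; the ratio grows with the number of heads.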

For a deeper treatment of MLA, see the blog post "缓存与效果的极限拉扯：从 MHA、MQA、GQA 到 MLA" or the original DeepSeek-V2 paper.

For reference, the Hugging Face implementation of DeepseekV2Attention:

class DeepseekV2Attention(nn.Module):
    # Parts of the code that are not essential are omitted in this note

    def __init__(self, config: DeepseekV2Config, layer_idx: Optional[int] = None):
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx
        self.attention_dropout = config.attention_dropout
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads

        self.max_position_embeddings = config.max_position_embeddings
        self.rope_theta = config.rope_theta
        self.q_lora_rank = config.q_lora_rank
        self.qk_rope_head_dim = config.qk_rope_head_dim
        self.kv_lora_rank = config.kv_lora_rank
        self.v_head_dim = config.v_head_dim
        self.qk_nope_head_dim = config.qk_nope_head_dim
        self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim

        self.is_causal = True

        self.q_a_proj = nn.Linear(
            self.hidden_size, config.q_lora_rank, bias=config.attention_bias
        )  # W^{DQ}; q_lora_rank corresponds to d_c'
        self.q_a_layernorm = DeepseekV2RMSNorm(config.q_lora_rank)
        self.q_b_proj = nn.Linear(
            config.q_lora_rank, self.num_heads * self.q_head_dim, bias=False  # 1536, 32 * (128+64)
        ) # W^{UQ}

        self.kv_a_proj_with_mqa = nn.Linear(
            self.hidden_size,  # 4096
            config.kv_lora_rank + config.qk_rope_head_dim,  # 512 + 64
            bias=config.attention_bias,
        )
        self.kv_a_layernorm = DeepseekV2RMSNorm(config.kv_lora_rank)
        self.kv_b_proj = nn.Linear(
            config.kv_lora_rank,
            self.num_heads
            * (self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim),
            bias=False,
        )

        self.o_proj = nn.Linear(
            self.num_heads * self.v_head_dim,
            self.hidden_size,
            bias=config.attention_bias,
        )
        self._init_rope()

        self.softmax_scale = self.q_head_dim ** (-0.5)

    def _init_rope(self):
        self.rotary_emb = DeepseekV2RotaryEmbedding(
            self.qk_rope_head_dim,
            max_position_embeddings=self.max_position_embeddings,
            base=self.rope_theta,
        )

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
        **kwargs,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        bsz, q_len, _ = hidden_states.size()

        q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))  # q = W^{UQ}(norm(W^{DQ}h))  [batch size, len_seq, num_head * (128+64)]
        q = q.view(bsz, q_len, self.num_heads, self.q_head_dim).transpose(1, 2)
        q_nope, q_pe = torch.split(
            q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1
        )  
        # [bsz, num_head, q_len, 128] 
        # [bsz, num_head, q_len, 64]

        compressed_kv = self.kv_a_proj_with_mqa(hidden_states)  # [bsz, len_seq, 512 + 64]
        
        compressed_kv, k_pe = torch.split(
            compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
        ) 
        k_pe = k_pe.view(bsz, q_len, 1, self.qk_rope_head_dim).transpose(1, 2) # [bsz, 1, q_len, 64]
        kv = (
            self.kv_b_proj(self.kv_a_layernorm(compressed_kv))  # kv_b_proj stacks W^{UK} and W^{UV}, applied to norm(c_t^{KV})
            .view(bsz, q_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim)
            .transpose(1, 2)
        )  # [bsz, num_head, q_len, 128 + 128]

        k_nope, value_states = torch.split(
            kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1
        )
        kv_seq_len = value_states.shape[-2]

        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)

        # Apply the decoupled RoPE scheme
        q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids)

        query_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
        query_states[:, :, :, : self.qk_nope_head_dim] = q_nope
        query_states[:, :, :, self.qk_nope_head_dim :] = q_pe

        key_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
        key_states[:, :, :, : self.qk_nope_head_dim] = k_nope
        key_states[:, :, :, self.qk_nope_head_dim :] = k_pe

        attn_weights = (
            torch.matmul(query_states, key_states.transpose(2, 3)) * self.softmax_scale
        )

        # upcast attention to fp32
        attn_weights = nn.functional.softmax(
            attn_weights, dim=-1, dtype=torch.float32
        ).to(query_states.dtype)
        attn_weights = nn.functional.dropout(
            attn_weights, p=self.attention_dropout, training=self.training
        )
        attn_output = torch.matmul(attn_weights, value_states)

        attn_output = attn_output.transpose(1, 2).contiguous()

        attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.v_head_dim)

        attn_output = self.o_proj(attn_output)

        return attn_output, attn_weights, past_key_value

The paper compares MLA with MHA and GQA:

MLA needs a much smaller KV cache than MHA and GQA.

While its KV cache is smaller than that of MHA, MLA's quality does not suffer much.

Auxiliary Loss for Load Balance

During training, DeepSeek-V2 adds auxiliary terms to the loss so that every expert receives adequate training. Let $u_t$ be the FFN input; the FFN output is then computed as:

\begin{align*} h'_t &= u_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}\bigl(u_t\bigr) + \sum_{i=1}^{N_r} g_{i,t}\,\mathrm{FFN}_i^{(r)}\bigl(u_t\bigr), \\[1ex] g_{i,t} &= \begin{cases} s_{i,t}, & s_{i,t} \in \mathrm{Topk}\bigl(\{\,s_{j,t}\mid 1 \le j \le N_r\},\,K_r\bigr), \\[1ex] 0, & \text{otherwise}, \end{cases} \qquad s_{i,t} = \mathrm{Softmax}_i\bigl(u_t^{T} e_i\bigr). \end{align*}

Here $N_s$ and $N_r$ are the numbers of shared and routed experts, $\mathrm{FFN}^{(s)}_i$ and $\mathrm{FFN}^{(r)}_i$ are the $i$-th shared and routed expert, $K_r$ is the number of activated routed experts, and $g_{i,t}$ is the gate value.
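The formula can be traced with a small plain-Python sketch. Everything here is illustrative: the experts are toy functions that scale their input, and the affinity logits $u_t^T e_i$ are passed in directly instead of being computed from expert centroids.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gate_values(affinity_logits, k_r):
    """g_{i,t}: keep s_{i,t} for the top-K_r routed experts, zero elsewhere."""
    s = softmax(affinity_logits)
    top = set(sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:k_r])
    return [s[i] if i in top else 0.0 for i in range(len(s))]

def moe_output(u, shared_experts, routed_experts, affinity_logits, k_r):
    """h'_t = u_t + sum of shared experts + gated sum of selected routed experts."""
    g = gate_values(affinity_logits, k_r)
    out = list(u)                                    # residual term u_t
    for ffn in shared_experts:                       # shared experts: always active
        out = [o + y for o, y in zip(out, ffn(u))]
    for g_i, ffn in zip(g, routed_experts):          # routed experts: only if selected
        if g_i > 0.0:
            out = [o + g_i * y for o, y in zip(out, ffn(u))]
    return out, g

def scale_by(c):
    """Toy stand-in for an expert FFN (hypothetical, for illustration only)."""
    return lambda x: [c * v for v in x]

u = [1.0, 2.0]
shared = [scale_by(2.0)]
routed = [scale_by(c) for c in (1.0, 10.0, 100.0, 1000.0)]
h, g = moe_output(u, shared, routed, affinity_logits=[0.0, 2.0, 1.0, -1.0], k_r=2)
print(g, h)
```

Only the two experts with the largest affinity get a nonzero gate value; the other routed experts are never evaluated, which is where the compute saving comes from.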

The auxiliary training losses consist of the following parts:

Expert-Level Balance Loss:

\begin{aligned} \mathcal{L}_{\mathrm{ExpBal}} &= \alpha_{1} \sum_{i=1}^{N_{r}} f_{i} \, P_{i}, \\[1em] f_{i} &= \frac{N_{r}}{K_{r}\,T} \sum_{t=1}^{T} \mathbf{1}\bigl(\text{Token } t \text{ selects Expert } i\bigr), \\[1em] P_{i} &= \frac{1}{T} \sum_{t=1}^{T} s_{i,t}. \end{aligned}

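Here $\alpha_1$ is a hyperparameter and $T$ is the number of tokens in the sequence. When routing is unbalanced and some routed expert is selected much more often than the others, $\mathcal{L}_{\mathrm{ExpBal}}$ increases.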
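A minimal plain-Python sketch of this loss, assuming the routing decisions and affinity scores are already available as lists (the function name and data layout are made up for this note):

```python
def expert_balance_loss(selected, scores, n_routed, k_r, alpha1):
    """Expert-level balance loss.

    selected[t] is the set of routed-expert indices chosen for token t,
    scores[t][i] is the affinity s_{i,t}.
    """
    T = len(selected)
    loss = 0.0
    for i in range(n_routed):
        hits = sum(1 for chosen in selected if i in chosen)
        f_i = n_routed / (k_r * T) * hits               # normalized selection frequency
        p_i = sum(scores[t][i] for t in range(T)) / T   # mean affinity
        loss += f_i * p_i
    return alpha1 * loss

# Balanced routing: 4 tokens, 4 experts, each token picks a different expert.
balanced = expert_balance_loss(
    selected=[{0}, {1}, {2}, {3}],
    scores=[[0.25] * 4 for _ in range(4)],
    n_routed=4, k_r=1, alpha1=1.0,
)

# Collapsed routing: every token picks expert 0 with a high affinity.
skewed = expert_balance_loss(
    selected=[{0}, {0}, {0}, {0}],
    scores=[[0.7, 0.1, 0.1, 0.1] for _ in range(4)],
    n_routed=4, k_r=1, alpha1=1.0,
)
print(balanced, skewed)
```

In this toy case the loss is 1.0 for balanced routing and 2.8 when all tokens collapse onto one expert, which matches the qualitative claim above.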

Device-Level Balance Loss

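The device-level balance loss has the same form as the expert-level balance loss and is used to balance the computational load across devices. During DeepSeek-V2 training, all routed experts are partitioned into D groups $\{\mathcal{E}_1,\mathcal{E}_2, ..., \mathcal{E}_D\}$, and each group is deployed on its own device. The device-level balance loss is computed as: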

\begin{align*} \mathcal{L}_{\mathrm{DevBal}} &= \alpha_{2} \sum_{i=1}^{D} f'_i \, P'_i, \\[1ex] f'_i &= \frac{1}{\lvert \mathcal{E}_i \rvert} \sum_{j \in \mathcal{E}_i} f_j, \\[1ex] P'_i &= \sum_{j \in \mathcal{E}_i} P_j. \end{align*}

Here $\alpha_2$ is a hyperparameter.
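The device-level loss only regroups the per-expert quantities $f_j$ and $P_j$. A small sketch, assuming those per-expert values have already been computed and that `device_groups` (a hypothetical name) lists the expert indices hosted on each device:

```python
def device_balance_loss(f, p, device_groups, alpha2):
    """Device-level balance loss from per-expert f_j and P_j."""
    loss = 0.0
    for group in device_groups:
        f_dev = sum(f[j] for j in group) / len(group)   # mean frequency on the device
        p_dev = sum(p[j] for j in group)                # total affinity on the device
        loss += f_dev * p_dev
    return alpha2 * loss

groups = [[0, 1], [2, 3]]    # 4 routed experts on 2 devices

balanced_dev = device_balance_loss(
    f=[1.0, 1.0, 1.0, 1.0], p=[0.25, 0.25, 0.25, 0.25],
    device_groups=groups, alpha2=1.0,
)
skewed_dev = device_balance_loss(
    f=[4.0, 0.0, 0.0, 0.0], p=[0.7, 0.1, 0.1, 0.1],
    device_groups=groups, alpha2=1.0,
)
print(balanced_dev, skewed_dev)
```

As with the expert-level loss, the value rises when most of the traffic lands on one device (1.6 versus 1.0 in this toy case).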

Communication Balance Loss

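The communication balance loss keeps the communication load of each device balanced. Device-limited routing already bounds how much each device sends, but if one device receives more tokens than the others, practical communication efficiency still suffers. To mitigate this, DeepSeek uses a communication balance loss during training, computed as: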

\begin{align*} \mathcal{L}_{\mathrm{CommBal}} &= \alpha_{3} \sum_{i=1}^{D} f''_{i}\,P''_{i}, \\[1ex] f''_{i} &= \frac{D}{M\,T} \sum_{t=1}^{T} \mathbf{1}\bigl(\text{Token }t\text{ is sent to Device }i\bigr), \\[1ex] P''_{i} &= \sum_{\,j \in \mathcal{E}_{i}} P_{j}. \end{align*}

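Here $\alpha_3$ is a hyperparameter. Device-limited routing ensures that each device transmits at most $MT$ hidden states to other devices, and the communication balance loss encourages each device to receive about $MT$ hidden states from other devices.

Besides these losses, DeepSeek-V2 also uses a token-dropping strategy during training. According to a GitHub issue, DeepSeek-V3 does not appear to use that strategy.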

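Pre-training: Compared with DeepSeek 67B, the DeepSeek-V2 training set contains more Chinese data, and the data-filtering pipeline was improved, including the removal of contentious content. DeepSeek-V2 uses the same tokenizer as DeepSeek 67B, with a vocabulary size of 100K. The pre-training corpus has about 8.1T tokens, with about 12% more Chinese than English tokens. The paper reports a training cost of about 172.8K GPU hours per trillion tokens.
Long context: After adapting the model with YaRN, it was trained for an additional 1000 steps on 32K-length data with a batch size of 576. The paper reports that, although this training used 32K sequences, the model still does well on the 128K needle-in-a-haystack test: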

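SFT: The model was fine-tuned for 2 epochs on a dataset of 1.5M training instances.
RLHF: GRPO is used to reduce the cost of RL training. Its main change is to replace the advantage in PPO with $\hat A_{i,t} = \frac {r_i - \mathrm{mean}(r)}{\mathrm{std}(r)}$, so no value model is needed during RLHF. The paper gives the full algorithm.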
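The group-relative advantage is simple enough to write out. The sketch below normalizes the rewards of a group of sampled responses to the same prompt; the small `eps` and the use of the population standard deviation are assumptions of this example, not details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled response relative to its own group.

    eps guards against a zero standard deviation when all rewards are equal
    (an assumption of this sketch).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four responses sampled for one prompt, scored 1 (good) or 0 (bad).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Every token of response $i$ receives the same advantage $\hat A_{i,t}$, so the baseline comes from the other samples in the group instead of a learned value model.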

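RLHF training proceeds in 2 stages: reasoning alignment first, then human preference alignment.

Inference efficiency: DeepSeek-V2 uses several strategies to improve inference performance:
- Converting the parameters to FP8
- KV cache quantization

With DeepSeek-V2 deployed this way, a single node with 8 H800 GPUs achieves a throughput of about 50K tokens per second.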

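Discussion points raised in the DeepSeek-V2 paper

SFT data quantity and quality
- Experiments show that using fewer than 10K SFT instances causes a large drop on the IFEval benchmark.
- Larger models need less data, but still enough to acquire specific skills.
- High-quality SFT data matters especially for writing and open-ended question answering.
Alignment tax
- RLHF can markedly improve open-ended generation but may hurt standard benchmarks such as BBH.
- Careful data processing and improved training strategies bring the two to an acceptable trade-off.
- How to align with preferences without sacrificing overall performance deserves further study.
Advantages of online reinforcement learning
- Experiments found online RL to be clearly better than offline RL, so an online RL framework was built for DeepSeek-V2.
- The relative merits of online and offline preference alignment may vary by scenario, and a more thorough comparison is left for later.

For more training details, see the DeepSeek-V2 paper.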

Deepseek V3

Links: DeepSeek-V3 Technical Report, Hugging Face space, Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

Architecture

The base architecture is close to DeepSeek-V2: it keeps DeepSeek-V2's MLA and DeepSeekMoE. Unlike DeepSeek-V2, it uses an auxiliary-loss-free load balancing strategy.
The auxiliary-loss-free load balancing strategy adds a bias term $b_i$ when selecting experts:

g'_{i,t} \;=\; \begin{cases} s_{i,t}, & s_{i,t} + b_{i} \,\in\, \mathrm{Topk}\bigl(\{\,s_{j,t} + b_{j} \mid 1 \le j \le N_r\},\,K_r\bigr), \\[1ex] 0, & \text{otherwise.} \end{cases}
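A small sketch of the selection rule: the bias only influences which experts are picked, while the gate value of a picked expert is still its original score. The numbers and the function name are made up for illustration.

```python
def biased_topk_gates(scores, bias, k_r):
    """Select the top-K_r experts by s_i + b_i, but gate with the unbiased s_i."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = set(ranked[:k_r])
    return [scores[i] if i in chosen else 0.0 for i in range(len(scores))]

scores = [0.5, 0.3, 0.15, 0.05]

# Without bias the two highest-scoring experts win.
plain = biased_topk_gates(scores, bias=[0.0, 0.0, 0.0, 0.0], k_r=2)

# A negative bias on the busy expert 0 and a positive bias on the idle expert 3
# change the selection, but not the gate values of the selected experts.
adjusted = biased_topk_gates(scores, bias=[-0.3, 0.0, 0.0, 0.2], k_r=2)
print(plain, adjusted)
```

Because $b_i$ never enters the gate value, the load can be steered without adding a gradient term to the training loss.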

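The paper also gives comparison data.

A Multi-Token Prediction (MTP) objective is introduced

As shown in the figure above, during training several MTP modules predict additional future tokens; the k-th MTP module predicts tokens starting from position $t+k+1$. For a sequence of T tokens, MTP Module 1 predicts tokens 3 through T and MTP Module 2 predicts tokens 4 through T.

The extra training objective is the average of the cross-entropy losses of the MTP modules, weighted by a hyperparameter $\lambda$: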

\mathcal{L}_{\mathrm{MTP}} = \frac{\lambda}{D} \sum_{k=1}^{D} \mathcal{L}_{\mathrm{MTP}}^{k}.
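The indexing and the averaging can be made concrete with a short sketch. Positions are 1-based, the per-module loss values are placeholders, and both function names are hypothetical.

```python
def mtp_target_positions(seq_len, depth_k):
    """1-based positions of the tokens that MTP module k is trained to predict."""
    return list(range(depth_k + 2, seq_len + 1))

def mtp_loss(per_module_losses, lam):
    """L_MTP = lambda / D * sum of the per-module cross-entropy losses."""
    return lam / len(per_module_losses) * sum(per_module_losses)

targets_1 = mtp_target_positions(seq_len=6, depth_k=1)   # module 1: tokens 3..T
targets_2 = mtp_target_positions(seq_len=6, depth_k=2)   # module 2: tokens 4..T
total = mtp_loss([2.0, 3.0], lam=0.3)                    # placeholder loss values
print(targets_1, targets_2, total)
```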

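The paper also reports comparison tests for MTP.

Training strategy

DeepSeek-V3 was trained on 2048 H800 GPUs. Each node has 8 H800 GPUs connected by NVLink and NVSwitch, and nodes communicate over InfiniBand. Some notable training techniques:

- DualPipe is used.
- Efficient cross-node all-to-all communication kernels were developed to fully use IB and NVLink bandwidth and to save the Streaming Multiprocessors dedicated to communication.
- With a series of memory-management optimizations, DeepSeek-V3 can be trained at scale without Tensor Parallelism (TP).
- FP8 mixed-precision training: most matrix multiplications take FP8 inputs and produce BF16 or FP32 outputs. The embedding, output head, MoE gating modules, normalization operators, and attention operators keep their original precision (BF16 or FP32).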

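Fine-Grained Quantization:

As shown in the figure below (this part includes an explanation generated by GPT):

Block-wise quantization: inputs are grouped and scaled in 1x128 tiles, and weights are grouped and scaled in 128x128 blocks. For each block, the maximum absolute value is computed first and a scaling factor $N$ is derived from it; every real value is then divided by $N$ and rounded to INT8 / FP8, giving $S$.

W_i = N_i \ast S_i,\quad W_a = N_a \ast S_a,

where $\ast$ denotes element-wise scaling (left-multiplication by a diagonal or block-diagonal matrix; the figure uses colors to tell them apart).

Low-precision GEMM (Tensor Core): only the quantized parts $S_i$ and $S_a$ are fed to the Tensor Cores for matrix multiplication: green × yellow ⇒ purple low-precision accumulation. This step is equivalent to

\underbrace{S_i S_a^{\top}}_{\text{INT/FP8 multiply + low-precision accumulation}}

High-precision scaling (CUDA Core):

The low-precision result is then multiplied by the scaling factors on both sides:

W_i W_a^{\top} = (N_i S_i)(N_a S_a)^{\top} = N_i \bigl(S_i S_a^{\top}\bigr) N_a^{\top}.

The "Output" box at the bottom of the figure: the low-precision accumulation result is produced first, then the blue $N_i$ and pink $N_a$ are multiplied back in on the CUDA Cores to give the FP32 output.
Because $N_i, N_a$ are block-diagonal or vectors, a single row-wise or column-wise multiplication suffices, so the extra compute is small.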
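The effect of per-block scaling can be illustrated on a single dot product. This is a toy simulation under loud assumptions: integer rounding to the range [-127, 127] stands in for the FP8 cast, the block size is 4 instead of 128, and the function names are invented for this note.

```python
def quantize_block(values, q_max=127):
    """Return (scale N, quantized values S) for one block, using max-abs scaling."""
    amax = max(abs(v) for v in values)
    scale = amax / q_max if amax > 0 else 1.0
    return scale, [round(v / scale) for v in values]

def blockwise_dot(x, w, block):
    """Dot product with per-block quantization along the shared dimension."""
    total = 0.0
    for start in range(0, len(x), block):
        n_x, s_x = quantize_block(x[start:start + block])
        n_w, s_w = quantize_block(w[start:start + block])
        low_precision = sum(a * b for a, b in zip(s_x, s_w))   # "Tensor Core" part
        total += n_x * n_w * low_precision                      # rescale in high precision
    return total

# The second half of x is far larger than the first half (an outlier block).
x = [0.1, -0.2, 0.3, 0.05, 50.0, -20.0, 10.0, 5.0]
w = [1.0, 2.0, -1.0, 0.5, 0.01, 0.02, -0.03, 0.04]

exact = sum(a * b for a, b in zip(x, w))
fine = blockwise_dot(x, w, block=4)     # one scale per block of 4
coarse = blockwise_dot(x, w, block=8)   # one scale for the whole vector
print(exact, fine, coarse)
```

With a single scale for the whole vector, the outlier block forces the small values to round to 0 or ±1 and the result drifts noticeably; per-block scales keep the error small, which is the motivation for the fine-grained scheme.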

Pre-training

Data size: 14.8T tokens.

Long Context Extension

YaRN is used for two additional training phases of 1000 steps each, extending the context window from 4K to 32K and then to 128K.

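Post-training

SFT -

Reinforcement learning - V3 uses two kinds of reward model: a rule-based RM and a model-based RM.

Rule-based RM: for inputs with a definite answer, such as some math problems and LeetCode problems, feedback can be produced by fixed rules.
Model-based RM:
- Questions with a ground-truth answer: the reward model judges whether the generated answer matches the expected ground truth.
- Questions without a single standard answer (such as creative writing): the reward model gives a feedback score based on the question and the corresponding answer.
- Source of the reward model: it is trained from the DeepSeek-V3 SFT checkpoint.
- Improving reward-model reliability: the preference data contains not only the final reward but also the chain-of-thought that leads to it. This reduces the risk of reward hacking on specific tasks, where the model games the reward instead of solving the task.
Group Relative Policy Optimization: V3 also uses GRPO.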

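Exploration

Distillation from DeepSeek-R1

Distilling from DeepSeek-R1 improves the quality of the model's answers, but the responses also become longer.

Self-rewarding

In the RL stage, the constitutional AI approach is used so that the model provides feedback on itself.

Multi-token prediction evaluation

With MTP the model can predict 2 tokens in one forward pass at inference time. Evaluation shows that the acceptance rate of the second token is between 85% and 90%. When the acceptance rate is high, MTP speeds up DeepSeek-V3 decoding by about 1.8x.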
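A back-of-the-envelope check of that speedup, assuming one extra token is proposed per forward pass and ignoring the overhead of the MTP module and of verification (both simplifications are assumptions of this sketch):

```python
def expected_tokens_per_step(acceptance_rate):
    """One token is always produced; the second is kept with the given probability."""
    return 1.0 + acceptance_rate

low = expected_tokens_per_step(0.85)
high = expected_tokens_per_step(0.90)
print(low, high)
```

This gives 1.85 to 1.90 tokens per forward pass as an idealized upper bound, which is consistent with the reported speedup of about 1.8x once overhead is taken into account.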

Deepseek R1

OpenAI o1

https://openai.com/index/learning-to-reason-with-llms/

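Finding from o1: "Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process."

Model quality depends on both the amount of reinforcement learning training and the time spent thinking at inference.

o1 shows a very pronounced reflection step and can recognize errors in its earlier reasoning.

DeepSeek series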

https://api-docs.deepseek.com/zh-cn/news/news250120

https://huggingface.co/deepseek-ai

deepseek-Prover

deepseek-VL2

deepseek-R1

deepseek v2.5 Vision
