Deepseek 相关模型整理(2025-2026)

Kevin 吴嘉文大约 5 分钟

DeepSeek-V3.1

671B-A37B 模型，基于 DeepSeek-V3 架构继续训练和后训练得到的 hybrid model：同一个模型可以通过 chat template 切换 “thinking mode” 和 “non-thinking mode”。

thinking efficiency 更高，long-context extension 更强

Deepseek-V3.2

25 年 10 月左右的模型，huggingfaceopen in new window，arxivopen in new window

V3.2 在 V3.1-Terminus 基础上，引入 DeepSeek Sparse Attention，并进一步强化 reasoning 与 agent 能力

DeepSeek Sparse Attention ：V3.2-Exp 引入 DSA，是为了在 long context 下实现 faster and more efficient training & inference

V3.2 唯一的架构变化就是通过 continued training 引入了 DeepSeek Sparse Attention 。

普通 attention 是每个 token 看所有历史 token，所以长上下文下成本很高。DSA 的做法是：

用 lightning indexer 给历史 token 打分；
选出 top-k 个重要 key-value entries；
只对这些被选中的 token 做主 attention。

因此，主 attention 的复杂度从 $O(L^2)$ 降到 $O(Lk)$ 。

DeepSeek 没有从头训练一个 DSA 模型，而是从 DeepSeek-V3.1-Terminus checkpoint 出发，先进行 continued pre-training，再做 post-training。continued pre-training 又分两步：

阶段	做什么	目的
Dense Warm-up	冻结主模型，只训练 indexer	让 indexer 学会模仿 dense attention
Sparse Training	启用 top-k 稀疏选择，训练全模型	让模型适应 DSA 稀疏模式

这个训练流程的设计很重要，因为如果直接把 dense attention 换成 sparse attention，模型性能可能会明显掉。DeepSeek 用 warm-up + sparse training 的方式平滑过渡。

Post-Training：Specialist Distillation

先 RL 训练多个“专家模型”，再让这些专家模型生成高质量数据，最后用这些数据训练统一模型。

Post-Training：Mixed RL Training

RL 训练时候把三类目标合并：

训练类型	目标
reasoning RL	提升数学、代码、逻辑推理
agent RL	提升工具调用和多步任务执行
human alignment RL	提升回答质量、偏好对齐、通用帮助性

Post-Training：Scalable RL

V3.2 的一个核心观点是： 模型能力的提升不只来自 pre-training，也越来越依赖 post-training compute 的扩展 。

论文中提到，DeepSeek-V3.2 建立了更稳定、可扩展的 RL protocol，并且 post-training 阶段的计算预算超过了 pre-training cost 的 10%。这说明 DeepSeek 把 RL 不再看作简单的对齐阶段，而是看作继续提升 reasoning / agent 能力的重要 scaling 阶段。

这里使用的核心 RL 方法仍然是 GRPO ，但 V3.2 重点不是提出一个全新的 RL 算法，而是解决大规模 RL 训练中的稳定性问题。

主要包括：

技巧	解决的问题
Unbiased KL Estimate	让 KL 约束更稳定，避免训练中梯度异常
Off-policy Sequence Masking	避免过旧、偏离当前 policy 的负样本误导训练
Keep Routing	MoE 模型中保持 sampling 和 training 时专家路由一致
Keep Sampling Mask	保持采样阶段和训练阶段的 token 候选空间一致

这些技巧说明，V3.2 的 RL scaling 重点不是“换一个算法”，而是让 GRPO 能够在更大规模、更复杂任务上稳定运行。

Thinking in Tool-Use：把推理接入工具调用

V3.2 的另一个重点是： 不是单纯让模型会 function calling，而是让模型在 tool-use 过程中保留 reasoning 能力 。

在普通多轮工具调用中，如果每次工具返回后模型都丢掉之前的 reasoning，那么模型就需要反复重新思考，导致 token 浪费和 trajectory 变长。

V3.2 的做法是：

如果只是 tool output 加入上下文，就保留之前的 reasoning；
只有当新的 user message 出现时，才丢弃历史 reasoning；
即使丢弃 reasoning，也保留 tool call 和 tool result 历史。

这个设计的意义是： agent 不只是“调用工具”，而是要记住自己为什么调用工具，以及下一步该怎么利用工具结果。

Large-Scale Agentic Task Synthesis

V3.2 提出了大规模 agentic task synthesis pipeline，用来生成复杂工具调用任务。论文提到，它们生成了超过 1,800 个不同环境 和 85,000 个复杂 prompts ，用于推动 agent RL 训练。

这些任务覆盖多个方向：

Agent 类型	训练目标
Search Agent	学会搜索、验证信息、整合答案
Code Agent	学会修复代码、运行测试、处理真实软件工程任务
Code Interpreter Agent	学会在数学、数据分析、逻辑任务中调用代码
General Agent	学会在合成环境中完成复杂多步任务

这里最重要的思想是：

agent 数据不只是人工写 prompt，而是通过自动化 pipeline 生成环境、工具、任务和 verifier。

也就是说，DeepSeek 把 agent training 变成了一个可以 scale 的系统。

DeepSeek V4

huggingfaceopen in new window

官方发布了两个主要版本：

模型	总参数	激活参数	上下文长度	定位
DeepSeek-V4-Pro	1.6T	49B	1M	高性能主力模型
DeepSeek-V4-Flash	284B	13B	1M	更快、更便宜版本

对比 3.2，引入了 Hybrid Attention Architecture ，在 1M token context 下，V4-Pro 只需要 V3.2 约 27% 的 single-token inference FLOPs，以及 10% 的 KV cache。

CSA：Compressed Sparse Attention

CSA 会先把 KV entries 在 sequence 维度上压缩，然后再让 lightning indexer 选择重要的 compressed blocks。Hugging Face 官方博客解释说，CSA 大致是把 token 按块压缩，例如 4x 压缩，再在压缩后的 blocks 上做 sparse selection。

HCA：Heavily Compressed Attention

HCA 则是更激进的压缩。官方博客解释说，它会把 KV entries 大幅压缩，例如 128x，然后让 query dense attend 到这些 compressed blocks。因为压缩后序列已经很短，所以 dense attention 成本可以接受。

mHC：Manifold-Constrained Hyper-Connections

用于增强传统 residual connections，改进残差连接的方法，目标是让超大 MoE 模型训练更稳定。

使用了 Muon optimizer 训练。

Deepseek 相关模型整理(2025-2026)

# DeepSeek-V3.1

# Deepseek-V3.2

# DeepSeek V4

DeepSeek-V3.1

Deepseek-V3.2

DeepSeek V4