Qwen3.5-MoE Multimodal Large Model: An In-Depth Architecture Analysis

Document version: v1.0
Analysis date: 2026-02-22
Analysis sources: config.json + quant_model_weights.safetensors.index.json
Architecture identifier: Qwen3_5MoeForConditionalGeneration

1. Model Overview

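Dimension	Value
Architecture type	`Qwen3_5MoeForConditionalGeneration`
Model category	Multimodal (Vision-Language) MoE
Total weight size	~420.7 GB (after quantization)
Shard files	99 safetensors files
Weight entries	279,374
Context window	262,144 tokens (256K)
Vocabulary size	248,320
Precision	bfloat16 (some components quantized to lower precision)
Transformers version	4.57.0.dev0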

1.1 The Four Modules

┌─────────────────────────────────────────────────────────┐
│                       Qwen3.5-MoE                       │
├──────────────┬──────────────┬──────────┬────────────────┤
│ Vision       │ Language     │ MTP      │ LM Head        │
│ Encoder      │ Model        │ Module   │                │
│ (27-layer    │ (60-layer    │ (1-layer │ (linear        │
│ ViT)         │ Hybrid-MoE)  │ MoE)     │ projection)    │
├──────────────┴──────────────┴──────────┴────────────────┤
│ Shards: Vision=1 | LM=96 | MTP=4 | LM Head=1            │
└─────────────────────────────────────────────────────────┘
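A per-module shard count like the one in the diagram can be derived from the `weight_map` section of the safetensors index file. The sketch below is illustrative only: the name prefixes in `MODULE_PREFIXES` are assumptions and must be replaced with the prefixes actually present in the index. Because one shard file can hold tensors from more than one module, the per-module counts may add up to more than the number of files (here 1 + 96 + 4 + 1 = 102 against 99 files).

```python
import json
from collections import defaultdict

# Hypothetical tensor-name prefixes; read the real ones from the index file.
MODULE_PREFIXES = {
    "vision": "model.visual.",
    "language_model": "model.language_model.",
    "mtp": "mtp.",
    "lm_head": "lm_head.",
}


def load_weight_map(index_path):
    """Read the `weight_map` (tensor name -> shard file) from a safetensors index."""
    with open(index_path, encoding="utf-8") as fh:
        return json.load(fh)["weight_map"]


def shards_per_module(weight_map, prefixes=MODULE_PREFIXES):
    """Return {module: sorted list of shard files that hold its tensors}."""
    shards = defaultdict(set)
    for tensor_name, shard_file in weight_map.items():
        for module, prefix in prefixes.items():
            if tensor_name.startswith(prefix):
                shards[module].add(shard_file)
                break
    return {module: sorted(files) for module, files in shards.items()}


# Toy weight map: the first shard is shared by two modules.
toy_map = {
    "model.visual.blocks.0.norm1.weight": "shard-1.safetensors",
    "model.language_model.layers.0.input_layernorm.weight": "shard-1.safetensors",
    "model.language_model.layers.1.input_layernorm.weight": "shard-2.safetensors",
    "lm_head.weight": "shard-3.safetensors",
}
toy_counts = {m: len(f) for m, f in shards_per_module(toy_map).items()}
```

In the toy example the three files yield counts of 1, 2, and 1, which sum to 4 because `shard-1.safetensors` is counted under both the vision and the language modules.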

2. Vision Encoder

The vision encoder is based on the ViT (Vision Transformer) architecture and encodes images and video frames into sequences of visual tokens.

2.1 Core Parameters

Parameter	Value	Description
depth	27	Number of Transformer blocks

Token	ID	Purpose
`vision_start`	248053	Marks the start of a vision sequence
`vision_end`	248054	Marks the end of a vision sequence
`image_token`	248056	Image placeholder token
`video_token`	248057	Video placeholder token
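As a minimal sketch of how these IDs might be used, the function below scans a list of token IDs for vision spans. It assumes that placeholder tokens sit between a `vision_start` and a `vision_end` marker, as in earlier Qwen-VL prompt layouts; the source config does not state this layout, and `vision_spans` is a hypothetical helper.

```python
VISION_START, VISION_END = 248053, 248054
IMAGE_TOKEN, VIDEO_TOKEN = 248056, 248057


def vision_spans(input_ids):
    """Return (start_index, end_index, kind) for each vision_start..vision_end span."""
    spans = []
    start = None
    for i, tok in enumerate(input_ids):
        if tok == VISION_START:
            start = i
        elif tok == VISION_END and start is not None:
            inner = input_ids[start + 1:i]
            if VIDEO_TOKEN in inner:
                kind = "video"
            elif IMAGE_TOKEN in inner:
                kind = "image"
            else:
                kind = "empty"
            spans.append((start, i, kind))
            start = None
    return spans


# Two text tokens around one image span with two placeholder tokens.
example_ids = [1, VISION_START, IMAGE_TOKEN, IMAGE_TOKEN, VISION_END, 2]
example_spans = vision_spans(example_ids)
```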

Parameter	Value	Description
num_hidden_layers	60	Total number of layers
hidden_size	4096	Hidden dimension
vocab_size	248,320	Vocabulary size
max_position_embeddings	262,144	Maximum position (256K)
rms_norm_eps	1e-6	RMSNorm epsilon
hidden_act	silu	FFN activation function
tie_word_embeddings	false	Embedding and LM Head weights are not tied

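Parameter	Value	Description
linear_key_head_dim	128	Key head dimension
linear_num_key_heads	16	Number of key heads (shared-KV structure: 16 key heads serve 64 value heads)
linear_value_head_dim	128	Value head dimension
linear_num_value_heads	64	Number of value heads
linear_conv_kernel_dim	4	1D convolution kernel size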

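Weight name	Description
`in_proj_qkv.weight`	Fused QKV input projection
`in_proj_z.weight`	Gating projection Z
`in_proj_a.weight`	Projection for SSM parameter A
`in_proj_b.weight`	Projection for SSM parameter B
`A_log`	State-transition matrix (stored in log space)
`dt_bias`	Bias for the time step Δ
`conv1d.weight`	Local convolution (kernel=4)
`norm.weight`	Normalization
`out_proj.weight`	Output projection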
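To illustrate what a kernel-4 local convolution does, here is a minimal single-channel sketch. It assumes the convolution is causal (each output sees only the current and the three preceding positions), as is usual in Mamba-style blocks; the source only gives the kernel size, so this is an assumption rather than the model's confirmed behavior.

```python
def causal_conv1d(x, kernel):
    """Causal 1D convolution over one channel.

    x: sequence of floats. kernel: K taps, where kernel[-1] multiplies the
    current step and kernel[0] the step K-1 positions in the past.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # left-pad so no future step is visible
    return [
        sum(w * padded[t + j] for j, w in enumerate(kernel))
        for t in range(len(x))
    ]


# With all taps equal to 1, each output is the sum of the last 4 inputs.
window_sums = causal_conv1d([1, 2, 3, 4, 5], [1, 1, 1, 1])
```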

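Parameter	Value	Description
num_attention_heads	32	Number of Q heads
num_key_value_heads	2	Number of KV heads (GQA ratio 16:1)
head_dim	256	Dimension per head
Total Q dimension	32 × 256 = 8192	num_attention_heads × head_dim
Total KV dimension	2 × 256 = 512	num_key_value_heads × head_dim
attn_output_gate	true	Output gating enabled
attention_bias	false	No attention bias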
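The small KV dimension matters most for cache size at long context. The arithmetic below is a rough sketch that assumes a bfloat16 KV cache (2 bytes per element, an assumption not stated in the config) and counts only the 15 full self-attention layers listed in the parameter estimate.

```python
NUM_Q_HEADS, NUM_KV_HEADS, HEAD_DIM = 32, 2, 256
FULL_ATTN_LAYERS = 15   # layers that use full self-attention
CONTEXT = 262_144       # max_position_embeddings
BYTES_PER_ELEM = 2      # assumption: bfloat16 cache

q_dim = NUM_Q_HEADS * HEAD_DIM            # 8192
kv_dim = NUM_KV_HEADS * HEAD_DIM          # 512
gqa_ratio = NUM_Q_HEADS // NUM_KV_HEADS   # 16 query heads per KV head

# K and V are both cached, hence the factor 2.
kv_bytes_per_token = 2 * kv_dim * BYTES_PER_ELEM * FULL_ATTN_LAYERS
kv_bytes_full_context = kv_bytes_per_token * CONTEXT
kv_gib_full_context = kv_bytes_full_context / 2**30

print(kv_bytes_per_token)    # 30720
print(kv_gib_full_context)   # 7.5
```

Under these assumptions a full 256K-token sequence needs about 7.5 GiB of KV cache.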

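Parameter	Value	Description
rope_type	default	Standard RoPE
rope_theta	10,000,000	Frequency base (10M, to support very long contexts)
partial_rotary_factor	0.25	Rotation is applied to only 25% of each head's dimensions
Rotary dimension	256 × 0.25 = 64	Dimensions that actually take part in RoPE
mrope_interleaved	true	Interleaved multimodal RoPE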
mrope_section	[11, 11, 10]	Allocation across the temporal, height, and width axes (11 + 11 + 10 = 32 frequency pairs)
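The relationship between these values can be checked with a short sketch. It uses the standard RoPE inverse-frequency formula, `theta ** (-2i / d)`, over the rotary sub-dimension only; `rotary_inv_freq` is an illustrative helper, not a function from the model code, and the interleaving of the sections is not modeled here.

```python
def rotary_inv_freq(head_dim=256, partial_rotary_factor=0.25, rope_theta=10_000_000.0):
    """Inverse frequencies for the rotated part of each head (standard RoPE)."""
    rotary_dim = int(head_dim * partial_rotary_factor)  # 64 of 256 dimensions
    return [rope_theta ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)]


inv_freq = rotary_inv_freq()
mrope_section = [11, 11, 10]

# Each frequency rotates one pair of dimensions, so 64 rotary dimensions give
# 32 frequencies, which the three sections partition among the position axes.
num_pairs = len(inv_freq)
sections_cover_all_pairs = sum(mrope_section) == num_pairs
```

The highest frequency is 1.0 and the lowest is close to 1 / `rope_theta`, which is what lets the rotation stay distinguishable across very long sequences.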

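Parameter	Value	Description
num_experts	512	Total number of experts
num_experts_per_tok	10	Experts activated per token
moe_intermediate_size	1024	Expert intermediate dimension
shared_expert_intermediate_size	1024	Shared expert intermediate dimension
router_aux_loss_coef	0.001	Router auxiliary loss coefficient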
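Top-10 routing over 512 experts can be sketched as follows. This is a generic top-k router, not the model's confirmed implementation: the softmax over all experts followed by renormalization over the selected ones is an assumption, and `route_top_k` is a hypothetical name.

```python
import math


def route_top_k(router_logits, k=10):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# 512 toy logits; experts whose index satisfies i % 7 == 6 score highest.
toy_logits = [float(i % 7) for i in range(512)]
selected = route_top_k(toy_logits, k=10)
```

Each token's FFN output would then be the weighted sum of the 10 selected experts' outputs plus the always-active shared expert.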

Component	Parameter count
512 Experts	512 × 3 × 4096 × 1024 = 6.44B
Shared Expert	3 × 4096 × 1024 = 12.6M
Router Gate	4096 × 512 = 2.1M
MoE subtotal per layer	~6.46B
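The per-layer figures follow directly from the config values. The sketch below reproduces them, assuming each expert consists of three bias-free projections (`gate_proj`, `up_proj`, `down_proj`) of shape 4096 × 1024.

```python
HIDDEN = 4096
MOE_INTERMEDIATE = 1024
SHARED_INTERMEDIATE = 1024
NUM_EXPERTS = 512

params_per_expert = 3 * HIDDEN * MOE_INTERMEDIATE    # gate, up, down projections
routed_experts = NUM_EXPERTS * params_per_expert
shared_expert = 3 * HIDDEN * SHARED_INTERMEDIATE
router_gate = HIDDEN * NUM_EXPERTS
moe_per_layer = routed_experts + shared_expert + router_gate

print(f"{routed_experts:,}")   # 6,442,450,944
print(f"{shared_expert:,}")    # 12,582,912
print(f"{router_gate:,}")      # 2,097,152
print(f"{moe_per_layer:,}")    # 6,457,131,008
```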

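Component	Quantized?	Notes
MoE Expert FFN (`gate/up/down_proj`)	Yes	Stored with `weight_scale` + `weight_offset`
Self-Attention QKV (`q/k/v_proj`)	Yes	Stored with `weight_scale` + `weight_offset`
Self-Attention Output (`o_proj`)	No	Kept at full precision
Linear Attention (all weights)	No	The SSM is sensitive to precision
Shared Expert	No	Always active, so precision is preserved
Router Gate	No	Routing precision directly affects expert selection
RMSNorm	No	Kept at full precision
Embedding / LM Head	No	Kept at full precision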
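A table like this can be derived from the tensor names in the index file: a module counts as quantized when its weight is accompanied by both a scale and an offset tensor. The sketch assumes the companions are named `<module>.weight_scale` and `<module>.weight_offset`; the module paths in the example are made up for illustration.

```python
def quantized_modules(tensor_names):
    """Return module paths whose `.weight` has both `_scale` and `_offset` companions."""
    names = set(tensor_names)
    suffix = ".weight"
    found = []
    for name in names:
        if not name.endswith(suffix):
            continue
        if name + "_scale" in names and name + "_offset" in names:
            found.append(name[: -len(suffix)])
    return sorted(found)


# Hypothetical tensor names: one quantized expert projection, one full-precision o_proj.
toy_names = [
    "layers.0.mlp.experts.0.up_proj.weight",
    "layers.0.mlp.experts.0.up_proj.weight_scale",
    "layers.0.mlp.experts.0.up_proj.weight_offset",
    "layers.0.self_attn.o_proj.weight",
]
toy_quantized = quantized_modules(toy_names)
```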

Parameter	Value	Description
mtp_num_hidden_layers	1	Number of MTP Transformer layers
mtp_use_dedicated_embeddings	false	Reuses the main model's embedding

┌─────────────────┐     ┌──────────────────┐
│ Embedding of    │     │ Hidden state from│
│ current token   │     │ last LM layer    │
└────────┬────────┘     └────────┬─────────┘
         │                       │
         ▼                       ▼
┌─────────────────┐     ┌──────────────────┐
│pre_fc_norm_     │     │pre_fc_norm_      │
│embedding        │     │hidden            │
│(RMSNorm)        │     │(RMSNorm)         │
└────────┬────────┘     └────────┬─────────┘
         │                       │
         └───────────┬───────────┘
                     │ concat / combine
                     ▼
            ┌─────────────────┐
            │ fc.weight       │
            │ (fusion proj.)  │
            └────────┬────────┘
                     │
                     ▼
         ┌───────────────────────┐
         │ MTP Transformer       │
         │ Layer 0               │
         │ ┌───────────────────┐ │
         │ │ input_layernorm   │ │
         │ ├───────────────────┤ │
         │ │ Self-Attention    │ │
         │ │ (q/k/v/o_proj +   │ │
         │ │ q_norm, k_norm)   │ │
         │ ├───────────────────┤ │
         │ │post_attn_layernorm│ │
         │ ├───────────────────┤ │
         │ │ MoE FFN           │ │
         │ │ ├ Gate (→512)     │ │
         │ │ ├ Expert ×512     │ │
         │ │ ├ Shared Expert   │ │
         │ │ └ Shared Gate     │ │
         │ └───────────────────┘ │
         └───────────┬───────────┘
                     │
                     ▼
            ┌─────────────────┐
            │ mtp.norm        │
            │ (RMSNorm)       │
            └────────┬────────┘
                     │
                     ▼
            ┌─────────────────┐
            │ lm_head (shared)│
            │ predicts the    │
            │ next-next token │
            └─────────────────┘

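Feature	Description
Depth	A single Transformer layer, a lightweight design
Attention type	Full Self-Attention (not Linear Attention)
FFN type	MoE identical in structure to the main model (512 experts + shared expert)
Embedding	Reuses the main model's embedding (`mtp_use_dedicated_embeddings=false`)
LM Head	Reuses the main model's `lm_head.weight`
Fusion method	The embedding and the hidden state are each RMS-normalized, then fused through an FC layer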
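The fusion step can be sketched in plain Python. The weight names suggest two separate RMSNorms feeding one FC layer; the sketch assumes the two normalized vectors are concatenated (embedding first) and projected by an FC weight of shape [d, 2d]. Both the concatenation order and the weight shape are assumptions, and `mtp_fuse` is a hypothetical helper.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale x by the reciprocal of its root mean square, then by weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def mtp_fuse(embedding, hidden, w_emb_norm, w_hid_norm, fc_weight):
    """Normalize both inputs, concatenate them, and project back to d dimensions.

    fc_weight: d rows of length 2d (assumed layout).
    """
    fused = rms_norm(embedding, w_emb_norm) + rms_norm(hidden, w_hid_norm)
    return [sum(w * v for w, v in zip(row, fused)) for row in fc_weight]


# Toy example with d = 2: the FC weight passes the normalized embedding through.
d = 2
ones = [1.0] * d
select_embedding = [[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]]
fused_out = mtp_fuse([3.0, 3.0], [5.0, -5.0], ones, ones, select_embedding)
```

With this FC weight the output is just the normalized embedding, roughly `[1.0, 1.0]`, which makes the effect of each stage easy to inspect in isolation.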

Module	Estimated parameters	Notes
Embedding	~1.02B	248,320 × 4,096
LM Head	~1.02B	248,320 × 4,096 (not tied)
LM Layers: MoE FFN	~387.3B	60 × 512 × 3 × 4096 × 1024 + shared
LM Layers: Self-Attn (×15)	~1.07B	15 × (Q+K+V+O+norms)
LM Layers: Linear-Attn (×45)	~several B	45 × SSM parameters
LM Layers: Norms/Router	~0.13B	60 × (2 × layernorm + router gate)
Vision Encoder	~0.3B	27-layer ViT + Merger
MTP	~6.5B	1-layer MoE Transformer
Total parameters (estimated)	~400B+
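The rows with explicit formulas can be evaluated as written. The sketch below does so under simplifying assumptions: projections are bias-free, the self-attention count ignores the small `q_norm`/`k_norm` vectors and any extra parameters introduced by `attn_output_gate`, and each layernorm is a single vector of `hidden_size` elements.

```python
VOCAB, HIDDEN, LAYERS = 248_320, 4096, 60
Q_DIM, KV_DIM = 32 * 256, 2 * 256
FULL_ATTN_LAYERS = 15
EXPERTS, INTERMEDIATE = 512, 1024

embedding = VOCAB * HIDDEN
lm_head = VOCAB * HIDDEN                 # separate tensor: embeddings are not tied

expert = 3 * HIDDEN * INTERMEDIATE
moe_ffn = LAYERS * (EXPERTS + 1) * expert   # routed experts plus one shared expert

attn_per_layer = HIDDEN * Q_DIM + 2 * HIDDEN * KV_DIM + Q_DIM * HIDDEN  # Q, K, V, O
self_attn = FULL_ATTN_LAYERS * attn_per_layer

norms_router = LAYERS * (2 * HIDDEN + HIDDEN * EXPERTS)

for name, value in [("embedding", embedding), ("moe_ffn", moe_ffn),
                    ("self_attn", self_attn), ("norms_router", norms_router)]:
    print(f"{name}: {value / 1e9:.2f}B")
```

The linear-attention, vision encoder, and MTP rows are left out because the table does not give closed-form formulas for them.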

Module	Activated parameters
MoE FFN (10/512 experts)	60 × 10 × 3 × 4096 × 1024 ≈ 7.55B
Shared Expert	60 × 3 × 4096 × 1024 ≈ 0.76B
Attention (average)	~several B
Embedding + LM Head	~2.04B
Activated per token (estimated)	~15-20B
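The rows with explicit formulas sum to a little over 10B, so the ~15-20B estimate attributes the remainder to the attention layers. A short check, following the table's own convention of counting the full Embedding and LM Head as activated:

```python
LAYERS, HIDDEN, INTERMEDIATE = 60, 4096, 1024
EXPERTS, ACTIVE_EXPERTS = 512, 10
VOCAB = 248_320

expert = 3 * HIDDEN * INTERMEDIATE
routed_active = LAYERS * ACTIVE_EXPERTS * expert
shared_active = LAYERS * expert
embed_and_head = 2 * VOCAB * HIDDEN

subtotal_without_attention = routed_active + shared_active + embed_and_head
routed_fraction = ACTIVE_EXPERTS / EXPERTS   # share of routed experts used per token

print(f"{routed_active:,}")               # 7,549,747,200
print(f"{shared_active:,}")               # 754,974,720
print(f"{embed_and_head:,}")              # 2,034,237,440
print(f"{subtotal_without_attention:,}")  # 10,338,959,360
print(f"{routed_fraction:.2%}")           # 1.95%
```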

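Innovation	Description
Hybrid Attention	Mixes Linear Attention (Mamba SSM) and Full Self-Attention at a 3:1 ratio, combining O(n) efficiency with global modeling
Ultra-large-scale MoE	512 experts + shared expert with Top-10 routing in every layer; ~400B total parameters but only ~15-20B activated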
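M-RoPE	Multimodal rotary position embedding with three sections (temporal/height/width), natively encoding spatial and temporal positions for images and video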
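MTP	A DeepSeek-V3-style single-layer multi-token prediction head that strengthens training and speeds up inference through speculative decoding
Partial Rotary	Rotary encoding is applied to only 25% of the dimensions (64/256); the remaining dimensions are left unrotated and learn freely, balancing position awareness against semantic expressiveness
Selective quantization	Only the Expert FFN and Self-Attn QKV are quantized; the SSM, Shared Expert, Router, and other critical components keep full precision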

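Dimension	Qwen3.5-MoE (this model)	DeepSeek-V3	Qwen3-235B
Total parameters	~400B+	671B	235B
Activated parameters	~15-20B	37B	22B
Experts	512	256	128
Activated experts	10	8	8
Attention	Hybrid (SSM+Attn)	Full Attention	Full Attention
MTP	1 layer	1 layer	None
Multimodal	Native (ViT + M-RoPE)	None (text only)	None (text only)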
