A Quick Introduction to the Inference Process of LLAMA-Based Large Language Models (LLMs) | 极客日志

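Summary (AI-generated): Large language model (LLM) inference involves core steps such as tokenization, embedding, the attention mechanism, and autoregressive generation. Using the LLAMA model as an example, this article walks through the Transformer decoder structure, tokenization, the embedding layer, positional encoding, multi-head self-attention (MHA), and RMSNorm normalization. It also describes the complete pipeline from input preprocessing, through the model's forward pass, to post-processing and sampling of the logits, and discusses key concepts such as the KV-Cache optimization, deployment parameter configuration, and the context window, as a technical reference for understanding the low-level inference logic of large models.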

开源信徒 · published 2025/2/7 · updated 2026/6/217 views

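Large-model notes: a quick introduction to the LLM inference process, using LLAMA as an example

This article uses the LLAMA model to give a quick introduction to the inference process of an LLM (large language model). Many of the technical details are general and apply to other LLMs as well. The task of an LLM is to predict the next word (more precisely, the next token) in a sentence after reading the first n words. The output depends on past and present inputs and is independent of future ones.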

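What is an LLM

The original Transformer architecture, on which LLMs are built, consists of two main blocks:

Encoder: receives the input and builds a representation (features) of it. The model is optimized to gain an understanding of the input (for example, judging the sentiment of a text).
Decoder: uses the encoder's representation together with other inputs to generate the target sequence. The model is optimized for generating output.

LLAMA is a decoder-only model: it has only decoder layers. At every step the model's input carries the output of the previous step. This differs from CV models, where a single input pass is enough to obtain the result.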
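The way each output is fed back in as input can be sketched with a toy greedy decoding loop. `toy_next_token` and `TOY_TABLE` are hypothetical stand-ins for a real model's forward pass (a hard-coded lookup on the last token, nothing LLAMA actually does). Only the loop structure is the point here.

```python
from typing import List

# Hypothetical stand-in for a model: maps the last token to the next one.
TOY_TABLE = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "</s>"}


def toy_next_token(tokens: List[str]) -> str:
    """Predict the next token from the tokens seen so far (here: only the last one)."""
    return TOY_TABLE.get(tokens[-1], "</s>")


def generate(prompt: List[str], max_new_tokens: int = 10) -> List[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)  # depends only on past and present tokens
        tokens.append(nxt)            # the output becomes part of the next input
        if nxt == "</s>":             # stop at the end-of-sequence token
            break
    return tokens


print(generate(["<s>"]))
```

A real LLM replaces the lookup with a full forward pass over the token ids, but the outer loop has the same shape.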

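Each of these parts can be used independently, depending on the task:

Encoder-only models: suited to tasks that require understanding the input, such as sentence classification and named entity recognition.
Decoder-only models: suited to generative tasks, such as text generation.
Encoder-decoder models or sequence-to-sequence models: suited to generative tasks that require an input, such as translation or summarization.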

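LLAMA-related concepts

The structure of LLAMA's decoder is taken from the Transformer. Since LLAMA uses only the decoder part, the concepts to focus on are:

Tokenization (the tokenizer)
Embedding (the embedding layer)
Positional Encoding
Self-attention
Multi-head attention (including masked multi-head attention)
Batch Norm & Layer Norm (batch/layer normalization; LLAMA uses RMSNorm)
ResNet (residual connections)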

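Once you have a LLAMA model, the main things to look at are:

The model structure: which operators (ops) it contains, and the model's complexity.
Pre- and post-processing: their implementation details, and how the model is executed.
The model's various parameter settings and other details.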

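Tokenizer, Token, Embedding

This section covers tokenization, encoding, the Tokenizer (tokenization), and Embed (embedding).

What is tokenization?

A tokenizer converts raw text into an initial numerical representation made up of tokens. The tokenizer is an important part of the model because it lets the model cope with the complexity of human language. For example, a tokenizer can break words in agglutinative languages into more manageable components, handle new words, foreign words, or special characters that do not appear in the original corpus, and ensure that the model produces a compact text representation.

Most Transformer-based architectures use a trained tokenizer. Tokenizers such as WordPiece (used by BERT) and SentencePiece (used by T5 or XLM-RoBERTa; the original RoBERTa uses byte-level BPE) also come in several variants.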

Code example

Start with the tokenizer. When running LLAMA we call tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=False).

If the model passed in is a LLAMA model (llama-7b), the object returned is a LlamaTokenizer:

class LlamaTokenizer(PreTrainedTokenizer):
    """
    Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.
    ...

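This class implements the tokenizer for the LLAMA model, based on byte-level Byte-Pair Encoding (BPE). Its main job is to convert text strings into sequences of numbers the model can understand, and vice versa.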
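To make the idea of Byte-Pair Encoding concrete, here is a minimal sketch of how BPE merges could be learned. It is not the LlamaTokenizer implementation: it works on characters rather than bytes, and the corpus and the helper names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Word = Tuple[str, ...]


def most_frequent_pair(words: Dict[Word, int]) -> Tuple[str, str]:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words: Dict[Word, int], pair: Tuple[str, str]) -> Dict[Word, int]:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in words.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(words: Dict[Word, int], num_merges: int):
    """Repeatedly merge the most frequent adjacent pair."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Toy corpus: word (as a tuple of characters) -> frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("new"): 3, tuple("lot"): 1}
merges, words = learn_bpe(corpus, num_merges=2)
print(merges)
```

Each learned merge becomes a new vocabulary entry, so frequent character sequences end up as single tokens while rare words are still representable as smaller pieces.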

>>> input_ids
tensor([[   0,  376, 1366,  338,  263, 3017,  775, 6160]], device='cuda:0')
>>> input_ids.shape
torch.Size([1, 8])
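What happens to these ids next can be sketched as a plain table lookup: every token id selects one row of an embedding table. The sizes below are toy values, and `make_embedding_table` and `embed` are illustrative helpers, not part of any library. In the real model the table is Embedding(32000, 4096), so ids of shape [1, 8] become hidden states of shape [1, 8, 4096].

```python
import random
from typing import List


def make_embedding_table(vocab_size: int, hidden_size: int, seed: int = 0) -> List[List[float]]:
    """Random table standing in for trained embedding weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(vocab_size)]


def embed(input_ids: List[List[int]], table: List[List[float]]) -> List[List[List[float]]]:
    """[batch, seq] token ids -> [batch, seq, hidden] vectors, by row lookup."""
    return [[table[tok] for tok in seq] for seq in input_ids]


table = make_embedding_table(vocab_size=10, hidden_size=4)
hidden = embed([[0, 3, 7]], table)
print(len(hidden), len(hidden[0]), len(hidden[0][0]))  # batch, seq, hidden
```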

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=31999)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
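The printed structure is enough to estimate the model's size. The arithmetic below is a sketch under two assumptions that the printout does not show directly: each LlamaRMSNorm holds a single weight vector of length hidden_size, and LlamaRotaryEmbedding has no trained parameters.

```python
def llama_param_count(vocab_size: int = 32000, hidden: int = 4096,
                      intermediate: int = 11008, layers: int = 32) -> int:
    embed = vocab_size * hidden          # embed_tokens
    attn = 4 * hidden * hidden           # q_proj, k_proj, v_proj, o_proj (bias=False)
    mlp = 3 * hidden * intermediate      # gate_proj, up_proj, down_proj (bias=False)
    norms = 2 * hidden                   # input_layernorm + post_attention_layernorm (assumed)
    per_layer = attn + mlp + norms
    final_norm = hidden                  # model.norm (assumed one weight vector)
    lm_head = hidden * vocab_size        # lm_head (bias=False)
    return embed + layers * per_layer + final_norm + lm_head


print(f"{llama_param_count():,}")  # 6,738,415,616
```

About 6.7 billion parameters, which is where the "7B" in the model name comes from. Nearly all of them sit in the 32 decoder layers.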

{
    "architectures": ["LLaMAForCausalLM"],
    "bos_token_id": 0,
    "eos_token_id": 1,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "intermediate_size": 11008,
    "initializer_range": 0.02,
    "max_sequence_length": 2048,
    "model_type": "llama",
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "pad_token_id": 0,
    "rms_norm_eps": 1e-06,
    "torch_dtype": "float16",
    "transformers_version": "4.27.0.dev0",
    "use_cache": true,
    "vocab_size": 32000
}
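Several of these fields determine how much memory the KV-Cache needs when "use_cache" is true. A rough estimate, assuming the cache stores one key and one value tensor per layer with hidden_size elements per token (the function name and defaults are illustrative):

```python
def kv_cache_bytes(seq_len: int, batch: int = 1, layers: int = 32,
                   hidden: int = 4096, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache K and V for `seq_len` tokens.

    Per layer there are 2 tensors (K and V), each holding
    batch * seq_len * hidden elements; float16 uses 2 bytes per element.
    """
    return 2 * layers * batch * seq_len * hidden * bytes_per_elem


per_token = kv_cache_bytes(seq_len=1)
full_context = kv_cache_bytes(seq_len=2048)
print(per_token, full_context / 2**30)  # 524288 bytes per token, 1.0 GiB at 2048 tokens
```

So with this configuration a single sequence that fills the 2048-token window costs about 1 GiB of cache on top of the weights, and the cost grows linearly with batch size.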

class LlamaModel(LlamaPreTrainedModel):
    def __init__(self, config: LlamaConfig):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
        self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
        self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.gradient_checkpointing = False
        self.post_init()
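The LlamaRMSNorm constructed above normalizes each hidden vector by its root mean square and then rescales it with a learned weight. A minimal pure-Python sketch of that computation on one vector (the real module works on tensors; the function name here is illustrative):

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """RMSNorm: divide by sqrt(mean(x^2) + eps), then scale element-wise by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]


out = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(out)
```

Unlike Layer Norm, there is no mean subtraction and no bias term, which makes it slightly cheaper to compute.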

inputs_tensor, model_input_name, model_kwargs = self._prepare_model_inputs(
    inputs, generation_config.bos_token_id, model_kwargs
)

model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
    inputs_tensor, generation_config.pad_token_id, generation_config.eos_token_id
)

residual = hidden_states
hidden_states = self.input_layernorm(hidden_states)
# Self Attention
hidden_states, self_attn_weights, present_key_value = self.self_attn(
    hidden_states=hidden_states,
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_value=past_key_value,
    output_attentions=output_attentions,
    use_cache=use_cache,
)
hidden_states = residual + hidden_states
# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
outputs = (hidden_states,)
if output_attentions:
    outputs += (self_attn_weights,)
if use_cache:
    outputs += (present_key_value,)
return outputs
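The `self.mlp` call above is, judging by the module printout (gate_proj, up_proj, down_proj, SiLU), a gated feed-forward block. The sketch below assumes the three projections are combined as down(silu(gate(x)) * up(x)). The excerpt itself does not show the MLP's forward method, so treat that composition as an assumption. Weights are plain nested lists and the helper names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply an [out, in] weight matrix by an [in] vector (no bias)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def gated_mlp(x: List[float], w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> List[float]:
    gate = [silu(v) for v in matvec(w_gate, x)]  # hidden -> intermediate, activated
    up = matvec(w_up, x)                         # hidden -> intermediate
    mixed = [g * u for g, u in zip(gate, up)]    # element-wise gating
    return matvec(w_down, mixed)                 # intermediate -> hidden


# Tiny example: hidden size 1, intermediate size 1.
print(gated_mlp([1.0], w_gate=[[1.0]], w_up=[[2.0]], w_down=[[3.0]]))
```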

class LlamaAttention(nn.Module):
    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads  # number of heads, 32 here
        self.head_dim = self.hidden_size // self.num_heads  # size of each head, 128 here
        self.max_position_embeddings = config.max_position_embeddings
        if (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(...)
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
        self.rotary_emb = LlamaRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings)
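The rotary embedding created in the last line encodes position by rotating pairs of query/key components through an angle proportional to the token position. Below is a minimal single-vector sketch. It assumes the pairing of component i with component i + head_dim/2 and a base of 10000, and the helper names are illustrative. The useful property it demonstrates is that the dot product of a rotated query and key depends only on their relative distance.

```python
import math
from typing import List


def rope(x: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate component pairs (i, i + d/2) of x by the angle pos * base**(-2i/d)."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same relative offset (2) at different absolute positions gives the same score.
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```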

def forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_value: Optional[Tuple[torch.Tensor]] = None,
    output_attentions: bool = False,
    use_cache: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
    bsz, q_len, _ = hidden_states.size()
    # Project to Q/K/V and split into heads: [bsz, num_heads, q_len, head_dim]
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
    key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
    value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
    # Total key/value length = new tokens + tokens already in the cache
    kv_seq_len = key_states.shape[-2]
    if past_key_value is not None:
        kv_seq_len += past_key_value[0].shape[-2]
    # Rotary position embedding is applied to Q and K only
    cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
    if past_key_value is not None:
        # KV-Cache: reuse cached K/V and append the new ones along the sequence axis
        key_states = torch.cat([past_key_value[0], key_states], dim=2)
        value_states = torch.cat([past_key_value[1], value_states], dim=2)
    past_key_value = (key_states, value_states) if use_cache else None
    # Scaled dot-product scores: [bsz, num_heads, q_len, kv_seq_len]
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
    if attention_mask is not None:
        # Additive mask: masked positions hold a large negative value
        attn_weights = attn_weights + attention_mask
        attn_weights = torch.max(attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min))
    # Softmax in float32 for numerical stability, then cast back
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
    attn_output = torch.matmul(attn_weights, value_states)
    # Merge the heads back: [bsz, q_len, hidden_size]
    attn_output = attn_output.transpose(1, 2)
    attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
    attn_output = self.o_proj(attn_output)
    if not output_attentions:
        attn_weights = None
    return attn_output, attn_weights, past_key_value
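Why appending to the cache gives the same result as recomputing everything can be checked with a stripped-down, single-head sketch in pure Python. It leaves out the projections, multiple heads, and rotary embedding from the code above, and all function names are illustrative.

```python
import math
import random
from typing import List

Vec = List[float]


def softmax(xs: Vec) -> Vec:
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]


def attend(q: Vec, keys: List[Vec], values: List[Vec]) -> Vec:
    """One query attends over the given keys/values (scaled dot-product)."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


def causal_attention(qs: List[Vec], ks: List[Vec], vs: List[Vec]) -> List[Vec]:
    """Full recompute: token t may only see keys/values 0..t (causal mask)."""
    return [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]


def decode_with_cache(qs: List[Vec], ks: List[Vec], vs: List[Vec]) -> List[Vec]:
    """Incremental decoding: one token per step, K/V appended to a cache."""
    cache_k: List[Vec] = []
    cache_v: List[Vec] = []
    outs = []
    for q, k, v in zip(qs, ks, vs):
        cache_k.append(k)  # plays the role of torch.cat([past, new], dim=2)
        cache_v.append(v)
        outs.append(attend(q, cache_k, cache_v))
    return outs


rng = random.Random(0)
rand = lambda: [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
qs, ks, vs = rand(), rand(), rand()
full = causal_attention(qs, ks, vs)
step = decode_with_cache(qs, ks, vs)
print(all(abs(a - b) < 1e-12 for x, y in zip(full, step) for a, b in zip(x, y)))
```

Because earlier keys and values never change under a causal mask, each decoding step only needs to compute K and V for the newest token.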

input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
if streamer is not None:
    streamer.put(next_tokens.cpu())
model_kwargs = self._update_model_kwargs_for_generation(
    outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
)
if eos_token_id_tensor is not None:
    unfinished_sequences = unfinished_sequences.mul(
        next_tokens.tile(eos_token_id_tensor.shape[0], 1).ne(eos_token_id_tensor.unsqueeze(1)).prod(dim=0)
    )
if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
    break
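The excerpt above picks up after `next_tokens` has already been chosen. One common way to make that choice from the last position's logits is temperature scaling followed by top-k sampling. The function below is a generic sketch of that idea with illustrative names and defaults, not the code path the library takes.

```python
import math
import random
from typing import List, Optional


def sample_next_token(logits: List[float], temperature: float = 1.0,
                      top_k: int = 0, rng: Optional[random.Random] = None) -> int:
    """Pick a token id from logits. temperature <= 0 is treated as greedy."""
    rng = rng or random.Random()
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    # Candidate ids sorted by score; keep only the top_k best if requested.
    candidates = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        candidates = candidates[:top_k]
    # Softmax over the remaining candidates (random.choices normalizes the weights).
    m = max(scaled[i] for i in candidates)
    weights = [math.exp(scaled[i] - m) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]
print(sample_next_token(logits, temperature=0))                       # greedy pick
print(sample_next_token(logits, temperature=0.8, top_k=2, rng=random.Random(0)))
```

Lower temperatures concentrate probability on the highest-scoring tokens, and top_k=1 reduces to greedy decoding.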

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

input_sentences = [
    "DeepSpeed is a machine learning framework",
    "He is working on",
    "He has a",
    "He got all",
    "Everyone is happy and I can",
    "The new movie that got Oscar this year",
    "In the far far distance from our galaxy,",
    "Peace is the only way",
]
tokenizer.pad_token_id = 0  # pad with token id 0; pad_token itself expects a string, not an int
input_tokens = tokenizer.batch_encode_plus(input_sentences, return_tensors="pt", padding=True)
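What padding=True does conceptually is bring every sequence in the batch to the same length and record which positions are real tokens. The helper below is an illustrative sketch, not the tokenizer's code. Which side a real tokenizer pads on depends on its padding_side setting, and the token ids in the example are made up.

```python
from typing import List, Tuple


def pad_batch(seqs: List[List[int]], pad_id: int = 0,
              side: str = "left") -> Tuple[List[List[int]], List[List[int]]]:
    """Pad token-id sequences to a common length; return (input_ids, attention_mask)."""
    max_len = max(len(s) for s in seqs)
    ids, mask = [], []
    for s in seqs:
        n = max_len - len(s)
        if side == "left":
            ids.append([pad_id] * n + list(s))
            mask.append([0] * n + [1] * len(s))
        else:
            ids.append(list(s) + [pad_id] * n)
            mask.append([1] * len(s) + [0] * n)
    return ids, mask


ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0, side="left")
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

The attention mask is what lets the model ignore the padded positions during attention.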

Article outline

What is an LLM
LLAMA-related concepts
Tokenizer, Token, Embedding
What is tokenization?
Code example
Self-Attention
Positional encoding
Multi-head Attention
Batch Norm & Layer Norm
Residual networks (ResNet)
LLAMA model structure
Running the pipeline
Step 1: tokenization
Step 2: configuration
Step 3: Sample
LlamaDecoderLayer
LlamaAttention
Deployment notes
Padding
Bad Words and Stop Words
KV-Cache
Summary
Some concepts
Unconditional Generation
Context Len
