LLM Inference Acceleration and Model Quantization Methods Explained

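Summary (AI-generated): This article covers four mainstream quantization methods for accelerating LLM inference: AWQ, AutoGPTQ, bitsandbytes, and llama.cpp. It walks through each method's principle, installation steps, core code, and configuration parameters. By comparing the prefill and decode performance of the different quantization schemes, it offers selection advice for different hardware environments and business needs, to help developers deploy large models when GPU memory is limited.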

GopherDev · Published 2025/2/7 · Updated 2026/6/221 views

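Deploying large language models (LLMs) usually runs into two problems: high GPU memory usage and high inference latency. Model quantization converts high-precision floating-point weights into low-precision integers, which significantly reduces memory usage and can speed up inference while keeping model quality largely unchanged. Common quantization options include AWQ, AutoGPTQ, bitsandbytes, and llama.cpp.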
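To make the float-to-integer conversion concrete, here is a minimal sketch of symmetric round-to-nearest quantization in plain Python. The helper names `quantize_symmetric` and `dequantize` and the sample weights are illustrative assumptions. None of the libraries below expose these functions, and real implementations work on tensors and usually quantize per group or per channel.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.98, -1.27, 0.04]  # made-up sample values

q8, s8 = quantize_symmetric(weights, bits=8)
q4, s4 = quantize_symmetric(weights, bits=4)

# Largest reconstruction error at each bit width
err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))

print("int8:", q8, "max error:", err8)
print("int4:", q4, "max error:", err4)
```

Fewer bits mean a coarser grid and a larger rounding error. That gap is what the methods below try to limit.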

1. AWQ (Activation-aware Weight Quantization)

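AWQ is an activation-aware weight quantization method. It identifies the channels that most affect the output and protects the corresponding weights, which preserves higher accuracy at low bit widths.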
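The idea of protecting salient channels can be sketched as follows. This is a simplified toy, not the AutoAWQ implementation. It assumes a per-input-channel scale of the form (mean absolute activation) raised to a power alpha, multiplies the weights by that scale before quantization, divides the activations by it at inference time, and picks alpha by a small grid search on output error. All function names and data are hypothetical.

```python
def quantize_row(row, bits=4):
    """Symmetric round-to-nearest quantization of one weight row (returned dequantized)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in row) or 1.0
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


def output_error(weight, acts, scales, bits=4):
    """Mean squared error of y = W x after quantizing W * s and feeding x / s."""
    err, count = 0.0, 0
    for row in weight:
        q_row = quantize_row([w * s for w, s in zip(row, scales)], bits)
        for x in acts:
            exact = sum(w * xi for w, xi in zip(row, x))
            approx = sum(qw * (xi / s) for qw, xi, s in zip(q_row, x, scales))
            err += (exact - approx) ** 2
            count += 1
    return err / count


def search_scales(weight, acts, bits=4, grid=20):
    """Grid-search alpha in [0, 1]; alpha = 0 is plain quantization (all scales 1)."""
    n_in = len(weight[0])
    act_mag = [sum(abs(x[i]) for x in acts) / len(acts) for i in range(n_in)]
    best = None
    for step in range(grid + 1):
        alpha = step / grid
        scales = [max(m, 1e-8) ** alpha for m in act_mag]
        err = output_error(weight, acts, scales, bits)
        if best is None or err < best[0]:
            best = (err, alpha, scales)
    return best


# Toy data: input channel 2 has large activations but small weights
weight = [[0.8, -0.3, 0.05, 0.6], [-0.4, 0.9, -0.02, 0.1]]
acts = [[0.1, 0.2, 8.0, 0.1], [0.2, -0.1, -6.0, 0.3], [-0.1, 0.3, 7.5, -0.2]]

baseline_err = output_error(weight, acts, [1.0] * 4)
best_err, best_alpha, best_scales = search_scales(weight, acts)

print("plain 4-bit error:", baseline_err)
print("scaled 4-bit error:", best_err, "at alpha =", best_alpha)
```

Because alpha = 0 is part of the grid, the searched result can never be worse than plain quantization on the calibration data.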

Install dependencies

pip install autoawq

Quantization code example

def quantize_awq():
    # The package is installed as "autoawq" but imported as "awq"
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer, AwqConfig

    model_path = "/home/chuan/models/qwen/Qwen1___5-7B-Chat"
    quant_path = "/home/chuan/models/qwen/Qwen1___5-7B-Chat-awq"
    # w_bit=4 selects 4-bit weights; q_group_size=128 is the quantization group size
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    # Load the model and tokenizer
    model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Run quantization
    model.quantize(tokenizer, quant_config=quant_config)

    print(quant_config)
    # Attach a transformers-compatible quantization config to the model config
    quantization_config = AwqConfig(
        bits=quant_config["w_bit"],
        group_size=quant_config["q_group_size"],
        zero_point=quant_config["zero_point"],
        version=quant_config["version"].lower(),
    ).to_dict()

    model.model.config.quantization_config = quantization_config

    # The value after "safetensors=" is missing in the source; True is assumed here
    model.save_quantized(quant_path, safetensors=True)
    tokenizer.save_pretrained(quant_path)
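The `w_bit`, `q_group_size`, and `zero_point` settings in the config above can be illustrated with a small group-wise asymmetric quantizer. This is a plain-Python sketch under stated assumptions, not AutoAWQ's kernel code. The function names are hypothetical, and a group size of 4 is used only to keep the example readable (the config above uses 128).

```python
def quantize_group(values, w_bit=4):
    """Asymmetric quantization of one group: q = round(v / scale) + zero."""
    qmax = 2 ** w_bit - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = round(-lo / scale)  # integer zero point so that lo maps near 0
    q = [max(0, min(qmax, round(v / scale) + zero)) for v in values]
    return q, scale, zero


def quantize_grouped(row, q_group_size=4, w_bit=4):
    """Split a weight row into groups; each group gets its own scale and zero point."""
    return [
        quantize_group(row[start:start + q_group_size], w_bit)
        for start in range(0, len(row), q_group_size)
    ]


def dequantize_grouped(groups):
    """Reconstruct approximate weights from (q, scale, zero) groups."""
    out = []
    for q, scale, zero in groups:
        out.extend((v - zero) * scale for v in q)
    return out


row = [0.1, -0.2, 0.05, 0.3, 2.0, -1.5, 0.7, 1.1]  # made-up weights
groups = quantize_grouped(row, q_group_size=4, w_bit=4)
restored = dequantize_grouped(groups)

for q, scale, zero in groups:
    print("codes:", q, "scale:", round(scale, 4), "zero:", zero)
print("restored:", [round(v, 3) for v in restored])
```

The second group contains larger values, so it gets a larger scale. Keeping groups small confines the effect of an outlier to its own group, at the cost of storing more scales and zero points.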

2. AutoGPTQ

Install dependencies

git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install -vvv -e .
pip install optimum

Quantization code example

def quantize_gptq():
    # GPTQConfig is provided by transformers (optimum and auto-gptq act as the backend)
    from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig

    model_path = "/home/chuan/models/qwen/Qwen1___5-7B-Chat"
    quant_path = "/home/chuan/models/qwen/Qwen1___5-7B-Chat-gptq"

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    # Calibration data; a single sentence is only enough for a smoke test
    dataset = ["auto-gptq is an easy-to-use model quantization library."]

    # Quantization runs while the model is being loaded
    gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)
    model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=gptq_config, device_map="auto", trust_remote_code=True)

    model.save_pretrained(quant_path)
    tokenizer.save_pretrained(quant_path)

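3. bitsandbytes

Install dependencies

pip install bitsandbytes

# Load a model directly in 8-bit or 4-bit; recent transformers releases prefer
# passing quantization_config=BitsAndBytesConfig(...) instead of these flags
from transformers import AutoModelForCausalLM
model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True)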

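from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# model_id is not defined in the source; the checkpoint from the snippet above is assumed
model_id = "facebook/opt-350m"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,      # nested quantization, further reduces memory
    bnb_4bit_quant_type="nf4",           # 4-bit data type designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16  # dtype used for matrix multiplication
)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")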
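As a rough worked example of what nested (double) quantization saves, the arithmetic below estimates the per-parameter overhead of the quantization constants. The block sizes (64 weights per block, 256 constants per second-level block) and bit widths follow the description in the QLoRA paper and are assumptions here, not values read from this article or from a specific bitsandbytes release. The 7B parameter count is likewise only illustrative.

```python
def plain_overhead_bits(block_size=64, const_bits=32):
    """One FP32 scaling constant per block of weights."""
    return const_bits / block_size


def double_quant_overhead_bits(block_size=64, second_block_size=256,
                               first_const_bits=8, second_const_bits=32):
    """Constants stored in 8 bits, plus one FP32 constant per block of constants."""
    first_level = first_const_bits / block_size
    second_level = second_const_bits / (block_size * second_block_size)
    return first_level + second_level


plain = plain_overhead_bits()
nested = double_quant_overhead_bits()
saved_bits = plain - nested

n_params = 7_000_000_000  # illustrative model size
saved_gib = saved_bits * n_params / 8 / 1024 ** 3

print(f"overhead without double quantization: {plain:.3f} bits/param")
print(f"overhead with double quantization:    {nested:.3f} bits/param")
print(f"saved: {saved_bits:.3f} bits/param, about {saved_gib:.2f} GiB for 7B parameters")
```

Under these assumptions the saving is a bit under 0.4 bits per parameter, which is small for a 7B model but grows linearly with model size.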

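4. llama.cpp

Build, convert, and quantize

# The commands below match older llama.cpp releases; newer releases build with CMake
# and rename the tools (for example convert_hf_to_gguf.py, llama-quantize, llama-cli)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j LLAMA_CUBLAS=1

# Convert the HF model to GGUF
python convert-hf-to-gguf.py /home/chuan/models/qwen/Qwen1___5-7B-Chat

# Quantize to 4-bit (Q4_K_M method)
./quantize /home/chuan/models/qwen/Qwen1___5-7B-Chat/ggml-model-f16.gguf /home/chuan/models/qwen/Qwen1___5-7B-Chat/ggml-model-Q4_K_M.gguf Q4_K_M

# Run inference
./main -m /home/chuan/models/qwen/Qwen1___5-7B-Chat/ggml-model-Q4_K_M.gguf -n 128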

pip install llama-cpp-python

from llama_cpp import Llama
llm = Llama(
    model_path="/home/chuan/models/qwen/Qwen1___5-7B-Chat/ggml-model-Q4_K_M.gguf",
    n_gpu_layers=-1, # offload all layers to the GPU (needs a GPU-enabled build)
    n_ctx=2048,      # context window size
)
output = llm(
    "Q: Name the planets in the solar system? A: ",
    max_tokens=32,
    stop=["Q:", "\n"],
    echo=True
)
print(output)
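The call above returns an OpenAI-style completion dictionary rather than a plain string. The sketch below shows one way to pull out the generated text and token counts. The helper names are hypothetical and `sample_output` is a hand-written stand-in with made-up values, not real model output.

```python
def completion_text(output):
    """Return the text of the first choice in a completion dict."""
    return output["choices"][0]["text"]


def token_usage(output):
    """Return (prompt_tokens, completion_tokens), defaulting to 0 if absent."""
    usage = output.get("usage", {})
    return usage.get("prompt_tokens", 0), usage.get("completion_tokens", 0)


# Hand-written example in the same shape as a llama-cpp-python completion result
sample_output = {
    "object": "text_completion",
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth",
            "index": 0,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 6, "total_tokens": 20},
}

print(completion_text(sample_output))
print(token_usage(sample_output))
```

With `echo=True` the returned text includes the prompt, so callers that only want the answer may prefer `echo=False`.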

5. Performance comparison and selection advice

Model	prefill(bs=1)	decode(bs=1)	prefill(bs=2)	decode(bs=2)
base	8912	50	10248	91
bnb	5241	58	7298	57
gptq	9347	91	11333	151
awq	5677	68	7779	139
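To compare the rows more easily, the table can be normalized against the base model. The snippet below only divides each figure by the corresponding base figure. Reading a ratio above 1 as a speedup assumes the numbers are throughput (higher is better), which the table itself does not state.

```python
columns = ["prefill(bs=1)", "decode(bs=1)", "prefill(bs=2)", "decode(bs=2)"]

# Figures copied from the table above
results = {
    "base": [8912, 50, 10248, 91],
    "bnb": [5241, 58, 7298, 57],
    "gptq": [9347, 91, 11333, 151],
    "awq": [5677, 68, 7779, 139],
}


def relative_to_base(table, baseline="base"):
    """Divide every row by the baseline row, rounded to two decimals."""
    base = table[baseline]
    return {
        name: [round(value / ref, 2) for value, ref in zip(values, base)]
        for name, values in table.items()
        if name != baseline
    }


ratios = relative_to_base(results)

print("model", *columns, sep="\t")
for name, row in ratios.items():
    print(name, *row, sep="\t")
```

In this data gptq is at or above the base figure in every column, while bnb and awq fall below base on prefill and awq stays above base on decode.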
