LLM Quantization Techniques, QLoRA, and Common Quantization Libraries Explained | 极客日志


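Summary (AI-generated): This article explains quantization techniques for large language models in detail, covering the principles of converting floating-point numbers to fixed-point numbers, the PTQ and QAT categories, and linear versus nonlinear quantization. It focuses on the core mechanisms and use cases of mainstream quantization algorithms such as BnB, GPTQ, AWQ, and HQQ, and uses QLoRA to show how to fine-tune models in low-resource environments. It also compares the performance characteristics and code usage of common quantization libraries such as AutoGPTQ, Bitsandbytes, and GGML, to help developers choose a quantization scheme that fits their GPU memory limits and accuracy requirements.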

Author: 黑客. Published 2025/2/7, updated 2026/6/8.


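Model inference is the evaluation of a complex function. The computation is dominated by matrix multiplication, which makes it a parallel-computing workload. Broadly speaking, a single CPU core supports a wider variety of operations and runs each one faster, but it mostly processes them one at a time; a GPU handles only simpler operations but runs many of them in parallel, which suits matrix multiplication very well. Loading every matrix onto the GPU, however, sharply increases GPU memory usage. With LLM sizes ranging from 7B, 14B, and 34B to several hundred billion parameters, the memory footprint becomes enormous. How can we reduce both computation and memory usage without degrading inference quality too much? To answer this we need the concepts of floating-point and fixed-point numbers.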

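1. Definition and Basic Principles of Quantization

Quantization is the process of converting a model's floating-point numbers into fixed-point numbers for execution.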

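Double-precision floating point: represented by torch.float64 in PyTorch, and called double in many other languages. It is rarely used in LLM training.

Single-precision floating point (commonly called full precision in this context): represented by torch.float32 in PyTorch.

Low-precision floating point: represented by torch.bfloat16 and torch.float16 in PyTorch. The two differ as follows:

bfloat16 has fewer mantissa (fraction) bits and more exponent bits (8 exponent bits, 7 mantissa bits), so its range is roughly that of float32. This helps avoid overflow during training (for example, an accumulated gradient exceeding the largest representable value). However, the type is only supported natively on NVIDIA GPUs of the Ampere generation and later, such as the RTX 30 series.
float16 has more mantissa bits (5 exponent bits, 10 mantissa bits), so it is more precise, but its shorter exponent gives it a much smaller range (the maximum is 65504), which makes overflow more likely.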
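The range-versus-precision trade-off between the two 16-bit formats can be checked with a small sketch that uses only the Python standard library. It assumes that truncating a float32 to its top 16 bits is an acceptable stand-in for bfloat16 (real hardware typically rounds to nearest); the helper names are illustrative and are not part of PyTorch.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to IEEE half precision (1 sign, 5 exponent, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Approximate bfloat16 (1 sign, 8 exponent, 7 mantissa bits).

    Assumption: keep the top 16 bits of the float32 encoding (truncation).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def float16_overflows(x: float) -> bool:
    """Return True when x is too large to be stored as float16."""
    try:
        struct.pack("<e", x)
        return False
    except OverflowError:
        return True


# Precision: float16 stores 0.1 more accurately than bfloat16.
print(abs(to_float16(0.1) - 0.1), abs(to_bfloat16(0.1) - 0.1))

# Range: 70000.0 fits in bfloat16 (coarsely) but not in float16.
print(to_bfloat16(70000.0), float16_overflows(70000.0))
```

Under these assumptions, 0.1 is stored more accurately in float16, while 70000.0 stays representable in bfloat16 and overflows float16.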

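Is there a numeric representation that reduces memory usage and computation even further? One option is to convert floating-point numbers to fixed-point numbers (integers). Integer arithmetic is faster and uses less memory, so if accuracy does not drop much, the trade is a good one. Replacing floating-point computation with integer computation in this way is quantization.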

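The basic idea is to take the floating-point maximum and minimum of each tensor and map its values onto a fixed set of integers, for example [-127, 127]. Consider a simple formula: qweight = round(weight / scale), where qweight is the quantized weight, weight is the original weight, and scale is the scaling factor. After scaling, the round operation that converts the floating-point value to an integer discards the fractional part. When the value is later used in computation or dequantized back to floating point, it cannot be recovered exactly; this is the precision loss.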
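The formula qweight = round(weight / scale) can be illustrated with a minimal sketch. It assumes a symmetric 8-bit range and an absmax-derived scale; the function names and sample weights are made up for illustration.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats onto integers in [-qmax, qmax] with one absmax-derived scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    qweights = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return qweights, scale


def dequantize(qweights, scale):
    """Recover approximate floats; the rounding error cannot be undone."""
    return [q * scale for q in qweights]


weights = [0.123, -0.5, 0.337, 1.27, -1.0]  # hypothetical weights
qweights, scale = quantize_symmetric(weights)
restored = dequantize(qweights, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print(qweights)     # integer codes
print(restored)     # close to, but not equal to, the originals
print(max(errors))  # bounded by about scale / 2
```

With round-to-nearest, the per-element error is at most about half of scale, which is why a smaller scale (a tighter value range) means less precision loss.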

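By the stage at which quantization happens, methods divide into PTQ (post-training quantization, also called offline quantization) and QAT (quantization-aware training, also called online quantization). PTQ is either data-free or calibration-based: the former computes the quantization factors directly without a dataset, while the latter gathers statistics over a small amount of real data to calibrate the factors further, at the cost of more time. QAT first inserts a fake-quantization structure on each operator to be quantized, then simulates quantization during training and continuously updates both the quantization factors and the original weights (similar to backpropagation). Because QAT is more complex, it generally serves as a supplementary technique for improving PTQ results.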
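The difference between data-free and calibration-based factors, and what a fake-quantization node computes in its forward pass, can be sketched as follows. This is a simplified illustration under assumed conditions: absmax statistics, an 8-bit symmetric range, and made-up weights and activations. Real QAT also needs a gradient rule for the rounding step, which is omitted here.

```python
def fake_quantize(x, scale, qmax=127):
    """Quantize then immediately dequantize, so later layers see the rounding error."""
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale


def data_free_scale(weights, qmax=127):
    """Data-free PTQ: derive the factor from the weights alone."""
    absmax = max(abs(w) for w in weights)
    return absmax / qmax if absmax > 0 else 1.0


def calibrated_scale(activation_batches, qmax=127):
    """Calibration-based PTQ: track the activation range over sample batches."""
    running_absmax = 0.0
    for batch in activation_batches:
        running_absmax = max(running_absmax, max(abs(a) for a in batch))
    return running_absmax / qmax if running_absmax > 0 else 1.0


weights = [0.4, -1.27, 0.9]                                  # hypothetical weights
calibration_batches = [[0.5, -2.0, 1.1], [1.0, 2.54, -0.3]]  # hypothetical activations

w_scale = data_free_scale(weights)
a_scale = calibrated_scale(calibration_batches)
print(w_scale, a_scale)
print(fake_quantize(0.337, w_scale))  # the value snapped to the int8 grid
```

Weights are known ahead of time, so their factor needs no data; activation ranges depend on the inputs, which is why calibration requires representative samples.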

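By quantization method, the options include linear quantization, nonlinear quantization (such as logarithmic quantization), and others; linear quantization is the most widely used today. Linear quantization is further divided by symmetry into symmetric and asymmetric quantization. Asymmetric quantization addresses unevenly distributed weights by adding a zero_point term to the formula: qweight = round(weight / scale + zero_point), so that the densely populated part of the distribution gets a wider span of integer values.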
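A minimal sketch of the asymmetric formula qweight = round(weight / scale + zero_point) follows. It assumes an unsigned 8-bit target range [0, 255] and min/max statistics; the function names and sample weights are illustrative.

```python
def quantize_asymmetric(weights, num_bits=8):
    """Map [min, max] of the weights onto [0, 2^bits - 1] using a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / (qmax - qmin) if w_max > w_min else 1.0
    zero_point = round(qmin - w_min / scale)  # integer code that represents 0.0
    qweights = [
        max(qmin, min(qmax, round(w / scale + zero_point))) for w in weights
    ]
    return qweights, scale, zero_point


def dequantize_asymmetric(qweights, scale, zero_point):
    """Invert the mapping: subtract the zero point, then rescale."""
    return [(q - zero_point) * scale for q in qweights]


weights = [-0.5, 0.0, 0.25, 1.0, 2.05]  # skewed toward positive values
qweights, scale, zero_point = quantize_asymmetric(weights)
restored = dequantize_asymmetric(qweights, scale, zero_point)

# For comparison: the scale a symmetric int8 scheme would need for the same data.
symmetric_scale = max(abs(w) for w in weights) / 127

print(qweights, zero_point)
print(scale, symmetric_scale)
print(restored)
```

Because the skewed range is covered without wasting codes on values that never occur, the asymmetric scale here is smaller than the symmetric one, so each integer step represents a finer increment.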

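By granularity, there are **per-layer quantization (one set of quantization factors per layer), per-group quantization (one set of factors per group within each layer), and per-channel quantization (factors assigned per channel)**, among others.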
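The effect of granularity can be seen by quantizing the same small matrix with one factor per layer, per channel (row), and per group. The sketch below assumes symmetric int8 absmax quantization and a made-up 2x4 weight matrix whose two rows differ greatly in magnitude.

```python
QMAX = 127


def roundtrip_error(values, qmax=QMAX):
    """Total absolute error after absmax-quantizing `values` with one shared scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    return sum(abs(v - round(v / scale) * scale) for v in values)


def per_layer_error(weight):
    """One scale for the whole matrix."""
    return roundtrip_error([v for row in weight for v in row])


def per_channel_error(weight):
    """One scale per row (output channel)."""
    return sum(roundtrip_error(row) for row in weight)


def per_group_error(weight, group_size):
    """One scale per contiguous group of `group_size` values inside each row."""
    total = 0.0
    for row in weight:
        for start in range(0, len(row), group_size):
            total += roundtrip_error(row[start:start + group_size])
    return total


weight = [
    [0.01, -0.02, 0.03, 0.015],  # small-magnitude channel
    [1.0, -2.0, 0.5, 1.5],       # large-magnitude channel
]

print(per_layer_error(weight))
print(per_channel_error(weight))
print(per_group_error(weight, group_size=2))
```

With a single per-layer factor, the large row dictates the scale and the small row collapses onto a handful of integer codes; finer granularity avoids this at the cost of storing more factors.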

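By how the maximum threshold is chosen, quantization is either saturating or non-saturating. Non-saturating quantization computes the quantization factor from the ratio of the floating-point maximum to the quantized maximum; because the original weights are not uniformly distributed, parts of the integer range end up with no weights mapped to them. Saturating quantization instead computes an intermediate threshold and derives the factor from it, so it discards some unimportant values (they are clipped) and spreads the important values as evenly as possible across the quantized range.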
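The following sketch contrasts the two schemes on weights that contain one outlier. The clipping threshold of 0.2 is hand-picked for illustration; how a real library searches for this intermediate value is not covered here.

```python
def quantize_with_threshold(weights, threshold, qmax=127):
    """Symmetric quantization where values beyond +/-threshold saturate at +/-qmax."""
    scale = threshold / qmax
    qweights = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return qweights, scale


def inlier_error(weights, qweights, scale, num_inliers):
    """Total absolute round-trip error over the first `num_inliers` values."""
    return sum(
        abs(w - q * scale)
        for w, q in zip(weights[:num_inliers], qweights[:num_inliers])
    )


weights = [0.1, -0.2, 0.15, 0.05, -0.12, 8.0]  # last value is an outlier

# Non-saturating: the threshold is the true absolute maximum.
absmax = max(abs(w) for w in weights)
q_plain, s_plain = quantize_with_threshold(weights, absmax)

# Saturating: a smaller, hand-picked threshold clips the outlier.
q_sat, s_sat = quantize_with_threshold(weights, 0.2)

print(q_plain)  # inliers squeezed into a few small codes
print(q_sat)    # inliers spread over the range, outlier pinned at 127
print(inlier_error(weights, q_plain, s_plain, 5))
print(inlier_error(weights, q_sat, s_sat, 5))
```

The saturating variant represents the five ordinary weights far more accurately, while the outlier is reconstructed as the threshold value and loses most of its magnitude; that is the trade-off the threshold search has to balance.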

By the number of bits after quantization, there are 2-bit, 4-bit, 8-bit, and other variants.
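A rough back-of-the-envelope estimate shows why the bit width matters. The helper below is an illustrative calculation that counts weight storage only; it ignores quantization factors, activations, and the KV cache, so real usage is higher.

```python
def weight_memory_gib(num_params: float, bits: int) -> float:
    """Approximate weight storage in GiB for a model with `num_params` parameters."""
    return num_params * bits / 8 / 1024 ** 3


# A 7B-parameter model at different bit widths.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gib(7e9, bits):.2f} GiB")
```

Halving the bit width halves the weight storage, which is what lets a 4-bit 7B model fit on GPUs that cannot hold its 16-bit weights.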

In general, the forward pass of a quantized module in PyTorch first dequantizes the quantized weights and then computes in floating point.
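The dequantize-then-compute pattern can be sketched in plain Python, without PyTorch. The class below is a hypothetical stand-in for a quantized linear layer, assuming int8 weights with one scale per output channel; it does not mirror the API of any particular library.

```python
class QuantLinear:
    """Toy linear layer that stores int8 codes plus one float scale per output row."""

    def __init__(self, weight, qmax=127):
        self.qweight = []
        self.scales = []
        for row in weight:
            absmax = max(abs(w) for w in row)
            scale = absmax / qmax if absmax > 0 else 1.0
            self.scales.append(scale)
            self.qweight.append([round(w / scale) for w in row])

    def forward(self, x):
        out = []
        for q_row, scale in zip(self.qweight, self.scales):
            w_row = [q * scale for q in q_row]  # step 1: dequantize to float
            out.append(sum(w * xi for w, xi in zip(w_row, x)))  # step 2: float matmul
        return out


weight = [[0.5, -1.0], [0.25, 0.75]]  # hypothetical float weights
x = [2.0, 1.0]

layer = QuantLinear(weight)
reference = [sum(w * xi for w, xi in zip(row, x)) for row in weight]

print(layer.forward(x))  # close to the float result
print(reference)
```

In this pattern the saving comes from storing the weights as integers; the arithmetic itself still runs in floating point.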

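In short, quantization converts a model that computes with fractional (floating-point) values into one that computes with integers. Precision is inevitably lost along the way, because the fractional digits are dropped and converting floating point to integer and back is a lossy compression.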

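With this definition in place, we can move on to the next part. Before continuing, it is worth rereading the quantization basics above. Building on the earlier article, the technical analysis below focuses on the following aspects:

The principles behind quantization methods such as BnB, HQQ, AWQ, and GPTQ
How these quantization methods are typically used

1.1 Principles

1.1.1 BnB Quantization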


pip install ms-swift

# autoawq builds are tied to specific CUDA versions; choose the version according to the official documentation
pip install autoawq -U

# auto_gptq builds are tied to specific CUDA versions; choose the version according to the official documentation
pip install auto_gptq -U

# hqq and eetq currently require transformers and peft installed from source
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/peft.git
#hqq
pip install hqq
#eetq
git clone https://github.com/NetEase-FuXi/EETQ.git
cd EETQ/
git submodule update --init --recursive
pip install .

swift sft --model_type llama3-8b-instruct --dataset alpaca-en --quantization_bit 8 --quant_method bnb --sft_type lora

swift sft --model_type llama3-8b-instruct --dataset alpaca-en --quantization_bit 8 --quant_method eetq --sft_type lora

#GPTQ
OMP_NUM_THREADS=14 swift export --model_type llama3-8b-instruct --quant_method gptq --dataset alpaca-zh alpaca-en sharegpt-gpt4-mini --quant_seqlen 4096 --quant_bits 4
#AWQ
swift export --model_type llama3-8b-instruct --quant_bits 4 --quant_method awq --quant_n_samples 64 --quant_seqlen 4096 --dataset alpaca-zh alpaca-en sharegpt-gpt4-mini

# clone the source code
# cd examples/pytorch/llm
# vim fsdp.sh and write the content below into it
# pip install "bitsandbytes>=0.43.0"
nproc_per_node=2

CUDA_VISIBLE_DEVICES=0,1 \
accelerate launch --config_file "./scripts/llama2_70b_chat/qlora_fsdp/fsdp_offload.json" \
    llm_sft.py \
    --model_type llama2-70b-chat \
    --model_revision master \
    --sft_type lora \
    --tuner_backend peft \
    --template_type AUTO \
    --dtype bf16 \
    --output_dir output \
    --dataset leetcode-python-en \
    --train_dataset_sample -1 \
    --num_train_epochs 1 \
    --max_length 2048 \
    --check_dataset_strategy warning \
    --quantization_bit 4 \
    --bnb_4bit_comp_dtype AUTO \
    --bnb_4bit_quant_storage bfloat16 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dtype AUTO \
    --lora_dropout_p 0.05 \
    --lora_target_modules DEFAULT \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.1 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 10

CUDA_VISIBLE_DEVICES=0 swift infer \
    --model_type qwen1half-7b-chat \
    --quant_method bnb \
    --quantization_bit 4

CUDA_VISIBLE_DEVICES=0 swift infer \
    --model_type qwen1half-7b-chat \
    --quant_method hqq \
    --quantization_bit 4

CUDA_VISIBLE_DEVICES=0 swift infer \
    --model_type qwen1half-7b-chat \
    --quant_method eetq \
    --dtype fp16

| model | GPU | num_beams | fp16 (tokens/s) | gptq-int4 (tokens/s) |
| --- | --- | --- | --- | --- |
| llama-7b | 1xA100-40G | 1 | 18.87 | 25.53 |
| llama-7b | 1xA100-40G | 4 | 68.79 | 91.30 |
| moss-moon 16b | 1xA100-40G | 1 | 12.48 | 15.25 |
| moss-moon 16b | 1xA100-40G | 4 | OOM | 42.67 |
| moss-moon 16b | 2xA100-40G | 1 | 06.83 | 06.78 |
| moss-moon 16b | 2xA100-40G | 4 | 13.10 | 10.80 |
| gpt-j 6b | 1xRTX3060-12G | 1 | OOM | 29.55 |
| gpt-j 6b | 1xRTX3060-12G | 4 | OOM | 47.36 |

from modelscope import AutoTokenizer, snapshot_download
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging
import shutil
import os

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = snapshot_download("qwen/Qwen-1_8B-Chat")
quantized_model_dir = "qwen-1_8B-4bit"

shutil.rmtree(quantized_model_dir, ignore_errors=True)
shutil.copytree(pretrained_model_dir, quantized_model_dir)
for _file in os.listdir(quantized_model_dir):
    if ".safetensors" in _file or ".bin" in _file:
        os.remove(os.path.join(quantized_model_dir, _file))

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, trust_remote_code=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
)

#load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config, trust_remote_code=True).to(0)

#quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

#save quantized model
model.save_quantized(quantized_model_dir)

#save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

#load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
#inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

swift sft --model_id_or_path qwen/Qwen-7B-Chat-Int4 --model_revision master --sft_type lora --tuner_backend swift --template_type qwen --dtype fp16 --output_dir output --dataset leetcode-python-en --train_dataset_sample -1 --num_train_epochs 1 --max_length 512 --check_dataset_strategy warning --lora_rank 8 --lora_alpha 32 --lora_dropout_p 0.05 --lora_target_modules ALL --gradient_checkpointing true --batch_size 1 --weight_decay 0.01 --learning_rate 1e-4

from modelscope import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
  'qwen/Qwen-1_8B-Chat',
  load_in_8bit=True,
  trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained('qwen/Qwen-1_8B-Chat', trust_remote_code=True)

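# move the inputs to the same device as the quantized model before the forward pass
inputs = tokenizer('how are you?', return_tensors='pt').to(model.device)
print(model(**inputs))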

MODEL=llama-7b

#run AWQ search (optional; we provided the pre-computed results)
python -m awq.entry --model_path /dataset/llama-hf/$MODEL \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/$MODEL-w4-g128.pt

#evaluate the AWQ quantize model (simulated pseudo quantization)
python -m awq.entry --model_path /dataset/llama-hf/$MODEL \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/$MODEL-w4-g128.pt \
    --q_backend fake

#generate real quantized weights (w4)
python -m awq.entry --model_path /dataset/llama-hf/$MODEL \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/$MODEL-w4-g128.pt \
    --q_backend real --dump_quant quant_cache/$MODEL-w4-g128-awq.pt

#load and evaluate the real quantized model (smaller gpu memory usage)
python -m awq.entry --model_path /dataset/llama-hf/$MODEL \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant_cache/$MODEL-w4-g128-awq.pt
