Running Qwen3-Coder-30B Inference with llama.cpp on a DCU BW1000: Practice and Troubleshooting

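This post records an attempt to run the Qwen3-Coder-30B-A3B-Instruct-AWQ model on a DCU BW1000 accelerator card using llama.cpp and transformers. The llmfit analysis indicated that VRAM was insufficient to support expert offloading, and that the model file paths differed. llama.cpp failed because it could not read the model's magic number, and transformers failed to load the AWQ-quantized weights because the gptqmodel dependency was missing. Library version compatibility problems in the environment prevented the inference workflow from completing.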

beaabea · Published 2026/3/22 · Updated 2026/7/2640 views

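This experiment was run on a DCU BW1000 accelerator card. The hardware itself was available, but the container image was only minimally configured, which made installing some dependencies and loading the model more cumbersome than usual.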

Model analysis

First, the llmfit tool was used to assess how well the target model stelterlab/Qwen3-Coder-30B-A3B-Instruct-AWQ fits the hardware:

=== stelterlab/Qwen3-Coder-30B-A3B-Instruct-AWQ ===
Provider: stelterlab
Parameters: 4.6B
Quantization: Q4_K_M
Best Quant: Q8_0
Context Length: 262144 tokens
Use Case: Code generation and completion
Category: Coding
Released: 2025-07-31
Runtime: llama.cpp (est. ~17.2 tok/s)

Score Breakdown:
  Overall Score: 66.7 / 100
  Quality: 68  Speed: 43  Fit: 61  Context: 100
  Estimated Speed: 17.2 tok/s

Resource Requirements:
  Min VRAM: 2.4 GB
  Min RAM: 2.6 GB
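
The reported figures (4.6B parameters, 2.4 GB minimum VRAM) appear small next to the 30B in the model name, so a back-of-the-envelope estimate is a useful cross-check. The sketch below is not part of llmfit. It assumes the nominal 30B total parameter count taken from the model name, and it ignores the KV cache, activations, and quantization metadata such as scales and zero points.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1024**3


# Assumption: 30e9 is the nominal size from the model name, not a measured count.
for bits in (4, 8, 16):
    print(f"{bits:>2}-bit: ~{weight_memory_gib(30e9, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights alone come to roughly 14 GiB, which is the number to compare against the card's free VRAM before choosing between full GPU loading and offloading.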

Building llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

export PATH=/root/llama.cpp/build/bin:$PATH

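Model download

from modelscope import snapshot_download

# Download the checkpoint from ModelScope into ./models.
# Note: this is the tclf90 repository, while the llmfit analysis above used the stelterlab one.
snapshot_download('tclf90/Qwen3-Coder-30B-A3B-Instruct-AWQ', cache_dir="models")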

Inference attempts

Option 1: llama-cli

llama-cli -m models/tclf90/Qwen3-Coder-30B-A3B-Instruct-AWQ

Loading model... |gguf_init_from_file_impl: failed to read magic
llama_model_load: error loading model: llama_model_loader: failed to load model from models/tclf90/Qwen3-Coder-30B-A3B-Instruct-AWQ
...
Failed to load the model
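
The `failed to read magic` message means llama.cpp could not read a GGUF header from the given path: a GGUF file begins with the four bytes `GGUF`. The path passed to `-m` here appears to be the download directory of an AWQ checkpoint rather than a single `.gguf` file, so a quick pre-check can be sketched as below. The helper names `is_gguf_file` and `find_gguf_files` are hypothetical, and whether this directory contains any GGUF file at all is not known from the log.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def is_gguf_file(path: str) -> bool:
    """Return True if path is a regular file that starts with the GGUF magic."""
    p = Path(path)
    if not p.is_file():
        return False  # directories and missing paths have no header to read
    with p.open("rb") as f:
        return f.read(4) == GGUF_MAGIC


def find_gguf_files(root: str) -> list[str]:
    """Recursively list files under root that start with the GGUF magic."""
    return sorted(str(p) for p in Path(root).rglob("*") if is_gguf_file(str(p)))


if __name__ == "__main__":
    model_path = "models/tclf90/Qwen3-Coder-30B-A3B-Instruct-AWQ"
    print("single GGUF file:", is_gguf_file(model_path))
    print("GGUF files found:", find_gguf_files(model_path))
```

If the second list is empty, the download holds no GGUF weights, and llama-cli would need a GGUF build of the model instead of the AWQ checkpoint.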

Option 2: Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

# Local path of the AWQ checkpoint downloaded from ModelScope
model_name = "/root/models/tclf90/Qwen3-Coder-30B-A3B-Instruct-AWQ"

# Loading the AWQ weights is the step that fails in this environment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Build a single-turn chat prompt with the model's chat template
prompt = "Write a quick sort algorithm."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then keep only the tokens produced after the prompt
generated_ids = model.generate(**model_inputs, max_new_tokens=65536)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)

File .../quantizer_awq.py:48, in AwqQuantizer.validate_environment(self, **kwargs)
    raise ImportError(
        "Loading an AWQ quantized model requires gptqmodel. Please install it with `pip install gptqmodel`"
    )
ImportError: Loading an AWQ quantized model requires gptqmodel. Please install it with `pip install gptqmodel`
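
The error is raised by the AWQ quantizer in transformers (`quantizer_awq.py`), which is chosen from the checkpoint's quantization settings. Hugging Face style checkpoints normally record these under `quantization_config` in `config.json`, so the method can be inspected before attempting a full load. This is a minimal sketch: the helper name `read_quant_method` is hypothetical, and the exact keys present in this particular checkpoint are an assumption.

```python
import json
from pathlib import Path


def read_quant_method(model_dir: str):
    """Return quantization_config['quant_method'] from config.json, or None."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return None
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # quantization_config may be absent or null for unquantized checkpoints
    quant_config = config.get("quantization_config") or {}
    return quant_config.get("quant_method")


if __name__ == "__main__":
    print(read_quant_method("/root/models/tclf90/Qwen3-Coder-30B-A3B-Instruct-AWQ"))
```

Knowing the method up front shows which extra backend the load will ask for, before any weights are read.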

pip install gptqmodel
# Exception: Unable to detect torch version via uv/pip/conda/importlib...
ERROR: Failed to build 'gptqmodel' when getting requirements to build wheel

conda install gptqmodel
# PackagesNotFoundError: The following packages are not available from current channels
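
Both installation attempts fail before any model code runs, so it helps to record what the environment actually contains. The sketch below uses only the standard library to report installed distribution versions. The helper name `probe_packages` and the package list are illustrative choices, and distribution names can differ from import names.

```python
from importlib import metadata


def probe_packages(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in probe_packages(["torch", "transformers", "gptqmodel"]).items():
        print(f"{name}: {version or 'not installed'}")
```

If `torch` is listed here while the gptqmodel build still cannot detect it, one possible cause is pip's build isolation hiding the installed torch from the build step. That is a guess the log does not confirm; retrying with pip's `--no-build-isolation` option would be one way to test it.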
