ImportError: dlopen(/Library/Python/3.9/site-packages/_itree.cpython-39-darwin.so, 0x0002): tried: '/Library/Python/3.9/site-packages/_itree.cpython-39-darwin.so'(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/Library/Python/3.9/site-packages/_itree.cpython-39-darwin.so'(no such file), '/Library/Python/3.9/site-packages/_itree.cpython-39-darwin.so'(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))
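The error above says the compiled extension was built for x86_64 while the running interpreter needs arm64. As a quick diagnostic, the sketch below reports which architecture the current interpreter runs as. It uses only the standard library, and the helper name `describe_interpreter` is made up for illustration.

```python
import platform
import sys


def describe_interpreter() -> str:
    """Return a one-line summary of the running interpreter's architecture."""
    # platform.machine() reports the architecture of the running process,
    # typically "arm64" natively on Apple Silicon or "x86_64" under Rosetta.
    arch = platform.machine()
    version = platform.python_version()
    return f"Python {version} at {sys.executable} runs as {arch}"


print(describe_interpreter())
```

If this reports arm64 while the error mentions an x86_64 binary, the package was probably installed by an x86_64 build of Python or pip, and reinstalling it with the arm64 interpreter is one likely fix.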
python3 -m llama.download --model_size 7B
❤️ Resume download is supported. You can ctrl-c and rerun the program to resume the downloading
Downloading tokenizer...
✅ pyllama_data/tokenizer.model
✅ pyllama_data/tokenizer_checklist.chk
tokenizer.model: OK
Downloading 7B
downloading file to pyllama_data/7B/consolidated.00.pth ...please wait for a few minutes ...
✅ pyllama_data/7B/consolidated.00.pth
✅ pyllama_data/7B/params.json
✅ pyllama_data/7B/checklist.chk
Checking checksums for the 7B model
consolidated.00.pth: OK
params.json: OK
2.1.3 Downloading with a script
#!/bin/bash
# Function to handle stopping the script
function stop_script() {
  echo "Stopping the script."
  exit 0
}
# Register the signal handler
trap stop_script SIGINT
while true; do
  # Run the command with a timeout of 2000 seconds
  timeout 2000 python -m llama.download --model_size "$1" --folder model
  echo "restart download"
  # Wait for 1 second before starting the next iteration
  sleep 1
  # Wait for any key to be pressed within a 1-second timeout
  read -t 1 -n 1 -s key
  if [[ $key ]]; then
    stop_script
  fi
done
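The same restart-on-timeout idea can be sketched in Python with the standard library. This is an illustrative variant, not part of pyllama: the function name `run_with_retries` and the `max_attempts` limit are assumptions, and unlike the shell loop it stops as soon as the command exits successfully.

```python
import subprocess
import sys
from typing import Sequence


def run_with_retries(cmd: Sequence[str], timeout_s: float, max_attempts: int = 5) -> int:
    """Run cmd until it exits with status 0; return the number of attempts used.

    Raises RuntimeError if every attempt fails or times out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child process and raises
            # TimeoutExpired once timeout_s seconds have elapsed.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: timed out, restarting download")
            continue
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt}: exit code {result.returncode}, restarting download")
    raise RuntimeError(f"command failed after {max_attempts} attempts")
```

Calling `run_with_retries([sys.executable, "-m", "llama.download", "--model_size", "7B", "--folder", "model"], timeout_s=2000)` would mirror the command used in the script. Because the download supports resuming, each restart continues from where the previous attempt stopped.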
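llama.cpp provides an API that is compatible with the OpenAI API. The API service is started with the llama-server executable produced by the build. If you built with GPU support, pass -ngl N or --n-gpu-layers N to set the number of layers to offload, so that inference runs on the GPU. Without -ngl N or --n-gpu-layers N, the program runs on the CPU by default.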
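As a rough illustration of the flags described above, the sketch below assembles a llama-server command line. The helper name `build_server_command`, the executable path `./llama-server`, and the model path `models/model.gguf` are placeholders chosen for this example; adjust them to your own build and model.

```python
from typing import List


def build_server_command(
    model_path: str,
    n_gpu_layers: int = 0,
    port: int = 8080,
    executable: str = "./llama-server",  # hypothetical path to the built binary
) -> List[str]:
    """Assemble an argv list for llama-server.

    With n_gpu_layers=0 the -ngl flag is left out, so inference stays on the CPU.
    """
    cmd = [executable, "-m", model_path, "--port", str(port)]
    if n_gpu_layers > 0:
        # -ngl N (or --n-gpu-layers N) offloads N layers to the GPU.
        cmd += ["-ngl", str(n_gpu_layers)]
    return cmd


# "models/model.gguf" is a placeholder model path.
print(build_server_command("models/model.gguf", n_gpu_layers=33))
```

Passing the resulting list to `subprocess.Popen` would start the server as a child process.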
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 1002.00 MiB
llm_load_tensors: CUDA0 buffer size = 14315.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
Running llama-server starts a process similar to a web server, listening on port 8080 by default. With the API service up, you can test it with curl.
curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "What color is the sun?","n_predict": 512}'
{"content":".....","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,......}}
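The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the server runs on localhost:8080 as above; the function names `build_completion_request` and `complete` are made up for this example.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    n_predict: int = 512,
    base_url: str = "http://localhost:8080",
) -> urllib.request.Request:
    """Build a POST request for the server's /completion endpoint."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, n_predict: int = 512) -> str:
    """Send the request and return the generated text from the "content" field."""
    request = build_completion_request(prompt, n_predict)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["content"]
```

With the server running, `complete("What color is the sun?")` should return the text that appears in the "content" field of the JSON response shown above.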
The service can also be accessed through the web page or through the OpenAI API. To use the latter, install the openai dependency:
pip install openai
Access the service using the OpenAI API:
import openai

client = openai.OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="sk-no-key-required",
)
completion = client.chat.completions.create(
    model="qwen",  # model name can be chosen arbitrarily
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "tell me something about michael jordan"},
    ],
)
print(completion.choices[0].message.content)
3.3 Model API service (third-party; not needed if you installed it yourself)
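The llama.cpp project lists third-party toolkits written in various languages that can be used to provide an API service. Taking Python as an example, this section uses llama-cpp-python to provide the API service.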