Converting DeepSeek R1 7B to RKLLM and Deploying It on the Web on an RK3588 Board | 极客日志

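A complete walkthrough for deploying the DeepSeek R1 7B large language model on a Rockchip RK3588 development board. It covers upgrading the NPU driver to 0.9.8, setting up the RKLLM conversion toolchain on an x86 machine, converting the safetensors weights to the rkllm format, testing C++ inference on the board, and integrating Gradio for web access over the LAN. The focus is on three key steps: adapting the ChatML template, configuring the quantization parameters, and setting the environment variables. The result is a fully local, private deployment.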

beaabea, published 2026/4/7, updated 2026/7/2338 views

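This article documents the complete process of deploying DeepSeek R1 7B, an LLM built on a Qwen base, on the Rockchip RK3588 SoC. It starts with adapting and flashing the board's driver and ends with accessing the model both from the board's terminal and from a web page over the LAN.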

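I. Project Background

I have an idle arm64 development board from Firefly built around the Rockchip RK3588 SoC. My previous work was mainly deploying CV models on RK boards, and I have had little exposure to LLMs and VLMs. But large models are clearly where things are heading, and Rockchip has been prompt in adapting the major open-source LLMs and VLMs, so following the development manual is enough to complete the deployment.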

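II. Required Tools

1. Hardware

1. x86 PC with an Ubuntu 20.04 virtual machine

First, an x86 PC with VMware installed. Ubuntu 20.04 is recommended for the guest system because it is relatively stable.

2. An RK3588 development board with NPU driver 0.9.8

Next, prepare the RK3588 board. This step matters because the board's NPU driver version must be checked. Run: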

sudo cat /sys/kernel/debug/rknpu/version

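This prints the board's current NPU driver version. Mine was 0.8.2. That version works fine with the .so libraries used by the earlier CV models, such as librknnrt.so, but Rockchip's LLM stack calls the newer librkllmrt.so, which requires NPU driver 0.9.8 at minimum.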

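Upgrading to 0.9.8 or later brings these key benefits:

RKLLM (large model) support: earlier drivers (such as 0.9.2) mainly targeted traditional computer-vision models (YOLO, ResNet, and so on). LLMs introduce Transformer operators, KV cache optimizations, and similar features, which need low-level support that arrives with driver 0.9.6+ or even 0.9.8+.
Performance: newer drivers improve memory management and multi-core scheduling, so inference runs faster.
Bug fixes: they fix NPU hangs and memory leaks that older versions could hit under heavy load.

So if your board's current NPU driver is older, you need to reflash the board. Back up all data first, because flashing wipes everything on it.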
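The version check can also be scripted so that a deployment fails fast on an old driver. The following is a minimal sketch: it assumes the version file prints a line such as `RKNPU driver: v0.8.2` (the exact wording may differ between kernels), and `parse_npu_version` and `meets_minimum` are hypothetical helper names.

```python
import re

MIN_VERSION = (0, 9, 8)  # minimum NPU driver version stated above for the RKLLM runtime


def parse_npu_version(text):
    """Extract (major, minor, patch) from the version file's text.

    Assumes a dotted x.y.z number appears somewhere in the text,
    e.g. 'RKNPU driver: v0.8.2'; the exact wording is an assumption.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text, minimum=MIN_VERSION):
    """Return True when the reported driver version is at least `minimum`."""
    return parse_npu_version(text) >= minimum


# Sample string standing in for the content of /sys/kernel/debug/rknpu/version.
print(meets_minimum("RKNPU driver: v0.8.2"))
```

On the board, the text would come from reading /sys/kernel/debug/rknpu/version, which normally requires root, as the sudo in the command above suggests.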

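The flashing procedure for the Firefly RK3588 is as follows:

Go to the official Rockchip documentation to download the firmware: under 'Firmware', choose 'Ubuntu firmware'.
Open it in a browser and pick the archive under the Ubuntu22.04/SDesktop/kernel-6.1 folder.
After the download finishes, extract the archive; the .img file is inside the folder.
Locate the RKDevTool flashing tool and the RK driver assistant, and install the driver.
Open RKDevTool, power on the board, and then connect it to the PC with a Type-C data cable.
The bottom of the window now reports that no device was found, so put the board into recovery mode: disconnect the power adapter, connect the USB end of the cable to the host and the Type-C end to the board's Type-C port, press and hold the RECOVERY key, plug the power back in, and release RECOVERY after about two seconds.
The window now shows that one LOADER device was found. Click [Upgrade Firmware] in the top menu bar, click [Firmware], select the firmware image, and click [Open].
Click [Upgrade]. The status panel on the right shows the firmware being downloaded. Once the download succeeds, the board reboots automatically.
After the reboot, open a terminal on the board and run sudo cat /sys/kernel/debug/rknpu/version again to confirm that the NPU driver has been upgraded to 0.9.8.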

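2. Software

On the software side, the main task is to get the models ready. You can fetch the RKLLM and RKNN models that Rockchip has already converted, but it is better to download the open-source weights (the .safetensors files) directly from Hugging Face, so that you go through the whole model conversion process yourself.

You also need to download RKNN-LLM-release. The conversion environment setup and the model inference are both done inside the RKNN-LLM-release project.

Hugging Face: the DeepSeek-R1-Distill-Qwen-7B model repository.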
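Before conversion it helps to confirm that the download is complete. The sketch below checks a local model folder for the files a Hugging Face checkpoint of this kind typically contains. The file names in `REQUIRED_FILES` and the helper `missing_model_files` are assumptions for illustration, not a list taken from the repository.

```python
from pathlib import Path

# Assumed minimum set of files for a Hugging Face transformer checkpoint.
REQUIRED_FILES = ("config.json", "tokenizer_config.json")


def missing_model_files(model_dir):
    """Return a list describing what is missing from a downloaded model folder."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Weights may be split across several shards, so any *.safetensors file counts.
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing
```

Calling `missing_model_files` on the model folder returns an empty list when nothing from the assumed set is missing.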

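III. Obtaining the .safetensors Model Weights

Download the DeepSeek-R1-Distill-Qwen-7B weights from the Hugging Face repository listed above into a local model folder; the conversion scripts below point at that folder.

IV. Converting safetensors to RKLLM

1. Setting up the conversion environment

conda create -n rkllm123 python=3.10
conda activate rkllm123

pip install rkllm_toolkit-1.2.3-cp310-cp310-linux_x86_64.whl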

2. Model conversion

export_model.py:

from rkllm.api import RKLLM
import sys

# 1. Paths
model_path = '/xxx/RKNN-LLM/rkllm/DeepSeek-R1-Distill-Qwen-7B'  # path to your model folder
platform = 'rk3588'
# Output file name
export_path = f'DeepSeek-R1-Distill-Qwen-7B_W8A8_{platform}.rkllm'

# 2. Initialize
llm = RKLLM()

# 3. Load the model
print(">>> Loading model...")
# Note: the DeepSeek-R1 7B model is large. Use device='cpu' to avoid running out of GPU memory,
# unless your PC has an NVIDIA card with more than 24 GB of VRAM.
ret = llm.load_huggingface(
    model=model_path,
    device='cpu',
    dtype='float16'  # load as float16 to save memory
)
if ret != 0:
    print("Model Load Failed!")
    sys.exit(ret)

# 4. Build the model (quantization)
print(">>> Building model (Quantization W8A8)...")
# W8A8 quantization is recommended for the 7B model; W4A16 may lose noticeably more accuracy
dataset = './data_quant.json'  # calibration file produced by generate_data.py
ret = llm.build(
    do_quantization=True,
    optimization_level=1,
    quantized_dtype='w8a8',  # W8A8 is recommended on RK3588
    quantized_algorithm='normal',
    target_platform=platform,
    num_npu_core=3,  # the RK3588 NPU has only 3 cores
    dataset=dataset
)
if ret != 0:
    print("Model Build Failed!")
    sys.exit(ret)

# 5. Export the model
print(f">>> Exporting model to {export_path}...")
ret = llm.export_rkllm(export_path)
if ret != 0:
    print("Model Export Failed!")
    sys.exit(ret)
print("\n\nConversion succeeded! Push the .rkllm file to the board for testing.")
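The advice to load on the CPU and the choice of W8A8 both come down to how much memory the weights need. As a rough back-of-the-envelope sketch, assuming about 7.6 billion parameters for this model (an approximate figure, not taken from the article) and ignoring activations, KV cache, and quantization metadata:

```python
GIB = 1024 ** 3


def weight_size_gib(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / GIB


PARAMS = 7.6e9  # assumed approximate parameter count for a "7B" Qwen-based model

for label, bits in (("float16", 16), ("w8a8", 8)):
    print(f"{label}: {weight_size_gib(PARAMS, bits):.1f} GiB")
```

Under these assumptions the float16 weights take twice the space of the 8-bit ones, which is why the conversion host needs ample RAM and why the quantized file is the one intended for the board.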

generate_data.py:

import json
from transformers import AutoTokenizer

# Change this to the actual path of the model you downloaded
model_path = '/xxx/RKNN-LLM/rkllm/DeepSeek-R1-Distill-Qwen-7B'

# Calibration prompts (a mix of Chinese and English covering different scenarios)
prompts = [
    "你好，请介绍一下你自己。",
    "Explain the theory of relativity in simple terms.",
    "写一首关于春天的七言绝句。",
    "Solve the equation: 2x + 5 = 15.",
    "瑞芯微 RK3588 芯片的主要特点是什么？",
    "What implies 'DeepSeek-R1'?",
    "将以下 JSON 字符串转换为 Python 字典：{'a': 1, 'b': 2}",
    "请帮我写一个 Python 脚本，实现快速排序算法。"
]

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
data_list = []
for prompt in prompts:
    # Build the conversation format
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Format required for the RKLLM quantization dataset: {"input": ..., "target": ...}
    # target may be empty; calibration mainly uses input
    data_list.append({"input": text, "target": ""})

# Save as a JSON file
with open('data_quant.json', 'w', encoding='utf-8') as f:
    json.dump(data_list, f, ensure_ascii=False, indent=4)
print("Quantization data generated: data_quant.json")
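A malformed calibration file only shows up once the build step is running, so a quick structural check can save a long wait. This sketch verifies the layout that the script above writes: a JSON list of objects with string "input" and "target" fields. `check_quant_dataset` is a hypothetical helper, and the rule that "input" must be non-empty is an assumption.

```python
import json


def check_quant_dataset(path):
    """Load a calibration file and return the number of valid samples.

    Raises ValueError when the structure differs from a list of
    {"input": str, "target": str} objects.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list) or not data:
        raise ValueError("expected a non-empty JSON list")
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"sample {index} is not an object")
        for key in ("input", "target"):
            if not isinstance(item.get(key), str):
                raise ValueError(f"sample {index} lacks a string '{key}' field")
        if not item["input"].strip():
            raise ValueError(f"sample {index} has an empty 'input'")
    return len(data)
```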

python generate_data.py

python export_model.py

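V. Deploying the RKLLM Model on the Board and Running Inference

The key change to the demo source is the chat template call:

// [Modification start] Adapt the ChatML template for DeepSeek-R1 / Qwen
// Parameter order: handle, system_prompt, prompt_prefix, prompt_postfix
rkllm_set_chat_template(llmHandle, "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n", "<|im_start|>user\n", "<|im_end|>\n<|im_start|>assistant\n");
// [Modification end]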
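To see what the three template strings amount to, the sketch below assembles a single-turn prompt in Python. It assumes the runtime simply concatenates system prompt, prefix, user text, and postfix in that order, which is an assumption based on the parameter names rather than documented behavior. `render_prompt` is a hypothetical helper.

```python
SYSTEM_PROMPT = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
PROMPT_PREFIX = "<|im_start|>user\n"
PROMPT_POSTFIX = "<|im_end|>\n<|im_start|>assistant\n"


def render_prompt(user_text):
    """Concatenate the template pieces around one user message (assumed order)."""
    return SYSTEM_PROMPT + PROMPT_PREFIX + user_text + PROMPT_POSTFIX


print(render_prompt("Hello"))
```

The newlines after each role name and after each `<|im_end|>` are part of the strings, which is the difference from the space-separated variant that appears commented out in the server script.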

// Copyright (c) 2025 by Rockchip Electronics Co., Ltd. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <string.h>
#include <unistd.h>
#include <cstdio>   // printf
#include <cstdlib>  // std::atoi, exit
#include <string>
#include "rkllm.h"
#include <fstream>
#include <iostream>
#include <csignal>
#include <vector>
using namespace std;

LLMHandle llmHandle = nullptr;
void exit_handler(int signal){
    if(llmHandle != nullptr){
        cout << "Program is about to exit" << endl;
        LLMHandle _tmp = llmHandle;
        llmHandle = nullptr;
        rkllm_destroy(_tmp);
    }
    exit(signal);
}

int callback(RKLLMResult *result,void*userdata, LLMCallState state){
    if(state == RKLLM_RUN_FINISH){
        printf("\n");
    }else if(state == RKLLM_RUN_ERROR){
        printf("\nrun error\n");
    }else if(state == RKLLM_RUN_NORMAL){
        /* ================================================================================================================
         When GET_LAST_HIDDEN_LAYER is used, the callback receives the memory pointer last_hidden_layer, the token count num_tokens, and the hidden size embd_size.
         These three values give access to the data in last_hidden_layer.
         Note: read the data inside the current callback; otherwise the next callback releases the pointer.
         ===============================================================================================================*/
        if(result->last_hidden_layer.embd_size != 0 && result->last_hidden_layer.num_tokens != 0){
            int data_size = result->last_hidden_layer.embd_size * result->last_hidden_layer.num_tokens * sizeof(float);
            printf("\ndata_size:%d", data_size);
            std::ofstream outFile("last_hidden_layer.bin", std::ios::binary);
            if(outFile.is_open()){
                outFile.write(reinterpret_cast<const char*>(result->last_hidden_layer.hidden_states), data_size);
                outFile.close();
                std::cout << "Data saved to last_hidden_layer.bin successfully!" << std::endl;
            }else{
                std::cerr <<"Failed to open the file for writing!"<< std::endl;
            }
        }
        printf("%s", result->text);
    }
    return 0;
}

int main(int argc,char**argv){
    if(argc <4){
        std::cerr <<"Usage: "<< argv[0]<<" model_path max_new_tokens max_context_len\n";
        return 1;
    }
    signal(SIGINT, exit_handler);
    printf("rkllm init start\n");
    // Set parameters and initialize
    RKLLMParam param = rkllm_createDefaultParam();
    param.model_path = argv[1];
    // Set sampling parameters
    param.top_k = 1;
    param.top_p = 0.95;
    param.temperature = 0.8;
    param.repeat_penalty = 1.1;
    param.frequency_penalty = 0.0;
    param.presence_penalty = 0.0;
    param.max_new_tokens = std::atoi(argv[2]);
    param.max_context_len = std::atoi(argv[3]);
    param.skip_special_token = true;
    param.extend_param.base_domain_id = 0;
    param.extend_param.embed_flash = 1;
    int ret = rkllm_init(&llmHandle,&param, callback);
    if(ret ==0){
        printf("rkllm init success\n");
    }else{
        printf("rkllm init failed\n");
        exit_handler(-1);
    }
    vector<string> pre_input;
    pre_input.push_back("现有一笼子，里面有鸡和兔子若干只，数一数，共有头 14 个，腿 38 条，求鸡和兔子各有多少只？");
    pre_input.push_back("有 28 位小朋友排成一行，从左边开始数第 10 位是学豆，从右边开始数他是第几位？");
    cout << "\n**********************Enter the number of a question below to get its answer, or type your own input********************\n" << endl;
    for(int i = 0; i < (int)pre_input.size(); i++){
        cout << "[" << i << "] " << pre_input[i] << endl;
    }
    cout << "\n*************************************************************************\n" << endl;
    RKLLMInput rkllm_input;
    memset(&rkllm_input, 0, sizeof(RKLLMInput)); // zero-initialize everything
    // Initialize the infer parameter struct
    RKLLMInferParam rkllm_infer_params;
    memset(&rkllm_infer_params, 0, sizeof(RKLLMInferParam)); // zero-initialize everything
    // 1. Initialize and set the LoRA parameters (if LoRA is needed)
    // RKLLMLoraAdapter lora_adapter;
    // memset(&lora_adapter, 0, sizeof(RKLLMLoraAdapter));
    // lora_adapter.lora_adapter_path = "qwen0.5b_fp16_lora.rkllm";
    // lora_adapter.lora_adapter_name = "test";
    // lora_adapter.scale = 1.0;
    // ret = rkllm_load_lora(llmHandle, &lora_adapter);
    // if (ret != 0) {
    //     printf("\nload lora failed\n");
    // }
    // Load a second LoRA
    // lora_adapter.lora_adapter_path = "Qwen2-0.5B-Instruct-all-rank8-F16-LoRA.gguf";
    // lora_adapter.lora_adapter_name = "knowledge_old";
    // lora_adapter.scale = 1.0;
    // ret = rkllm_load_lora(llmHandle, &lora_adapter);
    // if (ret != 0) {
    //     printf("\nload lora failed\n");
    // }
    // RKLLMLoraParam lora_params;
    // lora_params.lora_adapter_name = "test"; // name of the LoRA used for inference
    // rkllm_infer_params.lora_params = &lora_params;
    // 2. Initialize and set the prompt cache parameters (if prompt cache is needed)
    // RKLLMPromptCacheParam prompt_cache_params;
    // prompt_cache_params.save_prompt_cache = true; // whether to save the prompt cache
    // prompt_cache_params.prompt_cache_path = "./prompt_cache.bin"; // cache file path, needed when the prompt cache is saved
    // rkllm_infer_params.prompt_cache_params = &prompt_cache_params;
    // rkllm_load_prompt_cache(llmHandle, "./prompt_cache.bin"); // load a saved cache
    rkllm_infer_params.mode = RKLLM_INFER_GENERATE;
    // By default, the chat operates in single-turn mode (no context retention)
    // 0 means no history is retained, each query is independent
    rkllm_infer_params.keep_history = 0;
    //The model has a built-in chat template by default, which defines how prompts are formatted
    //for conversation. Users can modify this template using this function to customize the
    //system prompt, prefix, and postfix according to their needs.
    // rkllm_set_chat_template(llmHandle, "", "<｜User｜>", "<｜Assistant｜>");
    // [Modification start] Adapt the ChatML template for DeepSeek-R1 / Qwen
    // Parameter order: handle, system_prompt, prompt_prefix, prompt_postfix
    rkllm_set_chat_template(llmHandle, "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n", "<|im_start|>user\n", "<|im_end|>\n<|im_start|>assistant\n");
    // [Modification end]
    while(true){
        std::string input_str;
        printf("\n");
        printf("user: ");
        std::getline(std::cin, input_str);
        if(input_str =="exit"){break;}
        if(input_str =="clear"){
            ret = rkllm_clear_kv_cache(llmHandle,1,nullptr,nullptr);
            if(ret !=0){printf("clear kv cache failed!\n");}
            continue;
        }
        for(int i =0; i <(int)pre_input.size(); i++){
            if(input_str ==to_string(i)){
                input_str = pre_input[i];
                cout << input_str << endl;
            }
        }
        rkllm_input.input_type = RKLLM_INPUT_PROMPT;
        rkllm_input.role ="user";
        rkllm_input.prompt_input =(char*)input_str.c_str();
        printf("robot: ");
        // For normal inference, set rkllm_infer_mode to RKLLM_INFER_GENERATE or leave the parameter unset
        rkllm_run(llmHandle,&rkllm_input,&rkllm_infer_params,NULL);
    }
    rkllm_destroy(llmHandle);
    return 0;
}
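The demo sets top_k to 1 together with top_p and temperature. In the usual top-k sampling scheme, keeping only one candidate makes decoding greedy, so the other two settings stop mattering. The sketch below illustrates that scheme with the standard library only. It is not RKLLM's sampler, and `sample_top_k` is a hypothetical function.

```python
import math
import random


def sample_top_k(logits, top_k, temperature, rng):
    """Sample a token index from the top_k highest logits after temperature scaling."""
    # Keep the indices of the top_k largest logits.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax weights over the kept logits (max subtracted for numerical stability).
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


logits = [0.1, 2.5, -1.0, 2.4]
rng = random.Random(0)
# With top_k=1 every draw returns the index of the largest logit, whatever the temperature.
picks = {sample_top_k(logits, 1, t, rng) for t in (0.2, 0.8, 1.5) for _ in range(20)}
print(picks)
```

Raising top_k above 1 is what would let temperature and top_p influence the output in a scheme like this one.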

#!/bin/bash
# Debug / Release / RelWithDebInfo
set -e
if [[ -z ${BUILD_TYPE} ]];
then
    BUILD_TYPE=Release
fi
# ================= Key change =================
# Building natively on the board: use the short compiler names, which the system finds under /usr/bin
C_COMPILER=gcc
CXX_COMPILER=g++
# ===========================================
TARGET_ARCH=aarch64
TARGET_PLATFORM=linux
if [[ -n ${TARGET_ARCH} ]];
then
    TARGET_PLATFORM=${TARGET_PLATFORM}_${TARGET_ARCH}
fi
ROOT_PWD=$( cd "$( dirname $0 )" && cd -P "$( dirname "$SOURCE" )" && pwd )
BUILD_DIR=${ROOT_PWD}/build/build_${TARGET_PLATFORM}_${BUILD_TYPE}
if [[ ! -d "${BUILD_DIR}" ]];
then
    mkdir -p ${BUILD_DIR}
fi
cd ${BUILD_DIR}
cmake ../.. \
    -DCMAKE_SYSTEM_PROCESSOR=${TARGET_ARCH} \
    -DCMAKE_SYSTEM_NAME=Linux \
    -DCMAKE_C_COMPILER=${C_COMPILER} \
    -DCMAKE_CXX_COMPILER=${CXX_COMPILER} \
    -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
    -DCMAKE_POSITION_INDEPENDENT_CODE=ON
make -j4
make install

bash build-linux.sh

export LD_LIBRARY_PATH=~/rknn-llm-release-v1.2.3/rkllm-runtime/Linux/librkllm_api/aarch64:$LD_LIBRARY_PATH

cd /xxx/rknn-llm-release-v1.2.3/examples/rkllm_api_demo/deploy/install/demo_Linux_aarch64
# Usage: ./llm_demo <model path> <max new tokens> <max context length>
./llm_demo /home/firefly/models/DeepSeek-R1-Distill-Qwen-7B_W8A8_rk3588.rkllm 512 2048
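If the demo is launched from a script rather than an interactive shell, the library path can be set for just that process instead of being exported. A minimal sketch, where `with_library_path` is a hypothetical helper and `/opt/rkllm/lib` is a placeholder directory:

```python
import os


def with_library_path(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    lib_dir = os.path.expanduser(lib_dir)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = f"{lib_dir}:{existing}" if existing else lib_dir
    return env


# Placeholder values, shown with an explicit base environment for clarity.
env = with_library_path("/opt/rkllm/lib", {"LD_LIBRARY_PATH": "/usr/lib"})
print(env["LD_LIBRARY_PATH"])
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting `./llm_demo`.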

VI. Integrating the Open-Source Gradio Tool for Web Access

pip3 install gradio

cd ~/rknn-llm-release-v1.2.3/examples/rkllm_server_demo/rkllm_server

cp ~/rknn-llm-release-v1.2.3/rkllm-runtime/Linux/librkllm_api/aarch64/librkllmrt.so ./lib/
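The server script loads this library through ctypes and mirrors the structs from rkllm.h by hand, so a wrong field type or order silently shifts every later field. One way to catch that early is to print the offsets ctypes computes and compare them with the header. The sketch below does so for the extend-param struct as the script declares it. The name `describe_layout` is hypothetical, and the numbers it prints are what ctypes derives on the current machine, not values confirmed against the header.

```python
import ctypes


class RKLLMExtendParam(ctypes.Structure):
    # Field list copied from the server script's declaration.
    _fields_ = [
        ("base_domain_id", ctypes.c_int32),
        ("embed_flash", ctypes.c_int8),
        ("enabled_cpus_num", ctypes.c_int8),
        ("enabled_cpus_mask", ctypes.c_uint32),
        ("n_batch", ctypes.c_uint8),
        ("use_cross_attn", ctypes.c_int8),
        ("reserved", ctypes.c_uint8 * 104),
    ]


def describe_layout(struct_type):
    """Return (field name, byte offset, byte size) for each field of a ctypes struct."""
    return [
        (name, getattr(struct_type, name).offset, getattr(struct_type, name).size)
        for name, *_ in struct_type._fields_
    ]


for name, offset, size in describe_layout(RKLLMExtendParam):
    print(f"{name:18s} offset={offset:3d} size={size}")
print("total size:", ctypes.sizeof(RKLLMExtendParam))
```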

import ctypes
import sys
import os
import subprocess
import resource
import threading
import time
import gradio as gr
import argparse

# PROMPT_TEXT_PREFIX = "<|im_start|>system You are a helpful assistant. <|im_end|> <|im_start|>user"
# PROMPT_TEXT_POSTFIX = "<|im_end|><|im_start|>assistant"

# Set environment variables
os.environ["GRADIO_SERVER_NAME"]="0.0.0.0"
os.environ["GRADIO_SERVER_PORT"]="8080"

# Set the dynamic library path
rkllm_lib = ctypes.CDLL('lib/librkllmrt.so')

# Define the structures from the library
RKLLM_Handle_t = ctypes.c_void_p
userdata = ctypes.c_void_p(None)
LLMCallState = ctypes.c_int
LLMCallState.RKLLM_RUN_NORMAL = 0
LLMCallState.RKLLM_RUN_WAITING = 1
LLMCallState.RKLLM_RUN_FINISH = 2
LLMCallState.RKLLM_RUN_ERROR = 3
RKLLMInputType = ctypes.c_int
RKLLMInputType.RKLLM_INPUT_PROMPT = 0
RKLLMInputType.RKLLM_INPUT_TOKEN = 1
RKLLMInputType.RKLLM_INPUT_EMBED = 2
RKLLMInputType.RKLLM_INPUT_MULTIMODAL = 3
RKLLMInferMode = ctypes.c_int
RKLLMInferMode.RKLLM_INFER_GENERATE = 0
RKLLMInferMode.RKLLM_INFER_GET_LAST_HIDDEN_LAYER = 1
RKLLMInferMode.RKLLM_INFER_GET_LOGITS = 2

class RKLLMExtendParam(ctypes.Structure):
    _fields_ = [
        ("base_domain_id", ctypes.c_int32),
        ("embed_flash", ctypes.c_int8),
        ("enabled_cpus_num", ctypes.c_int8),
        ("enabled_cpus_mask", ctypes.c_uint32),
        ("n_batch", ctypes.c_uint8),
        ("use_cross_attn", ctypes.c_int8),
        ("reserved", ctypes.c_uint8 * 104)
    ]

class RKLLMParam(ctypes.Structure):
    _fields_ = [
        ("model_path", ctypes.c_char_p),
        ("max_context_len", ctypes.c_int32),
        ("max_new_tokens", ctypes.c_int32),
        ("top_k", ctypes.c_int32),
        ("n_keep", ctypes.c_int32),
        ("top_p", ctypes.c_float),
        ("temperature", ctypes.c_float),
        ("repeat_penalty", ctypes.c_float),
        ("frequency_penalty", ctypes.c_float),
        ("presence_penalty", ctypes.c_float),
        ("mirostat", ctypes.c_int32),
        ("mirostat_tau", ctypes.c_float),
        ("mirostat_eta", ctypes.c_float),
        ("skip_special_token", ctypes.c_bool),
        ("is_async", ctypes.c_bool),
        ("img_start", ctypes.c_char_p),
        ("img_end", ctypes.c_char_p),
        ("img_content", ctypes.c_char_p),
        ("extend_param", RKLLMExtendParam),
    ]

class RKLLMLoraAdapter(ctypes.Structure):
    _fields_ = [
        ("lora_adapter_path", ctypes.c_char_p),
        ("lora_adapter_name", ctypes.c_char_p),
        ("scale", ctypes.c_float)
    ]

class RKLLMEmbedInput(ctypes.Structure):
    _fields_ = [
        ("embed", ctypes.POINTER(ctypes.c_float)),
        ("n_tokens", ctypes.c_size_t)
    ]

class RKLLMTokenInput(ctypes.Structure):
    _fields_ = [
        ("input_ids", ctypes.POINTER(ctypes.c_int32)),
        ("n_tokens", ctypes.c_size_t)
    ]

class RKLLMMultiModalInput(ctypes.Structure):
    _fields_ = [
        ("prompt", ctypes.c_char_p),
        ("image_embed", ctypes.POINTER(ctypes.c_float)),
        ("n_image_tokens", ctypes.c_size_t),
        ("n_image", ctypes.c_size_t),
        ("image_width", ctypes.c_size_t),
        ("image_height", ctypes.c_size_t)
    ]

class RKLLMInputUnion(ctypes.Union):
    _fields_ = [
        ("prompt_input", ctypes.c_char_p),
        ("embed_input", RKLLMEmbedInput),
        ("token_input", RKLLMTokenInput),
        ("multimodal_input", RKLLMMultiModalInput)
    ]

class RKLLMInput(ctypes.Structure):
    _fields_ = [
        ("role", ctypes.c_char_p),
        ("enable_thinking", ctypes.c_bool),
        ("input_type", RKLLMInputType),
        ("input_data", RKLLMInputUnion)
    ]

class RKLLMLoraParam(ctypes.Structure):
    _fields_ = [
        ("lora_adapter_name", ctypes.c_char_p)
    ]

class RKLLMPromptCacheParam(ctypes.Structure):
    _fields_ = [
        ("save_prompt_cache", ctypes.c_int),
        ("prompt_cache_path", ctypes.c_char_p)
    ]

class RKLLMInferParam(ctypes.Structure):
    _fields_ = [
        ("mode", RKLLMInferMode),
        ("lora_params", ctypes.POINTER(RKLLMLoraParam)),
        ("prompt_cache_params", ctypes.POINTER(RKLLMPromptCacheParam)),
        ("keep_history", ctypes.c_int)
    ]

class RKLLMResultLastHiddenLayer(ctypes.Structure):
    _fields_ = [
        ("hidden_states", ctypes.POINTER(ctypes.c_float)),
        ("embd_size", ctypes.c_int),
        ("num_tokens", ctypes.c_int)
    ]

class RKLLMResultLogits(ctypes.Structure):
    _fields_ = [
        ("logits", ctypes.POINTER(ctypes.c_float)),
        ("vocab_size", ctypes.c_int),
        ("num_tokens", ctypes.c_int)
    ]

class RKLLMPerfStat(ctypes.Structure):
    _fields_ = [
        ("prefill_time_ms", ctypes.c_float),
        ("prefill_tokens", ctypes.c_int),
        ("generate_time_ms", ctypes.c_float),
        ("generate_tokens", ctypes.c_int),
        ("memory_usage_mb", ctypes.c_float)
    ]

class RKLLMResult(ctypes.Structure):
    _fields_ = [
        ("text", ctypes.c_char_p),
        ("token_id", ctypes.c_int),
        ("last_hidden_layer", RKLLMResultLastHiddenLayer),
        ("logits", RKLLMResultLogits),
        ("perf", RKLLMPerfStat)
    ]

# Define global variables to store the callback function output for displaying in the Gradio interface
global_text = []
global_state = -1
split_byte_data = bytes(b"") # Used to store the segmented byte data

# Define the callback function
def callback_impl(result, userdata, state):
    global global_text, global_state, split_byte_data
    if state == LLMCallState.RKLLM_RUN_FINISH:
        global_state = state
        print("\n")
        sys.stdout.flush()
    elif state == LLMCallState.RKLLM_RUN_ERROR:
        global_state = state
        print("run error")
        sys.stdout.flush()
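    elif state == LLMCallState.RKLLM_RUN_NORMAL:
        global_state = state
        # A multi-byte UTF-8 character can be split across callbacks, so keep
        # undecodable bytes in split_byte_data and retry on the next call.
        try:
            global_text.append((split_byte_data + result.contents.text).decode('utf-8'))
            split_byte_data = bytes(b"")
        except UnicodeDecodeError:
            split_byte_data += result.contents.text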
    return 0

# Connect the callback function between the Python side and the C++ side
callback_type = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(RKLLMResult), ctypes.c_void_p, ctypes.c_int)
callback = callback_type(callback_impl)

# Define the RKLLM class, which includes initialization, inference, and release operations for the RKLLM model in the dynamic library
class RKLLM(object):
    def __init__(self, model_path, lora_model_path = None, prompt_cache_path = None, platform = "rk3588"):
        rkllm_param = RKLLMParam()
        rkllm_param.model_path = bytes(model_path,'utf-8')
        rkllm_param.max_context_len = 4096
        rkllm_param.max_new_tokens = 4096
        rkllm_param.skip_special_token = True
        rkllm_param.n_keep = -1
        rkllm_param.top_k = 1
        rkllm_param.top_p = 0.9
        rkllm_param.temperature = 0.8
        rkllm_param.repeat_penalty = 1.1
        rkllm_param.frequency_penalty = 0.0
        rkllm_param.presence_penalty = 0.0
        rkllm_param.mirostat = 0
        rkllm_param.mirostat_tau = 5.0
        rkllm_param.mirostat_eta = 0.1
        rkllm_param.is_async = False
        rkllm_param.img_start = "".encode('utf-8')
        rkllm_param.img_end = "".encode('utf-8')
        rkllm_param.img_content = "".encode('utf-8')
        rkllm_param.extend_param.base_domain_id = 0
        rkllm_param.extend_param.embed_flash = 1
        rkllm_param.extend_param.n_batch = 1
        rkllm_param.extend_param.use_cross_attn = 0
        rkllm_param.extend_param.enabled_cpus_num = 4
        if platform.lower() in ["rk3576","rk3588"]:
            rkllm_param.extend_param.enabled_cpus_mask = (1<<4)|(1<<5)|(1<<6)|(1<<7)
        else:
            rkllm_param.extend_param.enabled_cpus_mask = (1<<0)|(1<<1)|(1<<2)|(1<<3)
        self.handle = RKLLM_Handle_t()
        self.rkllm_init = rkllm_lib.rkllm_init
        self.rkllm_init.argtypes = [ctypes.POINTER(RKLLM_Handle_t), ctypes.POINTER(RKLLMParam), callback_type]
        self.rkllm_init.restype = ctypes.c_int
        ret = self.rkllm_init(ctypes.byref(self.handle), ctypes.byref(rkllm_param), callback)
        if(ret !=0):
            print("\nrkllm init failed\n")
            exit(0)
        else:
            print("\nrkllm init success!\n")
        self.rkllm_run = rkllm_lib.rkllm_run
        self.rkllm_run.argtypes = [RKLLM_Handle_t, ctypes.POINTER(RKLLMInput), ctypes.POINTER(RKLLMInferParam), ctypes.c_void_p]
        self.rkllm_run.restype = ctypes.c_int
        self.set_chat_template = rkllm_lib.rkllm_set_chat_template
        self.set_chat_template.argtypes = [RKLLM_Handle_t, ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p]
        self.set_chat_template.restype = ctypes.c_int
        #★★★★★ Comment out the original template and replace it
        # system_prompt = "<|im_start|>system You are a helpful assistant. <|im_end|>"
        # prompt_prefix = "<|im_start|>user"
        # prompt_postfix = "<|im_end|><|im_start|>assistant"
        # # self.set_chat_template(self.handle, ctypes.c_char_p(system_prompt.encode('utf-8')), ctypes.c_char_p(prompt_prefix.encode('utf-8')), ctypes.c_char_p(prompt_postfix.encode('utf-8')))
        system_prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        prompt_prefix = "<|im_start|>user\n"
        prompt_postfix = "<|im_end|>\n<|im_start|>assistant\n"
        self.set_chat_template(self.handle, ctypes.c_char_p(system_prompt.encode('utf-8')), ctypes.c_char_p(prompt_prefix.encode('utf-8')), ctypes.c_char_p(prompt_postfix.encode('utf-8')))
        self.rkllm_destroy = rkllm_lib.rkllm_destroy
        self.rkllm_destroy.argtypes = [RKLLM_Handle_t]
        self.rkllm_destroy.restype = ctypes.c_int
        rkllm_lora_params = None
        if lora_model_path:
            lora_adapter_name = "test"
            lora_adapter = RKLLMLoraAdapter()
            ctypes.memset(ctypes.byref(lora_adapter),0, ctypes.sizeof(RKLLMLoraAdapter))
            lora_adapter.lora_adapter_path = ctypes.c_char_p((lora_model_path).encode('utf-8'))
            lora_adapter.lora_adapter_name = ctypes.c_char_p((lora_adapter_name).encode('utf-8'))
            lora_adapter.scale = 1.0
            rkllm_load_lora = rkllm_lib.rkllm_load_lora
            rkllm_load_lora.argtypes = [RKLLM_Handle_t, ctypes.POINTER(RKLLMLoraAdapter)]
            rkllm_load_lora.restype = ctypes.c_int
            rkllm_load_lora(self.handle, ctypes.byref(lora_adapter))
            rkllm_lora_params = RKLLMLoraParam()
            rkllm_lora_params.lora_adapter_name = ctypes.c_char_p((lora_adapter_name).encode('utf-8'))
        self.rkllm_infer_params = RKLLMInferParam()
        ctypes.memset(ctypes.byref(self.rkllm_infer_params),0, ctypes.sizeof(RKLLMInferParam))
        self.rkllm_infer_params.mode = RKLLMInferMode.RKLLM_INFER_GENERATE
        self.rkllm_infer_params.lora_params = ctypes.pointer(rkllm_lora_params) if rkllm_lora_params else None
        self.rkllm_infer_params.keep_history = 0
        self.prompt_cache_path = None
        if prompt_cache_path:
            self.prompt_cache_path = prompt_cache_path
            # Load the cache only when a path was given; calling .encode() on None would raise
            rkllm_load_prompt_cache = rkllm_lib.rkllm_load_prompt_cache
            rkllm_load_prompt_cache.argtypes = [RKLLM_Handle_t, ctypes.c_char_p]
            rkllm_load_prompt_cache.restype = ctypes.c_int
            rkllm_load_prompt_cache(self.handle, ctypes.c_char_p((prompt_cache_path).encode('utf-8')))

    def run(self, prompt):
        rkllm_input = RKLLMInput()
        rkllm_input.role = "user".encode('utf-8')
        rkllm_input.enable_thinking = ctypes.c_bool(False)
        rkllm_input.input_type = RKLLMInputType.RKLLM_INPUT_PROMPT
        rkllm_input.input_data.prompt_input = ctypes.c_char_p(prompt.encode('utf-8'))
        self.rkllm_run(self.handle, ctypes.byref(rkllm_input), ctypes.byref(self.rkllm_infer_params),None)
        return

    def release(self):
        self.rkllm_destroy(self.handle)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--rkllm_model_path',type=str, required=True,help='Absolute path of the converted RKLLM model on the Linux board;')
    parser.add_argument('--target_platform',type=str, required=True,help='Target platform: e.g. rk3588/rk3576;')
    parser.add_argument('--lora_model_path',type=str,help='Absolute path of the lora_model on the Linux board;')
    parser.add_argument('--prompt_cache_path',type=str,help='Absolute path of the prompt_cache file on the Linux board;')
    args = parser.parse_args()
    if not os.path.exists(args.rkllm_model_path):
        print("Error: Please provide the correct rkllm model path, and ensure it is the absolute path on the board.")
        sys.stdout.flush()
        exit()
    if not(args.target_platform in ["rk3588","rk3576","rv1126b","rk3562"]):
        print("Error: Please specify the correct target platform: rk3588/rk3576/rv1126b/rk3562.")
        sys.stdout.flush()
        exit()
    if args.lora_model_path:
        if not os.path.exists(args.lora_model_path):
            print("Error: Please provide the correct lora_model path, and ensure it is the absolute path on the board.")
            sys.stdout.flush()
            exit()
    if args.prompt_cache_path:
        if not os.path.exists(args.prompt_cache_path):
            print("Error: Please provide the correct prompt_cache_file path, and ensure it is the absolute path on the board.")
            sys.stdout.flush()
            exit()
    # Fix frequency
    #★★★★★ Comment out the lines below
    # command = "sudo bash fix_freq_{}.sh".format(args.target_platform)
    # subprocess.run(command, shell=True)
    # Set resource limit
    resource.setrlimit(resource.RLIMIT_NOFILE,(102400,102400))
    # Initialize RKLLM model
    print("=========init....===========")
    sys.stdout.flush()
    model_path = args.rkllm_model_path
    rkllm_model = RKLLM(model_path, args.lora_model_path, args.prompt_cache_path, args.target_platform)
    print("==============================")
    sys.stdout.flush()
    # Record the user's input prompt
    def get_user_input(user_message, history):
        history = history + [[user_message,None]]
        return "", history
    # Retrieve the output from the RKLLM model and print it in a streaming manner
    def get_RKLLM_output(history):
        # Link global variables to retrieve the output information from the callback function
        global global_text, global_state
        global_text = []
        global_state = -1
        # Create a thread for model inference
        model_thread = threading.Thread(target=rkllm_model.run, args=(history[-1][0],))
        model_thread.start()
        # history[-1][1] represents the current dialogue history
        history[-1][1]=""
        # Wait for the model to finish running and periodically check the inference thread of the model
        model_thread_finished = False
        while not model_thread_finished:
            while len(global_text)>0:
                history[-1][1]+= global_text.pop(0)
                time.sleep(0.005)
            # Gradio automatically pushes the result returned by the yield statement when calling the then method
            yield history
            model_thread.join(timeout=0.005)
            model_thread_finished = not model_thread.is_alive()
    # Create a Gradio interface
    with gr.Blocks(title="Chat with RKLLM") as chatRKLLM:
        gr.Markdown("<div><font size='70'> Chat with RKLLM </font></div>")
        gr.Markdown("### Enter your question in the inputTextBox and press the Enter key to chat with the RKLLM model.")
        # Create a Chatbot component to display conversation history
        rkllmServer = gr.Chatbot(height=600)
        # #★★★★★ Modification
        # rkllmServer = gr.Chatbot(height=600, type="tuples")
        # Create a Textbox component for user message input
        msg = gr.Textbox(placeholder="Please input your question here...", label="inputTextBox")
        # Create a Button component to clear the chat history.
        clear = gr.Button("Clear")
        # Submit the user's input message to the get_user_input function and immediately update the chat history.
        # Then call the get_RKLLM_output function to further update the chat history.
        # The queue=False parameter ensures that these updates are not queued, but executed immediately.
        msg.submit(get_user_input,[msg, rkllmServer],[msg, rkllmServer], queue=False).then(get_RKLLM_output, rkllmServer, rkllmServer)
        # When the clear button is clicked, perform a no-operation (lambda: None) and immediately clear the chat history.
        clear.click(lambda:None,None, rkllmServer, queue=False)
        # Enable the event queue system.
        chatRKLLM.queue()
        # Start the Gradio application.
        # chatRKLLM.launch()
        chatRKLLM.launch(server_name="0.0.0.0", server_port=8080)
    print("====================")
    print("RKLLM model inference completed, releasing RKLLM model resources...")
    rkllm_model.release()
    print("====================")

python3 gradio_server.py --rkllm_model_path /home/firefly/rkllm_model_zoo_selfconvert/DeepSeek-R1-Distill-Qwen-7B_W8A8_rk3588.rkllm --target_platform rk3588
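Once the server is up, a machine on the same LAN should be able to open the page on port 8080 of the board's IP address. If the page does not load, a quick TCP check helps tell a network problem apart from a Gradio one. A minimal sketch: `port_is_open` is a hypothetical helper and the address in the usage comment is a placeholder.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True when a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example with a placeholder address: port_is_open("192.168.1.50", 8080)
```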
