Open-Source AI Tool Upgrades on NVIDIA RTX PCs: Performance Optimizations for LLMs and Diffusion Models | 极客日志


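AI-generated summary: The open-source AI toolset for NVIDIA RTX PCs has received a major update covering diffusion-model quantization in ComfyUI, LLM acceleration in llama.cpp and Ollama, audio-video generation with LTX-2, and agent fine-tuning with Nemotron. NVFP4/FP8 quantization, GPU token sampling, and Flash Attention reduce VRAM usage and raise inference throughput, so developers can build high-performance AI applications locally. ComfyUI gains 3-4x performance with NVFP4, llama.cpp achieves a 35% throughput increase on MoE models, LTX-2 supports local 4K 50 fps generation, and Docling offers 4x faster OCR. These optimizations lower the hardware barrier and advance the AI PC ecosystem.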

By 林间仙子 · Published 2026/3/22 · Updated 2026/5/22 · 15 views


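AI Tool Ecosystem Updates

AI development on the PC is growing rapidly. The trend is driven by quality improvements in small language models (SLMs) and diffusion models such as FLUX.2, GPT-OSS-20B, and Nemotron 3 Nano. At the same time, frameworks such as ComfyUI, llama.cpp, Ollama, and Unsloth keep improving, and developers are no longer just experimenting with generative AI workflows: they are building the next generation of software stacks on NVIDIA GPUs.

In its recent updates, NVIDIA released several important improvements for the AI PC developer ecosystem, spanning low-level framework optimizations through high-level application tools. This article examines the latest performance gains in these open-source tools and walks through code examples to help developers make full use of the NVIDIA RTX platform.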

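ComfyUI's Continued Performance Improvements: A New Milestone for Diffusion Model Acceleration

ComfyUI, one of the most popular open-source frameworks for diffusion models, has achieved significant performance gains in collaboration with NVIDIA. Through optimizations in PyTorch-CUDA, ComfyUI now supports the NVFP4 and FP8 quantization formats, which save 60% and 40% of VRAM respectively while substantially improving inference performance. Developers can expect an average 3x speedup with NVFP4 and a 2x speedup with NVFP8.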
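To make the memory savings concrete, here is a back-of-the-envelope sketch of weight storage at each precision. The 12-billion-parameter model size is a hypothetical figure chosen for illustration, and the NVFP4 layout used here (4-bit values plus one 8-bit scale per 16-element block) is an assumption about the format, not something stated in this article. The sketch counts weights only, so its percentages will not match the end-to-end 60% and 40% figures, which also depend on activations and on layers left unquantized.

```python
def weight_gib(num_params, bits_per_weight):
    """Size in GiB of num_params weights stored at bits_per_weight bits each."""
    return num_params * bits_per_weight / 8 / 2**30

NUM_PARAMS = 12e9  # hypothetical model size, for illustration only

# Effective bits per weight. NVFP4 is assumed to store 4-bit values plus
# one 8-bit scale shared by every block of 16 weights.
BITS = {
    "FP16/BF16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.0 + 8.0 / 16,
}

def summarize(num_params=NUM_PARAMS):
    """Return {format: (size_gib, saving_vs_fp16)} for weight storage only."""
    base = weight_gib(num_params, BITS["FP16/BF16"])
    rows = {}
    for name, bits in BITS.items():
        size = weight_gib(num_params, bits)
        rows[name] = (size, 1.0 - size / base)
    return rows

for name, (size, saving) in summarize().items():
    print(f"{name:10s} {size:6.2f} GiB   saving vs FP16: {saving:.0%}")
```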

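ComfyUI Core Optimization Features in Detail

The latest ComfyUI update includes several key optimizations, each targeting a specific bottleneck in diffusion-model inference.

NVFP4 support: Linear layers can run in NVFP4 with optimized kernels, delivering 3-4x the throughput of FP16 and BF16 linear layers. This optimization is especially well suited to large diffusion Transformer models.

Fused FP8 quantize/dequantize kernels: By eliminating memory-bandwidth-bound operations, these fused kernels significantly improve model performance. On NVIDIA RTX GPUs without fourth-generation Tensor Cores (architectures before Ada), FP8 workloads improve further.

Weight streaming: Using concurrent system-memory and CPU compute streams, weight streaming hides memory latency and raises throughput, which is particularly useful on GPUs with limited VRAM.

Mixed-precision support: A model can combine several numeric formats within a single network, allowing fine-grained tuning for the best balance of accuracy and performance.

RMS and RoPE fusion: Memory-bandwidth-bound operators that are common in diffusion Transformers are fused, reducing memory use and latency. This optimization benefits DiT models of every data type.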
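To illustrate what a block-scaled 4-bit floating-point format involves, the sketch below simulates NVFP4-style quantization in plain Python. It assumes the commonly described layout: each weight is stored as a sign plus one of the eight E2M1 magnitudes, and every block of 16 weights shares one scale. For simplicity the scale is kept as an ordinary Python float rather than rounded to FP8, and none of this is ComfyUI's actual kernel code; the function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # weights that share one scale (assumed NVFP4 block size)

def quantize_block(block):
    """Quantize one block to (scale, [(is_negative, code), ...])."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the block onto the largest E2M1 value.
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = abs(v) / scale
        # Round to the nearest representable magnitude.
        code = min(range(len(E2M1_VALUES)), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append((v < 0, code))
    return scale, codes

def dequantize_block(scale, codes):
    """Reconstruct approximate weights from a quantized block."""
    return [(-1.0 if neg else 1.0) * E2M1_VALUES[code] * scale for neg, code in codes]

def fake_quantize(weights):
    """Quantize then dequantize a flat list of weights, block by block."""
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        scale, codes = quantize_block(weights[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out

if __name__ == "__main__":
    weights = [0.02 * i - 0.3 for i in range(32)]
    approx = fake_quantize(weights)
    worst = max(abs(a - b) for a, b in zip(weights, approx))
    print(f"max round-trip error: {worst:.4f}")
```

Because each block gets its own scale, an outlier only degrades the precision of the 16 weights around it rather than the whole tensor, which is the main reason per-block scaling is used for 4-bit formats.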

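ComfyUI NVFP4 Quantization Workflow Code Example

The following shows how to load and run a diffusion model in ComfyUI using the NVFP4 quantization format. One detail to watch: when configuring the quantization parameters, choose a sensible fallback data type so that non-quantized layers do not suffer precision problems.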

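import torch
import numpy as np
from PIL import Image
import comfy.model_management as mm
import comfy.sample
import comfy.utils
from comfy.sd import load_checkpoint_guess_config

# Note: most string and numeric literals in this listing were lost when the
# page was extracted. Lost values appear as `...`; only values that the
# surrounding code makes unambiguous have been restored.

# NVFP4 quantization settings
# enable_nvfp4: enable the NVFP4 quantization format
# nvfp4_linear_only: apply NVFP4 to linear layers only; other layers keep their original precision
quantization_config = {
    'enable_nvfp4': True,       # enable NVFP4 for the 3-4x speedup
    'nvfp4_linear_only': ...,   # value lost in extraction
    ...: torch.float16          # key lost; presumably the fallback dtype for non-quantized layers
}

checkpoint_path = ...  # path lost in extraction

model, clip, vae, clip_vision = load_checkpoint_guess_config(
    checkpoint_path,
    output_vae=...,
    output_clip=...,
    embedding_directory=...,
    quantization=quantization_config  # as given in the source listing
)

device = mm.get_torch_device()
model.to(device)
model.eval()  # method name lost in extraction; eval() assumed for inference

prompt_text = ...  # prompt lost in extraction

tokens = clip.tokenize(prompt_text)
cond, pooled = clip.encode_from_tokens(tokens, return_pooled=True)

# Five sampling parameters were lost in extraction; the call below reads
# the CFG scale, the step count, and the sampler name from this dict.
sampling_params = {}

batch_size = ...
latent_height = ...
latent_width = ...
latent_channels = ...
latent_image = torch.randn(
    batch_size, latent_channels, latent_height, latent_width,
    device=device, dtype=torch.float16
)

with torch.inference_mode():
    samples = comfy.sample.sample(
        model, noise=latent_image, positive=cond, negative=...,
        cfg=sampling_params[...],
        steps=sampling_params[...],
        sampler_name=sampling_params[...]
    )

decoded_images = vae.decode(samples)

# Post-processing; the constants are the usual [-1, 1] -> [0, 255] mapping (assumed)
images = (decoded_images + 1.0) / 2.0
images = torch.clamp(images, 0.0, 1.0)
images = (images * 255).to(torch.uint8)

for i, img_tensor in enumerate(images):
    # (C, H, W) -> (H, W, C) for PIL
    img_np = img_tensor.permute(1, 2, 0).cpu().numpy()
    img_pil = Image.fromarray(img_np)
    img_pil.save(...)  # output filename lost in extraction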


import torch
from comfy.model_patcher import ModelPatcher
import comfy.model_management as mm

# Mixed-precision policy
# Different layer types use different precisions to balance speed and quality
mixed_precision_config = {
    # Attention layers stay in FP16 to preserve quality
    'attention_layers': {
        'dtype': torch.float16,
        'quantize': False  # do not quantize attention layers
    },
    # Linear layers use NVFP4 for maximum performance
    'linear_layers': {
        'dtype': 'nvfp4',
        'quantize': True,
        'calibration': 'minmax'  # min-max calibration
    },
    # Convolution layers use FP8 as a speed/quality compromise
    'conv_layers': {
        'dtype': 'fp8',
        'quantize': True,
        'calibration': 'histogram'  # percentile-based calibration
    },
    # Normalization layers stay in FP32 for numerical stability
    'norm_layers': {
        'dtype': torch.float32,
        'quantize': False
    }
}

def apply_mixed_precision(model, config):
    """Apply the mixed-precision configuration to a model.

    Args:
        model: the diffusion model as a torch.nn.Module
        config: mixed-precision configuration dict
    Returns:
        patcher: a ModelPatcher wrapping the modified model
    """
    # ModelPatcher needs the load and offload devices in addition to the model
    patcher = ModelPatcher(model, mm.get_torch_device(), mm.unet_offload_device())
    # Walk every module and apply the policy that matches its type
    for name, module in model.named_modules():
        if 'attn' in name.lower():  # attention modules
            layer_config = config['attention_layers']
            if not layer_config['quantize']:  # cast to the requested float precision
                module.to(dtype=layer_config['dtype'])
                print(f"Attention layer {name} uses {layer_config['dtype']}")
        elif isinstance(module, torch.nn.Linear):  # linear layers
            layer_config = config['linear_layers']
            if layer_config['quantize'] and layer_config['dtype'] == 'nvfp4':
                quantize_linear_nvfp4(module, layer_config['calibration'])
                print(f"Linear layer {name} quantized to 4 bits")
        elif isinstance(module, (torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d)):  # convolutions
            layer_config = config['conv_layers']
            if layer_config['quantize'] and layer_config['dtype'] == 'fp8':
                quantize_conv_fp8(module, layer_config['calibration'])
                print(f"Convolution layer {name} quantized to FP8")
        elif isinstance(module, (torch.nn.LayerNorm, torch.nn.GroupNorm, torch.nn.BatchNorm2d)):  # norms
            layer_config = config['norm_layers']
            module.to(dtype=layer_config['dtype'])
            print(f"Normalization layer {name} uses {layer_config['dtype']}")
    return patcher

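def quantize_linear_nvfp4(linear_module, calibration='minmax'):
    """Quantize a linear layer's weights to 4 bits.

    This is a simplified stand-in for NVFP4: it uses 16-level affine integer
    quantization with a single scale per tensor, whereas NVFP4 itself is a
    4-bit floating-point format with per-block scales.

    Args:
        linear_module: a torch.nn.Linear module
        calibration: 'minmax' or 'histogram'
    """
    # torch.quantile requires float32/float64 input
    weight = linear_module.weight.data.float()
    if calibration == 'minmax':  # use the full range of the weights
        w_min = weight.min()
        w_max = weight.max()
    elif calibration == 'histogram':  # clip outliers at the 1st and 99th percentiles
        w_min = torch.quantile(weight.flatten(), 0.01)
        w_max = torch.quantile(weight.flatten(), 0.99)
    else:
        raise ValueError(f"unknown calibration method: {calibration}")
    # 16 quantization levels (0-15); guard against a zero range
    scale = ((w_max - w_min) / 15.0).clamp_min(1e-12)
    zero_point = -w_min / scale
    # q = round(w / scale + zero_point) = round((w - w_min) / scale)
    quantized = torch.round((weight - w_min) / scale).clamp(0, 15)
    # Store the quantization parameters on the module
    linear_module.weight_scale = scale
    linear_module.weight_zero_point = zero_point
    linear_module.weight_quantized = quantized.to(torch.uint8)
    # Tag the module with the quantization type used by the config
    linear_module.quantization_type = 'nvfp4'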

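def quantize_conv_fp8(conv_module, calibration='histogram'):
    """Quantize a convolution layer's weights to FP8 (E4M3).

    Args:
        conv_module: a torch.nn.Conv* module
        calibration: calibration method
    """
    # torch.quantile requires float32/float64 input
    weight = conv_module.weight.data.float()
    # FP8 quantization rescales the weights into the format's dynamic range
    if calibration == 'histogram':  # ignore the top 1% of magnitudes
        abs_max = torch.quantile(weight.abs().flatten(), 0.99)
    else:  # plain absolute maximum
        abs_max = weight.abs().max()
    # The largest finite FP8 E4M3 value is 448
    fp8_max = 448.0
    scale = fp8_max / abs_max.clamp_min(1e-12)
    # Scale, clip to the FP8 range, then actually cast to FP8
    quantized = (weight * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    # Store the quantization parameters; dequantize with quantized.float() / scale
    conv_module.weight_scale = scale
    conv_module.weight_quantized = quantized
    conv_module.quantization_type = 'fp8'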

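# Usage example (model is the diffusion model loaded earlier)
print("Applying mixed-precision configuration...")
patched_model = apply_mixed_precision(model, mixed_precision_config)
print("Mixed-precision configuration applied.")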

import torch
from collections import deque

class WeightStreamer:
    """Weight streaming manager.

    Keeps layer weights in pinned system memory and copies them to the GPU on
    a dedicated CUDA stream, so transfers overlap with compute and hide
    memory latency.
    """
    def __init__(self, layers, device, stream_buffer_size=2):
        """
        Args:
            layers: the layers to stream, in execution order
            device: target GPU device
            stream_buffer_size: number of layers kept in flight ahead of compute
        """
        self.device = device
        self.stream_buffer_size = stream_buffer_size
        # Dedicated stream so host-to-device copies run alongside compute
        self.transfer_stream = torch.cuda.Stream(device=device)
        # One event per prefetched layer, in execution order
        self.ready_events = deque()
        # Pinned (page-locked) CPU copies make non_blocking transfers asynchronous
        self.cpu_weights = {}
        for layer in layers:
            self.cpu_weights[id(layer)] = {
                name: param.data.cpu().pin_memory()
                for name, param in layer.named_parameters()
            }
            # Free the GPU copy of every parameter, not just the last one
            self._release(layer)
        print(f"Weight streamer ready, buffer size: {stream_buffer_size}")

    def _release(self, layer):
        """Drop a layer's GPU weights; the pinned CPU copy is kept."""
        for param in layer.parameters():
            param.data = torch.empty(0, dtype=param.dtype, device=self.device)

    def prefetch_weights(self, layer):
        """Start an asynchronous copy of one layer's weights to the GPU."""
        with torch.cuda.stream(self.transfer_stream):
            for name, param in layer.named_parameters():
                param.data = self.cpu_weights[id(layer)][name].to(
                    self.device, non_blocking=True
                )
            event = torch.cuda.Event()
            event.record(self.transfer_stream)
        self.ready_events.append(event)

    def forward_with_streaming(self, x, layer_sequence):
        """Run a forward pass while streaming weights.

        Args:
            x: input tensor
            layer_sequence: layers in execution order (the same list passed to __init__)
        Returns:
            output: output tensor
        """
        compute_stream = torch.cuda.current_stream(self.device)
        # Prefetch the first few layers
        for layer in layer_sequence[:self.stream_buffer_size]:
            self.prefetch_weights(layer)
        output = x
        for i, layer in enumerate(layer_sequence):
            # Wait only for this layer's copy, not for the whole transfer stream
            compute_stream.wait_event(self.ready_events.popleft())
            output = layer(output)
            # The weights were allocated on the transfer stream; tell the
            # allocator they are still in use by the compute stream
            for param in layer.parameters():
                param.data.record_stream(compute_stream)
            self._release(layer)
            # Keep the buffer full by prefetching the next layer
            nxt = i + self.stream_buffer_size
            if nxt < len(layer_sequence):
                self.prefetch_weights(layer_sequence[nxt])
        return output

# Usage example. This assumes each layer maps one tensor to one tensor.
layer_sequence = [
    model.input_blocks[0],
    model.input_blocks[1],
    model.middle_block,
    # ... more layers
]
print("Initializing weight streaming...")
streamer = WeightStreamer(
    layers=layer_sequence,
    device=device,
    stream_buffer_size=3  # keep 3 layers in flight
)
print("Running inference with weight streaming...")
# no_grad rather than inference_mode, because the streamer reassigns param.data
with torch.no_grad():
    output = streamer.forward_with_streaming(latent_image, layer_sequence)
print("Streaming inference finished.")

#include "llama.h"
#include "common.h"
#include <vector>
#include <string>
#include <iostream>
#include <stdexcept>

// Token sampling configuration
struct GPUSamplingConfig {
    // Top-K: sample from the K most probable tokens
    int top_k = 40;  // 0 disables
    // Top-P (nucleus): sample from the smallest set whose cumulative probability reaches P
    float top_p = 0.95f;  // 1.0 disables
    // Temperature: controls the randomness of the distribution
    float temperature = 0.8f;  // 1.0 keeps the original distribution, <1 is more deterministic, >1 more random
    // Min-keep: always keep at least this many tokens
    int min_k = 1;
    // Min-P: keep only tokens whose probability exceeds P * max_prob
    float min_p = 0.05f;
    // Repetition penalty: lower the probability of tokens already generated
    float repeat_penalty = 1.1f;  // 1.0 means no penalty
    int repeat_last_n = 64;  // penalize the most recent N tokens
    // GPU sampling switch
    bool use_gpu_sampling = true;
    // Batching (for multi-sequence sampling)
    int n_batch = 512;  // tokens per batch
    int n_parallel = 4;  // number of parallel sequences
};

// Wraps a llama.cpp model and context configured for GPU inference
class LlamaCppGPUSampler {
private:
    llama_model* model;
    llama_context* ctx;
    GPUSamplingConfig config;
public:
    LlamaCppGPUSampler(const std::string& model_path, const GPUSamplingConfig& cfg) : config(cfg) {
        // Model parameters
        llama_model_params model_params = llama_model_default_params();
        model_params.n_gpu_layers = 99;  // offload all layers to the GPU
        model_params.use_mmap = true;  // memory-map the file for faster loading
        model_params.use_mlock = false;  // do not lock memory (allow swapping)
        // Load the model
        std::cout << "Loading model: " << model_path << std::endl;
        model = llama_load_model_from_file(model_path.c_str(), model_params);
        if (!model) {
            throw std::runtime_error("failed to load model");
        }
        // Context parameters
        llama_context_params ctx_params = llama_context_default_params();
        ctx_params.n_ctx = 4096;  // context window size
        ctx_params.n_batch = config.n_batch;  // batch size
        ctx_params.n_seq_max = config.n_parallel;  // maximum number of parallel sequences
        ctx_params.n_threads = 8;  // CPU threads
        ctx_params.n_threads_batch = 8;  // CPU threads for batch processing
        // Keep the KV cache and attention on the GPU
        ctx_params.offload_kqv = true;  // offload KQV to the GPU
        ctx_params.flash_attn = true;  // enable Flash Attention
        // Create the context
        std::cout << "Creating context..." << std::endl;
        ctx = llama_new_context_with_model(model, ctx_params);
        if (!ctx) {
            llama_free_model(model);
            throw std::runtime_error("failed to create context");
        }
        std::cout << "Initialization complete." << std::endl;
    }
    ~LlamaCppGPUSampler() {
        if (ctx) llama_free(ctx);
        if (model) llama_free_model(model);
    }

    // Generate text from a prompt
    std::string generate(const std::string& prompt, int max_tokens = 512) {
        // Tokenize the prompt
        std::vector<llama_token> tokens(prompt.size() + 1);
        int n_tokens = llama_tokenize(
            model, prompt.c_str(), prompt.size(), tokens.data(), tokens.size(),
            true,  // add_bos: prepend the beginning-of-sequence token
            false  // special: do not parse special tokens
        );
        if (n_tokens < 0) {
            throw std::runtime_error("tokenization failed");
        }
        tokens.resize(n_tokens);
        if (n_tokens > config.n_batch) {
            throw std::runtime_error("prompt does not fit in one batch");
        }
        std::cout << "Prompt tokens: " << n_tokens << std::endl;
        // Prepare the batch
        llama_batch batch = llama_batch_init(config.n_batch, 0, config.n_parallel);
        // Add the prompt tokens
        for (size_t i = 0; i < tokens.size(); i++) {
            llama_batch_add(batch, tokens[i], i, {0}, false);
        }
        // Only the last prompt token needs logits
        batch.logits[batch.n_tokens - 1] = true;
        // Evaluate the prompt
        if (llama_decode(ctx, batch) != 0) {
            llama_batch_free(batch);
            throw std::runtime_error("llama_decode failed");
        }
        // Generation loop
        std::string generated_text;
        int n_cur = batch.n_tokens;
        int n_decode = 0;
        std::cout << "Generating..." << std::endl;
        while (n_decode < max_tokens) {
            // Logits for the last token in the batch
            float* logits = llama_get_logits_ith(ctx, batch.n_tokens - 1);
            int n_vocab = llama_n_vocab(model);
            // Sample the next token
            llama_token new_token = sample_token_gpu(logits, n_vocab);
            // Stop at end-of-sequence
            if (new_token == llama_token_eos(model)) {
                std::cout << "\nReached EOS, stopping" << std::endl;
                break;
            }
            // Convert the token to text
            char buf[256];
            int n = llama_token_to_piece(model, new_token, buf, sizeof(buf));
            if (n > 0) {
                generated_text.append(buf, n);
                std::cout << std::string(buf, n) << std::flush;
            }
            // Prepare the next decode step
            llama_batch_clear(batch);
            llama_batch_add(batch, new_token, n_cur, {0}, true);
            n_cur++;
            n_decode++;
            // Decode the next token
            if (llama_decode(ctx, batch) != 0) {
                std::cout << "\nllama_decode failed" << std::endl;
                break;
            }
        }
        llama_batch_free(batch);
        std::cout << "\nDone. Generated " << n_decode << " tokens" << std::endl;
        return generated_text;
    }

private:
    // Tokens sampled so far; used as the repetition-penalty window
    std::vector<llama_token> history;

    // Sample the next token from the logits.
    // Note: the llama_sample_* calls below belong to the older llama.cpp C API
    // and operate on a host-side candidate array.
    llama_token sample_token_gpu(float* logits, int n_vocab) {
        // Build the candidate array
        std::vector<llama_token_data> candidates;
        candidates.reserve(n_vocab);
        for (int i = 0; i < n_vocab; i++) {
            candidates.push_back({i, logits[i], 0.0f});
        }
        llama_token_data_array candidates_p = {
            candidates.data(), candidates.size(), false  // sorted: not yet
        };
        // Repetition penalty over the most recent tokens
        if (config.repeat_penalty != 1.0f && !history.empty()) {
            size_t n_last = history.size() < (size_t) config.repeat_last_n
                ? history.size() : (size_t) config.repeat_last_n;
            llama_sample_repetition_penalties(
                ctx, &candidates_p, history.data() + history.size() - n_last, n_last,
                config.repeat_penalty,  // repeat_penalty
                0.0f,  // alpha_frequency
                0.0f   // alpha_presence
            );
        }
        // Temperature
        if (config.temperature != 1.0f) {
            llama_sample_temp(ctx, &candidates_p, config.temperature);
        }
        // Min-P filter
        if (config.min_p > 0.0f) {
            llama_sample_min_p(ctx, &candidates_p, config.min_p, config.min_k);
        }
        // Top-K
        if (config.top_k > 0) {
            llama_sample_top_k(ctx, &candidates_p, config.top_k, config.min_k);
        }
        // Top-P
        if (config.top_p < 1.0f) {
            llama_sample_top_p(ctx, &candidates_p, config.top_p, config.min_k);
        }
        // Draw a token from the remaining candidates and remember it
        llama_token token = llama_sample_token(ctx, &candidates_p);
        history.push_back(token);
        return token;
    }
};

// Example main function
int main() {
    // Sampling parameters
    GPUSamplingConfig config;
    config.top_k = 40;
    config.top_p = 0.95f;
    config.temperature = 0.8f;
    config.repeat_penalty = 1.1f;
    config.use_gpu_sampling = true;  // enable GPU sampling
    try {
        // Initialize the sampler
        // Supported models: Nemotron Nano V2, Qwen 3 30B, GPT-OSS-20B, and others
        LlamaCppGPUSampler sampler("/path/to/nemotron-nano-v2-q4_k_m.gguf", config);
        // Generate text
        std::string prompt = "Explain quantum computing in simple terms:";
        std::string generated = sampler.generate(prompt, 512);
        std::cout << "\n\n=== Full output ===" << std::endl;
        std::cout << prompt << generated << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}

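#!/bin/bash
# llama.cpp build script (CUDA graph optimization and concurrent QKV)
# Runtime switches
export GGML_CUDA_GRAPH_OPT=1  # enable CUDA graph optimization
export CUDA_LAUNCH_BLOCKING=0  # allow asynchronous CUDA operations

# Build llama.cpp with CUDA. A comment placed between backslash-continued
# lines ends the command, so the options are explained here instead:
#   GGML_CUDA                      enable the CUDA backend (LLAMA_CUBLAS in older releases)
#   GGML_CUDA_FA_ALL_QUANTS        build Flash Attention kernels for all quantization types
#   GGML_CUDA_GRAPHS               enable CUDA graphs
#   GGML_CUDA_PEER_MAX_BATCH_SIZE  peer-to-peer transfer batch size
#   CMAKE_BUILD_TYPE=Release       optimized build
cd /path/to/llama.cpp
mkdir -p build && cd build
cmake .. \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DGGML_CUDA_GRAPHS=ON \
    -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build . -j"$(nproc)"
echo "llama.cpp build finished (CUDA graphs and concurrent QKV enabled)"

# Run inference with the optimization flags:
#   --n-gpu-layers 99    offload all layers to the GPU
#   --batch-size 512     batch size
#   --ctx-size 4096      context size
#   --threads 8          CPU threads
#   --backend-sampling   enable backend (GPU) sampling
#   --flash-attn on      enable Flash Attention (older releases take the flag without a value)
#   --temp / --top-k / --top-p / --repeat-penalty   sampling parameters
# The binary is llama-cli in current releases (main in older ones).
./bin/llama-cli \
    -m /path/to/model.gguf \
    -p "Explain the theory of relativity:" \
    -n 512 \
    --n-gpu-layers 99 \
    --batch-size 512 \
    --ctx-size 4096 \
    --threads 8 \
    --backend-sampling \
    --flash-attn on \
    --temp 0.8 \
    --top-k 40 \
    --top-p 0.95 \
    --repeat-penalty 1.1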

import os

# Set the runtime switch before the library is loaded.
# (GGML_CUDA_FA_ALL_QUANTS is a build-time CMake option, so it is not set here.)
os.environ['GGML_CUDA_GRAPH_OPT'] = '1'

from llama_cpp import Llama

# Initialize llama.cpp with the optimizations enabled
llm = Llama(
    model_path="/path/to/nemotron-nano-v2-q4_k_m.gguf",
    n_gpu_layers=99,  # offload all layers to the GPU
    n_ctx=4096,  # context window size
    n_batch=512,  # batch size
    n_threads=8,  # CPU threads
    use_mmap=True,  # memory-map the model file
    use_mlock=False,  # do not lock memory
    flash_attn=True,  # Flash Attention
    offload_kqv=True,  # offload KQV to the GPU
    logits_all=False,  # compute logits for the last token only
    vocab_only=False,  # load the full model
    verbose=True  # print detailed information
)
print("llama.cpp initialized (CUDA graphs and concurrent QKV enabled)")

# Generate text
prompt = """You are a helpful AI assistant. Answer the following question: Question: What are the key differences between quantum computing and classical computing? Answer:"""
print("Generating...")
# Stream the output for a better user experience
output = llm(
    prompt,
    max_tokens=512,  # maximum number of generated tokens
    temperature=0.8,
    top_k=40,
    top_p=0.95,
    repeat_penalty=1.1,
    stop=["Question:", "\n\n"],  # stop sequences
    stream=True
)

# Print the streamed output
full_response = ""
for chunk in output:
    text = chunk['choices'][0]['text']
    print(text, end='', flush=True)
    full_response += text

print("\n\nGeneration finished.")
# Token statistics. Streamed chunks carry no 'usage' field, so count tokens directly.
prompt_tokens = len(llm.tokenize(prompt.encode("utf-8")))
completion_tokens = len(llm.tokenize(full_response.encode("utf-8"), add_bos=False))
print("\nStatistics:")
print(f"- Prompt tokens: {prompt_tokens}")
print(f"- Generated tokens: {completion_tokens}")
print(f"- Total tokens: {prompt_tokens + completion_tokens}")

import math
import requests
import json

# Ollama API endpoint
OLLAMA_API = "http://localhost:11434/api"

# Describe the Flash Attention and memory-management settings
def configure_ollama_optimizations():
    """Build and print an illustrative set of Ollama performance settings.

    The dict is only printed, not applied. The Ollama server itself is
    configured through environment variables such as OLLAMA_FLASH_ATTENTION
    and OLLAMA_NUM_PARALLEL.
    """
    config = {
        # Flash Attention
        "flash_attention": {
            "enabled": True,
            "tile_size": 128,  # tile size, trades memory against speed
            "use_cuda_graphs": True  # use CUDA graph optimization
        },
        # Memory management
        "memory": {
            "gpu_memory_fraction": 0.9,  # give Ollama 90% of GPU memory
            "enable_unified_memory": True,  # unified memory (GPU + CPU)
            "offload_layers": "auto"  # choose the layer offload strategy automatically
        },
        # Performance
        "performance": {
            "num_parallel": 4,  # parallel requests
            "num_thread": 8,  # CPU threads
            "batch_size": 512,  # batch size
            "use_mmap": True  # memory-map model files
        },
        # GGML library optimizations
        "ggml": {
            "cuda_graph_opt": True,  # CUDA graph optimization
            "fa_all_quants": True  # Flash Attention for all quantization types
        }
    }
    print("Ollama optimization settings:")
    print(json.dumps(config, indent=2))
    return config

# Generate text and request log probabilities
def generate_with_logprobs(prompt, model="nemotron-nano:latest"):
    """Generate text through the Ollama API and request per-token logprobs.

    Logprobs are useful for classification, perplexity, and self-evaluation.
    Args:
        prompt: input prompt
        model: model name
    Returns:
        response: the parsed JSON response, including the text and logprobs
    """
    url = f"{OLLAMA_API}/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # non-streaming, so all logprobs arrive in one response
        # Logprobs are requested at the top level of the request, not inside "options"
        "logprobs": True,
        "top_logprobs": 5,  # also return the 5 most likely tokens per position
        # Sampling and runtime options
        "options": {
            "temperature": 0.8,
            "top_k": 40,
            "top_p": 0.95,
            "repeat_penalty": 1.1,
            "num_predict": 512,  # maximum number of generated tokens
            "num_ctx": 4096,  # context size
            "num_batch": 512,  # batch size
            "num_gpu": 99,  # layers placed on the GPU
            "num_thread": 8  # CPU threads
        }
    }
    print("Sending request to the Ollama API...")
    response = requests.post(url, json=payload, timeout=300)
    if response.status_code == 200:
        return response.json()
    raise RuntimeError(f"API request failed: {response.status_code} - {response.text}")

# Zero-shot text classification with logprobs
def classify_with_logprobs(text, categories, model="nemotron-nano:latest"):
    """Classify text zero-shot and derive a confidence from the logprobs.

    Args:
        text: the text to classify
        categories: list of category names
        model: model name
    Returns:
        classification: the chosen category and its confidence
    """
    import math
    # Build the classification prompt
    prompt = f"""Classify the following text into one of these categories: {', '.join(categories)} Text: {text} Category:"""
    # Generate and fetch logprobs
    result = generate_with_logprobs(prompt, model)
    generated_text = result['response'].strip()
    # The response is expected to hold one entry per generated token,
    # each with a 'logprob' field; fall back gracefully if it is absent
    entries = result.get('logprobs') or []
    token_logprobs = [e['logprob'] for e in entries if isinstance(e, dict) and 'logprob' in e]
    # Confidence: geometric-mean token probability of the generated answer
    confidence = 0.0
    if token_logprobs:
        confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    # Match the generated text against the known categories
    category_probs = {}
    for category in categories:
        if category.lower() in generated_text.lower():
            category_probs[category] = confidence
    if category_probs:
        best_category = max(category_probs, key=category_probs.get)
    else:
        best_category = generated_text
        confidence = 0.0
    return {'category': best_category, 'confidence': confidence, 'all_probabilities': category_probs}

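# Perplexity
def calculate_perplexity(text, model="nemotron-nano:latest"):
    """Compute a perplexity value from the returned logprobs.

    Perplexity is a standard measure of language-model quality. Note that the
    API returns logprobs for the generated tokens, so this measures the
    model's confidence in its own continuation of `text`, not the perplexity
    of `text` itself.
    Args:
        text: input text
        model: model name
    Returns:
        perplexity: dict with the perplexity value and supporting numbers
    """
    import math
    result = generate_with_logprobs(text, model)
    entries = result.get('logprobs') or []
    token_logprobs = [e['logprob'] for e in entries if isinstance(e, dict) and 'logprob' in e]
    if not token_logprobs:
        return {'perplexity': None, 'error': 'LogProbs not available'}
    # Average negative log-likelihood
    avg_neg_log_likelihood = -sum(token_logprobs) / len(token_logprobs)
    # Perplexity = exp(average negative log-likelihood)
    perplexity = math.exp(avg_neg_log_likelihood)
    return {'perplexity': perplexity, 'avg_log_likelihood': -avg_neg_log_likelihood, 'num_tokens': len(token_logprobs)}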

# Example main function
if __name__ == "__main__":
    # Settings
    print("=== Ollama optimization settings ===")
    config = configure_ollama_optimizations()
    # Example 1: basic generation
    print("\n=== Example 1: basic text generation ===")
    prompt = "Explain the concept of machine learning in simple terms:"
    result = generate_with_logprobs(prompt)
    print(f"Output: {result['response']}")
    # Example 2: text classification
    print("\n=== Example 2: text classification with logprobs ===")
    text_to_classify = "The stock market reached new highs today as investors reacted positively to economic data."
    categories = ["Technology", "Finance", "Sports", "Politics", "Entertainment"]
    classification = classify_with_logprobs(text_to_classify, categories)
    print(f"Category: {classification['category']}")
    print(f"Confidence: {classification['confidence']:.4f}")
    print(f"All probabilities: {classification['all_probabilities']}")
    # Example 3: perplexity
    print("\n=== Example 3: perplexity ===")
    sample_text = "The quick brown fox jumps over the lazy dog."
    perplexity_result = calculate_perplexity(sample_text)
    if perplexity_result['perplexity'] is None:
        print(f"Perplexity unavailable: {perplexity_result['error']}")
    else:
        print(f"Perplexity: {perplexity_result['perplexity']:.2f}")
        print(f"Average log-likelihood: {perplexity_result['avg_log_likelihood']:.4f}")
        print(f"Tokens: {perplexity_result['num_tokens']}")
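The perplexity calculation itself does not need a running server. As a minimal worked example, assuming only a list of per-token log probabilities (natural log), perplexity is the exponential of the average negative log-likelihood. A model that assigns every token a probability of 1/4 should therefore score a perplexity of 4, which makes a convenient sanity check.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) for natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    avg_neg_log_likelihood = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_log_likelihood)

if __name__ == "__main__":
    # Four tokens, each predicted with probability 0.25.
    uniform = [math.log(0.25)] * 4
    print(f"uniform 1/4: {perplexity(uniform):.2f}")
    # A confident model (p = 0.9 per token) has perplexity close to 1.
    confident = [math.log(0.9)] * 4
    print(f"confident:   {perplexity(confident):.2f}")
```

Lower is better: a perplexity of k means the model was, on average, as uncertain as if it were choosing uniformly among k tokens.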

import torch
import torchaudio
import torchvision
from diffusers import DiffusionPipeline
import numpy as np
from PIL import Image

class LTX2AudioVideoGenerator:
    """LTX-2 audio-video generator.
    Supports 4K resolution, 50 fps, and multimodal control.
    """
    def __init__(self, model_path, use_fp8=True, device="cuda"):
        """Initialize the LTX-2 generator.
        Args:
            model_path: model path (BF16 or NVFP8 format)
            use_fp8: whether to use FP8 quantization (saves 30% memory)
            device: compute device
        """
        self.device = device
        self.use_fp8 = use_fp8
        print(f"Loading LTX-2 model ({'NVFP8' if use_fp8 else 'BF16'} format)...")
        # Load the LTX-2 pipeline
        # LTX-2 is a joint audio-video generation model
        self.pipeline = DiffusionPipeline.from_pretrained(
            model_path,
            torch_dtype=torch.float8_e4m3fn if use_fp8 else torch.bfloat16,
            variant="fp8" if use_fp8 else None,
            use_safetensors=True
        )
        # Memory optimizations
        # CPU offload moves each sub-model to the GPU only while it runs, so the
        # pipeline must not also be moved with .to(device)
        self.pipeline.enable_model_cpu_offload(device=device)
        self.pipeline.enable_vae_slicing()  # VAE slicing lowers peak memory
        self.pipeline.enable_attention_slicing()  # attention slicing
        # Enable xFormers if it is available
        try:
            self.pipeline.enable_xformers_memory_efficient_attention()
            print("xFormers memory-efficient attention enabled")
        except Exception:
            print("xFormers unavailable, using standard attention")
        print("LTX-2 model loaded.")

    def generate_video(
        self, prompt, audio_prompt=None, duration=10.0, resolution=(3840, 2160),  # 4K resolution
        fps=50, num_inference_steps=50, guidance_scale=7.5, seed=None
    ):
        """Generate audio-video content.
        Args:
            prompt: text description of the video
            audio_prompt: text description of the audio (optional)
            duration: video length in seconds, at most 20
            resolution: video resolution (width, height)
            fps: frame rate, at most 50 fps
            num_inference_steps: number of denoising steps
            guidance_scale: guidance strength
            seed: random seed
        Returns:
            video: video tensor (T, C, H, W)
            audio: audio tensor (C, T)
        """
        # Validate the arguments
        if duration > 20.0:
            print("Warning: duration exceeds 20 seconds, truncating to 20 seconds")
            duration = 20.0
        if fps > 50:
            print("Warning: frame rate exceeds 50 fps, limiting to 50 fps")
            fps = 50
        # Total number of frames
        num_frames = int(duration * fps)
        # Seed the generators for reproducibility
        if seed is not None:
            torch.manual_seed(seed)
            np.random.seed(seed)
        print("Generation parameters:")
        print(f" - Prompt: {prompt}")
        print(f" - Audio prompt: {audio_prompt or '(none)'}")
        print(f" - Duration: {duration} s")
        print(f" - Resolution: {resolution[0]}x{resolution[1]}")
        print(f" - Frame rate: {fps} fps")
        print(f" - Total frames: {num_frames}")
        print(f" - Denoising steps: {num_inference_steps}")
        # Prepare the multimodal inputs
        # LTX-2 supports joint control through text, image, and audio
        inputs = {
            'prompt': prompt,
            'num_frames': num_frames,
            'height': resolution[1],
            'width': resolution[0],
            'num_inference_steps': num_inference_steps,
            'guidance_scale': guidance_scale,
            'output_type': 'pt'  # return PyTorch tensors
        }
        # Add the audio prompt if one was given
        if audio_prompt:
            inputs['audio_prompt'] = audio_prompt
            inputs['audio_guidance_scale'] = guidance_scale * 0.8  # slightly weaker audio guidance
        print("Generating audio-video content...")
        # Run generation
        with torch.inference_mode():
            output = self.pipeline(**inputs)
            # Extract video and audio
            video = output.frames  # (T, C, H, W)
            audio = output.audio if hasattr(output, 'audio') else None  # (C, T)
        print("Generation finished.")
        print(f" - Video shape: {video.shape}")
        if audio is not None:
            print(f" - Audio shape: {audio.shape}")
        return video, audio

    def generate_with_image_control(
        self, prompt, control_image, audio_prompt=None, duration=10.0, fps=50,
        controlnet_conditioning_scale=1.0
    ):
        """Generate audio-video content guided by an image.
        LTX-2 supports multimodal control, so a reference image can steer generation.
        Args:
            prompt: text prompt
            control_image: control image (PIL Image or tensor)
            audio_prompt: audio prompt
            duration: video length in seconds
            fps: frame rate
            controlnet_conditioning_scale: control strength
        Returns:
            video: the generated video
            audio: the generated audio
        """
        # Preprocess the control image
        if isinstance(control_image, Image.Image):
            control_image = torchvision.transforms.ToTensor()(control_image)
            control_image = control_image.to(self.device)
        print("Generating audio-video content with image control...")
        print(f" - Control image shape: {control_image.shape}")
        print(f" - Control strength: {controlnet_conditioning_scale}")
        # Prepare the inputs
        num_frames = int(duration * fps)
        inputs = {
            'prompt': prompt,
            'image': control_image,
            'num_frames': num_frames,
            'num_inference_steps': 50,
            'guidance_scale': 7.5,
            'controlnet_conditioning_scale': controlnet_conditioning_scale,
            'output_type': 'pt'
        }
        if audio_prompt:
            inputs['audio_prompt'] = audio_prompt
        # Generate
        with torch.inference_mode():
            output = self.pipeline(**inputs)
        return output.frames, output.audio if hasattr(output, 'audio') else None

    def save_video(self, video, audio, output_path, fps=50):
        """ 保存音视频到文件
        参数:
            video: 视频张量 (T, C, H, W)
            audio: 音频张量 (C, T)
            output_path: 输出文件路径
            fps: 帧率
        """
        print(f"保存音视频到：{output_path}")
        # torchvision.io.write_video expects a uint8 tensor shaped (T, H, W, C),
        # so convert from (T, C, H, W) and scale [0, 1] floats to [0, 255]
        video_uint8 = (video.permute(0, 2, 3, 1).float().clamp(0, 1) * 255).to(torch.uint8).cpu()
        torchvision.io.write_video(
            output_path, video_uint8, fps=fps, video_codec='h264',
            options={'crf': '18'}  # high-quality encode
        )
        print(f" - Video saved: {output_path}")
        if audio is not None:
            # The audio track goes to a separate WAV file; mux it into the MP4 with ffmpeg afterwards.
            # The 48 kHz sample rate is carried over from the original example and is an assumption.
            audio_path = output_path.replace('.mp4', '_audio.wav')
            torchaudio.save(audio_path, audio.float().cpu(), sample_rate=48000)
            print(f" - Audio saved: {audio_path}")
            # Copy the video stream and encode audio as AAC (PCM WAV audio is not generally playable inside MP4)
            print(f" - Mux with ffmpeg: ffmpeg -i {output_path} -i {audio_path} -c:v copy -c:a aac output_with_audio.mp4")

# 使用示例
if __name__ == "__main__":
    # 初始化生成器（使用 FP8 量化节省内存）
    generator = LTX2AudioVideoGenerator(
        model_path="/path/to/ltx2-fp8",
        use_fp8=True,  # 使用 FP8 量化，节省 30% 内存
        device="cuda"
    )
    # 示例 1：生成 4K 50fps 音视频
    print("\n=== 示例 1：生成 4K 50fps 音视频 ===")
    video, audio = generator.generate_video(
        prompt="A serene mountain landscape at sunrise, with mist rolling over peaks, cinematic lighting, 4K quality",
        audio_prompt="Gentle ambient music with nature sounds, peaceful morning atmosphere",
        duration=10.0,
        resolution=(3840, 2160),  # 4K
        fps=50,
        num_inference_steps=50,
        guidance_scale=7.5,
        seed=42
    )
    # 保存结果
    generator.save_video(video, audio, "mountain_sunrise_4k50.mp4", fps=50)
    # 示例 2：使用图像控制生成
    print("\n=== 示例 2：使用参考图像控制生成 ===")
    reference_image = Image.open("reference_frame.jpg")
    video2, audio2 = generator.generate_with_image_control(
        prompt="Continue this scene with dynamic camera movement, add flying birds",
        control_image=reference_image,
        audio_prompt="Add wind sounds and bird chirping",
        duration=15.0,
        fps=50,
        controlnet_conditioning_scale=0.8
    )
    generator.save_video(video2, audio2, "controlled_generation_4k50.mp4", fps=50)
    print("\n所有生成任务完成！")
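
The example above writes the video and the audio track to separate files and leaves the final muxing step to ffmpeg. A minimal sketch of that step follows, assuming ffmpeg is installed and on PATH. `build_mux_command` and `mux_audio_video` are hypothetical helper names introduced here for illustration; they are not part of LTX-2, torchvision, or torchaudio. The video stream is copied unchanged, while the audio is encoded to AAC because PCM WAV audio is not generally playable inside an MP4 container.

```python
import subprocess


def build_mux_command(video_path, audio_path, output_path):
    """Build the ffmpeg argument list that merges one video and one audio file."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it already exists
        "-i", video_path,      # input 0: H.264 video written by save_video
        "-i", audio_path,      # input 1: WAV audio written by save_video
        "-c:v", "copy",        # keep the encoded video stream as-is
        "-c:a", "aac",         # encode audio to AAC for MP4 compatibility
        "-shortest",           # stop at the end of the shorter stream
        output_path,
    ]


def mux_audio_video(video_path, audio_path, output_path):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_mux_command(video_path, audio_path, output_path), check=True)
    return output_path
```

Calling `mux_audio_video("mountain_sunrise_4k50.mp4", "mountain_sunrise_4k50_audio.wav", "mountain_sunrise_4k50_av.mp4")` would produce a single file with both tracks. Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.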

import torch
from torch.utils.data import Dataset, DataLoader
import json
from pathlib import Path

class VideoPromptDataset(Dataset):
    """ 视频提示词数据集
    用于批量生成音视频内容
    """
    def __init__(self, prompts_file):
        """ 初始化数据集
        参数:
            prompts_file: JSON 格式的提示词文件
            格式：[{"prompt": "...", "audio_prompt": "...", "duration": 10.0}, ...]
        """
        with open(prompts_file, 'r', encoding='utf-8') as f:
            self.prompts = json.load(f)

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        return self.prompts[idx]

def batch_generate_videos(generator, prompts_file, output_dir, batch_size=1,  # LTX-2 生成视频内存消耗大，通常 batch_size=1
                          resolution=(1920, 1080),  # 降低分辨率以支持批处理
                          fps=30):
    """ 批量生成音视频内容
    参数:
        generator: LTX2AudioVideoGenerator 实例
        prompts_file: 提示词文件路径
        output_dir: 输出目录
        batch_size: 批处理大小
        resolution: 视频分辨率
        fps: 帧率
    """
    # 创建输出目录
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    # 加载数据集
    dataset = VideoPromptDataset(prompts_file)
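    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=False, num_workers=0,  # prompt loading needs no worker processes
        collate_fn=lambda items: items  # keep each batch as a list of dicts; the default collate would merge them into one dict of lists
    )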
    print(f"开始批量生成，共{len(dataset)}个视频...")
    # 批量生成
    for batch_idx, batch in enumerate(dataloader):
        print(f"\n处理批次 {batch_idx + 1}/{len(dataloader)}")
        for i, item in enumerate(batch):
            idx = batch_idx * batch_size + i
            # 提取参数
            prompt = item['prompt']
            audio_prompt = item.get('audio_prompt', None)
            duration = item.get('duration', 10.0)
            print(f"\n生成视频 {idx + 1}/{len(dataset)}")
            print(f"  提示词：{prompt[:50]}...")
            try:
                # 生成视频
                video, audio = generator.generate_video(
                    prompt=prompt,
                    audio_prompt=audio_prompt,
                    duration=duration,
                    resolution=resolution,
                    fps=fps,
                    num_inference_steps=40,  # 减少步数以加快生成
                    guidance_scale=7.5,
                    seed=42 + idx  # 不同的种子
                )
                # 保存视频
                output_file = output_path / f"video_{idx:04d}.mp4"
                generator.save_video(video, audio, str(output_file), fps=fps)
                print(f" ✓ 已保存：{output_file}")
                # 清理内存
                del video, audio
                torch.cuda.empty_cache()
            except Exception as e:
                print(f" ✗ 生成失败：{e}")
                continue
    print(f"\n批量生成完成！所有视频已保存到：{output_dir}")

# 使用示例
if __name__ == "__main__":
    # 准备提示词文件
    prompts = [
        {"prompt": "A futuristic cityscape at night with neon lights and flying cars", "audio_prompt": "Cyberpunk ambient music with electronic beats", "duration": 10.0},
        {"prompt": "Underwater coral reef with colorful fish swimming, sunlight filtering through water", "audio_prompt": "Underwater ambience with gentle water sounds", "duration": 15.0},
        {"prompt": "Time-lapse of a flower blooming, macro photography, beautiful colors", "audio_prompt": "Gentle classical music, peaceful atmosphere", "duration": 8.0}
    ]
    # 保存提示词到文件
    with open('video_prompts.json', 'w', encoding='utf-8') as f:
        json.dump(prompts, f, indent=2, ensure_ascii=False)
    # 初始化生成器
    generator = LTX2AudioVideoGenerator(
        model_path="/path/to/ltx2-fp8",
        use_fp8=True,
        device="cuda"
    )
    # 批量生成
    batch_generate_videos(
        generator=generator,
        prompts_file='video_prompts.json',
        output_dir='generated_videos',
        batch_size=1,
        resolution=(1920, 1080),  # Full HD
        fps=30
    )
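
The batch example drops from 4K at 50 fps to Full HD at 30 fps to keep memory in check. The sketch below puts rough numbers on that choice by counting only the decoded output tensor, assuming three channels stored as float32. It ignores model weights, latents, and activations, and a real pipeline may decode in chunks, so treat the result as a lower bound on host or device memory rather than a measurement. `video_tensor_bytes` is a hypothetical helper name.

```python
def video_tensor_bytes(duration_s, fps, resolution, channels=3, bytes_per_value=4):
    """Size in bytes of a dense (T, C, H, W) video tensor."""
    width, height = resolution
    num_frames = int(duration_s * fps)
    return num_frames * channels * height * width * bytes_per_value


def to_gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


uhd = video_tensor_bytes(10.0, 50, (3840, 2160))   # 500 frames at 4K
fhd = video_tensor_bytes(10.0, 30, (1920, 1080))   # 300 frames at Full HD

print(f"4K 50fps, 10 s:    {to_gib(uhd):.2f} GiB")   # 46.35 GiB
print(f"1080p 30fps, 10 s: {to_gib(fhd):.2f} GiB")   # 6.95 GiB
```

Four times fewer pixels per frame and 300 instead of 500 frames give roughly a 6.7x smaller output tensor, which is why lowering resolution and frame rate is the first lever to pull before attempting larger batches.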

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
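# DataCollatorForCompletionOnlyLM ships with older TRL releases; newer ones dropped it, so pin TRL accordingly
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
import bitsandbytes as bnb  # not referenced directly; imported so a missing install fails early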

class NemotronNanoFineTuner:
    """ Nemotron 3 Nano 微调器
    使用 LoRA 进行参数高效微调
    """
    def __init__(self, model_name="nvidia/nemotron-3-nano-32b", use_4bit=True, device="cuda"):
        """ 初始化微调器
        参数:
            model_name: 模型名称或路径
            use_4bit: 是否使用 4 位量化（节省内存）
            device: 计算设备
        """
        self.device = device
        self.use_4bit = use_4bit
        print(f"加载 Nemotron 3 Nano 模型...")
        print(f" - 模型：{model_name}")
        print(f" - 4 位量化：{use_4bit}")
        # 配置 4 位量化
        if use_4bit:
            from transformers import BitsAndBytesConfig
            bnb_config = BitsAndBytesConfig(
                load_in_4bit=True,  # 启用 4 位量化
                bnb_4bit_quant_type="nf4",  # 使用 NF4 量化类型
                bnb_4bit_compute_dtype=torch.bfloat16,  # 计算使用 BF16
                bnb_4bit_use_double_quant=True  # 双重量化以进一步节省内存
            )
        else:
            bnb_config = None
        # 加载模型
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            device_map="auto",  # 自动设备映射
            trust_remote_code=True,  # 信任远程代码
            torch_dtype=torch.bfloat16,  # 使用 BF16
            attn_implementation="flash_attention_2"  # 使用 Flash Attention 2
        )
        # 加载分词器
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name, trust_remote_code=True
        )
        # 设置 pad token
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        # 准备模型用于训练
        if use_4bit:
            self.model = prepare_model_for_kbit_training(self.model)
        print("模型加载完成！")
        print(f" - 参数总数：{sum(p.numel() for p in self.model.parameters()):,}")
        print(f" - 可训练参数：{sum(p.numel() for p in self.model.parameters() if p.requires_grad):,}")

    def setup_lora(self, r=16, lora_alpha=32, lora_dropout=0.05, target_modules=None):
        """ 配置 LoRA（低秩适应）
        参数:
            r: LoRA 秩，控制可训练参数数量
            lora_alpha: LoRA 缩放因子
            lora_dropout: LoRA dropout 率
            target_modules: 要应用 LoRA 的目标模块
        返回:
            model: 应用 LoRA 后的模型
        """
        print("配置 LoRA...")
        # 如果未指定目标模块，使用默认值
        if target_modules is None:
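            # Assumed module names for the attention and MLP projections; confirm them with
            # model.named_modules(), since architectures differ in how they name these layers
            target_modules = ["q_proj",  # query projection
                              "k_proj",  # key projection
                              "v_proj",  # value projection
                              "o_proj",  # attention output projection
                              "gate_proj",  # MLP gate projection (gated-MLP branch, not an MoE router)
                              "up_proj",  # MLP up projection
                              "down_proj"]  # MLP down projection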
        # 配置 LoRA
        lora_config = LoraConfig(
            r=r,  # LoRA 秩
            lora_alpha=lora_alpha,  # LoRA alpha
            target_modules=target_modules,  # 目标模块
            lora_dropout=lora_dropout,  # Dropout
            bias="none",  # 不训练 bias
            task_type="CAUSAL_LM"  # 因果语言建模任务
        )
        # 应用 LoRA 到模型
        self.model = get_peft_model(self.model, lora_config)
        # 打印可训练参数统计
        trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        all_params = sum(p.numel() for p in self.model.parameters())
        trainable_percent = 100 * trainable_params / all_params
        print("LoRA 配置完成！")
        print(f" - LoRA 秩：{r}")
        print(f" - LoRA alpha: {lora_alpha}")
        print(f" - 目标模块：{target_modules}")
        print(f" - 可训练参数：{trainable_params:,} ({trainable_percent:.2f}%)")
        return self.model

    def prepare_dataset(self, dataset_name, text_column="text", max_length=2048, num_samples=None):
        """ 准备训练数据集
        参数:
            dataset_name: 数据集名称或路径
            text_column: 文本列名
            max_length: 最大序列长度
            num_samples: 使用的样本数（None 表示全部）
        返回:
            dataset: 处理后的数据集
        """
        print(f"加载数据集：{dataset_name}")
        # 加载数据集
        dataset = load_dataset(dataset_name, split="train")
        # 如果指定了样本数，进行采样
        if num_samples is not None:
            dataset = dataset.shuffle(seed=42).select(range(min(num_samples, len(dataset))))
        print(f"数据集大小：{len(dataset)}")
        # 定义格式化函数
        def format_instruction(example):
            """ 格式化指令数据为 Nemotron 格式 """
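            # Illustrative prompt layout: <|user|>\n{instruction}\n<|assistant|>\n{response}
            # The model's actual chat format is defined by its tokenizer; tokenizer.apply_chat_template is the reliable way to produce it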
            if "instruction" in example and "response" in example:
                text = f"<|user|>\n{example['instruction']}\n<|assistant|>\n{example['response']}"
            elif text_column in example:
                text = example[text_column]
            else:
                text = str(example)
            return {"text": text}
        # 应用格式化
        dataset = dataset.map(format_instruction, remove_columns=dataset.column_names)
        print("数据集准备完成！")
        return dataset

    def train(self, train_dataset, output_dir="./nemotron_nano_finetuned", num_epochs=3, batch_size=4,
              gradient_accumulation_steps=4, learning_rate=2e-4, max_seq_length=2048,
              logging_steps=10, save_steps=100):
        """ 执行微调训练
        参数:
            train_dataset: 训练数据集
            output_dir: 输出目录
            num_epochs: 训练轮数
            batch_size: 批处理大小
            gradient_accumulation_steps: 梯度累积步数
            learning_rate: 学习率
            max_seq_length: 最大序列长度
            logging_steps: 日志记录步数
            save_steps: 模型保存步数
        """
        print("开始训练...")
        # 配置训练参数
        from transformers import TrainingArguments
        training_args = TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=num_epochs,
            per_device_train_batch_size=batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            learning_rate=learning_rate,
            fp16=False,  # 不使用 FP16（使用 BF16）
            bf16=True,  # 使用 BF16
            logging_steps=logging_steps,
            save_steps=save_steps,
            save_total_limit=3,  # 最多保存 3 个检查点
            optim="paged_adamw_8bit",  # 使用 8 位 AdamW 优化器
            lr_scheduler_type="cosine",  # 余弦学习率调度
            warmup_ratio=0.05,  # 5% 的 warmup
            max_grad_norm=1.0,  # 梯度裁剪
            report_to="tensorboard",  # 使用 TensorBoard 记录
            load_best_model_at_end=False,  # no eval dataset is passed, so best-checkpoint selection cannot run
            ddp_find_unused_parameters=False  # DDP 优化
        )
        # 创建数据整理器
        # 只计算响应部分的损失
        response_template = "<|assistant|>\n"
        collator = DataCollatorForCompletionOnlyLM(
            response_template=response_template,
            tokenizer=self.tokenizer
        )
        # 创建训练器
        trainer = SFTTrainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            tokenizer=self.tokenizer,
            data_collator=collator,
            max_seq_length=max_seq_length,
            packing=False  # 不使用序列打包
        )
        # 开始训练
        print(f"训练配置:")
        print(f" - 轮数：{num_epochs}")
        print(f" - 批大小：{batch_size}")
        print(f" - 梯度累积：{gradient_accumulation_steps}")
        print(f" - 有效批大小：{batch_size * gradient_accumulation_steps}")
        print(f" - 学习率：{learning_rate}")
        print(f" - 最大序列长度：{max_seq_length}")
        trainer.train()
        # 保存最终模型
        print(f"保存模型到：{output_dir}")
        trainer.save_model(output_dir)
        self.tokenizer.save_pretrained(output_dir)
        print("训练完成！")

# 使用示例
if __name__ == "__main__":
    # 初始化微调器
    finetuner = NemotronNanoFineTuner(
        model_name="nvidia/nemotron-3-nano-32b",  # ID as given in this example; confirm the exact repository name on Hugging Face before running
        use_4bit=True,  # 使用 4 位量化以在 RTX PC 上运行
        device="cuda"
    )
    # 配置 LoRA
    finetuner.setup_lora(
        r=16,  # LoRA 秩
        lora_alpha=32,  # LoRA alpha
        lora_dropout=0.05,  # Dropout
        target_modules=None  # 使用默认目标模块
    )
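    # Prepare the dataset
    # Demo dataset only; swap in your own. Note that openassistant-guanaco ships a single "text" column
    # formatted as "### Human: ... ### Assistant: ...", so the "<|assistant|>\n" response template used in
    # train() will not match it. Use data with "instruction"/"response" fields or adjust the template.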
    train_dataset = finetuner.prepare_dataset(
        dataset_name="timdettmers/openassistant-guanaco",
        max_length=2048,
        num_samples=1000  # 使用 1000 个样本进行演示
    )
    # 执行微调
    finetuner.train(
        train_dataset=train_dataset,
        output_dir="./nemotron_nano_agent_finetuned",
        num_epochs=3,
        batch_size=2,  # 小批大小以适应 RTX PC
        gradient_accumulation_steps=8,  # 大梯度累积以模拟大批大小
        learning_rate=2e-4,
        max_seq_length=2048,
        logging_steps=10,
        save_steps=100
    )
    print("\n微调完成！模型已保存。")
    print("可以使用 Ollama 或 llama.cpp 加载微调后的模型进行推理。")
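
The fine-tuning script prints the share of trainable parameters and the effective batch size, and the arithmetic behind both is simple enough to check by hand. LoRA replaces the update to a `d_out x d_in` weight with two low-rank factors, `A` of shape `r x d_in` and `B` of shape `d_out x r`, scaled by `lora_alpha / r`. The sketch below works through the numbers for the settings used above (`r=16`, `lora_alpha=32`, batch size 2 with 8 accumulation steps). The 4096 layer width is a hypothetical value chosen for illustration, not the actual dimension of any Nemotron layer.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A before it is added to the frozen weight."""
    return lora_alpha / r


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples that contribute to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


d = 4096  # hypothetical hidden size, for illustration only
added = lora_param_count(d, d, r=16)
full = d * d

print(f"LoRA params per layer: {added:,}")              # 131,072
print(f"Share of the full layer: {100 * added / full:.2f}%")  # 0.78%
print(f"Update scaling: {lora_scaling(32, 16)}")        # 2.0
print(f"Effective batch size: {effective_batch_size(2, 8)}")  # 16
```

Doubling `r` doubles the added parameters and, with `lora_alpha` held fixed, halves the scaling factor, so the two are usually tuned together.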

import torch
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from transformers import AutoTokenizer, AutoModel
import faiss
import numpy as np
from pathlib import Path

class DoclingRAGPipeline:
    """ 基于 Docling 的 RAG 管道
    支持 GPU 加速的文档处理和向量检索
    """
    def __init__(self, embedding_model="nvidia/nv-embed-v1", device="cuda", use_gpu_ocr=True):
        """ 初始化 RAG 管道
        参数:
            embedding_model: 嵌入模型名称
            device: 计算设备
            use_gpu_ocr: 是否使用 GPU 加速 OCR
        """
        self.device = device
        self.use_gpu_ocr = use_gpu_ocr
        print("初始化 Docling RAG 管道...")
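        # Configure the PDF pipeline before building the converter: Docling takes OCR,
        # table-structure and accelerator settings through PdfPipelineOptions, which is
        # handed to DocumentConverter via format_options.
        # (Import paths follow recent Docling 2.x releases; adjust for your installed version.)
        format_options = {}
        if use_gpu_ocr:
            from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
            from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
            from docling.document_converter import PdfFormatOption
            pipeline_options = PdfPipelineOptions()
            pipeline_options.do_ocr = True
            pipeline_options.ocr_options = EasyOcrOptions()  # EasyOCR supports GPU execution
            pipeline_options.do_table_structure = True  # table structure recognition
            pipeline_options.generate_picture_images = True  # keep figure images for extraction
            pipeline_options.accelerator_options = AcceleratorOptions(
                device=AcceleratorDevice.CUDA if str(device).startswith("cuda") else AcceleratorDevice.AUTO
            )
            format_options[InputFormat.PDF] = PdfFormatOption(pipeline_options=pipeline_options)
            print(" - GPU-accelerated OCR and layout analysis enabled")
        # Docling handles PDF, Word, PowerPoint, HTML and image inputs
        self.doc_converter = DocumentConverter(
            allowed_formats=[
                InputFormat.PDF,
                InputFormat.DOCX,
                InputFormat.PPTX,
                InputFormat.HTML,
                InputFormat.IMAGE
            ],
            format_options=format_options or None
        )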
        # 加载嵌入模型
        print(f"加载嵌入模型：{embedding_model}")
        self.tokenizer = AutoTokenizer.from_pretrained(embedding_model)
        self.embedding_model = AutoModel.from_pretrained(embedding_model).to(device)
        self.embedding_model.eval()
        # 初始化向量索引
        self.index = None
        self.documents = []
        self.metadata = []
        print("Docling RAG 管道初始化完成！")

    def process_document(self, file_path):
        """ 处理文档并提取内容
        使用 GPU 加速的 OCR 和布局分析
        参数:
            file_path: 文档文件路径
        返回:
            result: 处理结果，包含文本、表格、图像等
        """
        print(f"处理文档：{file_path}")
        # 转换文档
        # Docling 会自动识别文档格式并应用相应的处理管道
        result = self.doc_converter.convert(file_path)
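        # Collect the document content
        doc = result.document
        document_data = {
            'text': doc.export_to_markdown(),  # export as Markdown
            'tables': [],
            'figures': [],
            'metadata': {'name': doc.name, 'origin': doc.origin}
        }
        # Tables: captions and page numbers are resolved through the document and provenance records
        for table in doc.tables:
            document_data['tables'].append({
                'content': table.export_to_dataframe(),  # export as a pandas DataFrame
                'caption': table.caption_text(doc),
                'page': table.prov[0].page_no if table.prov else None
            })
        # Figures are exposed as `pictures` on the Docling document
        for picture in doc.pictures:
            document_data['figures'].append({
                'image': picture.get_image(doc),  # None unless picture images were generated
                'caption': picture.caption_text(doc),
                'page': picture.prov[0].page_no if picture.prov else None
            })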
        print(f" - 提取文本：{len(document_data['text'])} 字符")
        print(f" - 提取表格：{len(document_data['tables'])} 个")
        print(f" - 提取图形：{len(document_data['figures'])} 个")
        return document_data

    def chunk_text(self, text, chunk_size=512, overlap=50):
        """ 将文本分块用于嵌入
        参数:
            text: 输入文本
            chunk_size: 块大小（token 数）
            overlap: 重叠大小（token 数）
        返回:
            chunks: 文本块列表
        """
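        # Chunk by tokens with a sliding window
        if overlap >= chunk_size:
            # A non-positive stride would make the loop below run forever
            raise ValueError("overlap must be smaller than chunk_size")
        tokens = self.tokenizer.encode(text, add_special_tokens=False)
        chunks = []
        start = 0
        while start < len(tokens):
            end = start + chunk_size
            piece = self.tokenizer.decode(tokens[start:end])
            chunks.append(piece)
            start += chunk_size - overlap  # consecutive chunks share `overlap` tokens
        return chunks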

    def embed_texts(self, texts, batch_size=32):
        """ 将文本转换为嵌入向量
        使用 GPU 加速批处理
        参数:
            texts: 文本列表
            batch_size: 批处理大小
        返回:
            embeddings: 嵌入向量数组 (N, D)
        """
        print(f"生成嵌入向量（共{len(texts)}个文本）...")
        all_embeddings = []
        # 批处理生成嵌入
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i+batch_size]
            # Tokenize
            inputs = self.tokenizer(
                batch_texts, padding=True, truncation=True, max_length=512,
                return_tensors="pt").to(self.device)
            # 生成嵌入
            with torch.no_grad():
                outputs = self.embedding_model(**inputs)
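                # Pool with the first token's hidden state (CLS-style). Whether this suits a given
                # embedding model is an assumption; some models expect mean pooling or their own encode method.
                embeddings = outputs.last_hidden_state[:, 0, :].float().cpu().numpy()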
            all_embeddings.append(embeddings)
        # 合并所有批次的嵌入
        all_embeddings = np.vstack(all_embeddings)
        print(f" - 嵌入维度：{all_embeddings.shape}")
        return all_embeddings

    def build_index(self, documents_dir, use_gpu_index=True):
        """ 构建向量索引
        参数:
            documents_dir: 文档目录
            use_gpu_index: 是否使用 GPU 加速的 FAISS 索引
        """
        print(f"构建向量索引（文档目录：{documents_dir}）...")
        # 处理所有文档
        doc_path = Path(documents_dir)
        all_chunks = []
        all_metadata = []
        for file_path in doc_path.glob("**/*"):
            if file_path.is_file() and file_path.suffix.lower() in ['.pdf', '.docx', '.pptx', '.html']:
                print(f"\n处理：{file_path.name}")
                try:
                    # 处理文档
                    doc_data = self.process_document(str(file_path))
                    # 分块文本
                    chunks = self.chunk_text(doc_data['text'])
                    # 保存块和元数据
                    for i, chunk in enumerate(chunks):
                        all_chunks.append(chunk)
                        all_metadata.append({'file': str(file_path), 'chunk_id': i, 'total_chunks': len(chunks)})
                    print(f" - 生成 {len(chunks)} 个文本块")
                except Exception as e:
                    print(f" - 处理失败：{e}")
                    continue
        print(f"\n总共处理 {len(all_chunks)} 个文本块")
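        if not all_chunks:
            raise ValueError("No text chunks were produced; check documents_dir and the supported file types")
        # Generate embeddings
        embeddings = self.embed_texts(all_chunks)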
        # 创建 FAISS 索引
        dimension = embeddings.shape[1]
        if use_gpu_index and torch.cuda.is_available():
            # 使用 GPU 加速的 FAISS 索引
            print("创建 GPU 加速的 FAISS 索引...")
            # 创建索引
            cpu_index = faiss.IndexFlatL2(dimension)
            # 转移到 GPU
            gpu_resources = faiss.StandardGpuResources()
            self.index = faiss.index_cpu_to_gpu(gpu_resources, 0, cpu_index)
        else:
            # 使用 CPU 索引
            print("创建 CPU FAISS 索引...")
            self.index = faiss.IndexFlatL2(dimension)
        # 添加向量到索引
        self.index.add(embeddings.astype('float32'))
        # 保存文档和元数据
        self.documents = all_chunks
        self.metadata = all_metadata
        print(f"索引构建完成！")
        print(f" - 索引大小：{self.index.ntotal}")
        print(f" - 向量维度：{dimension}")

    def search(self, query, top_k=5):
        """ 搜索相关文档
        参数:
            query: 查询文本
            top_k: 返回前 K 个结果
        返回:
            results: 搜索结果列表
        """
        if self.index is None:
            raise ValueError("索引未构建，请先调用 build_index()")
        print(f"搜索：{query}")
        # 生成查询嵌入
        query_embedding = self.embed_texts([query])
        # 搜索
        distances, indices = self.index.search(query_embedding.astype('float32'), top_k)
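        # Collect the results
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx < 0:  # FAISS pads with -1 when the index holds fewer than top_k vectors
                continue
            results.append({
                'rank': len(results) + 1,
                'score': float(dist),  # L2 distance: smaller means more similar
                'text': self.documents[idx],
                'metadata': self.metadata[idx]
            })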
        print(f"找到 {len(results)} 个结果")
        return results

    def generate_answer(self, query, llm_model, top_k=3):
        """ 使用 RAG 生成答案
        参数:
            query: 用户查询
            llm_model: 语言模型（如 Nemotron Nano）
            top_k: 检索的文档数量
        返回:
            answer: 生成的答案
        """
        # 检索相关文档
        results = self.search(query, top_k=top_k)
        # 构建上下文
        context = "\n\n".join([f"文档 {r['rank']}:\n{r['text']}" for r in results])
        # 构建提示词
        prompt = f"""基于以下文档回答问题。如果文档中没有相关信息，请说明无法回答。
文档：{context}
问题：{query}
答案:"""
        # 使用 LLM 生成答案
        # 这里假设 llm_model 是一个可调用的模型
        answer = llm_model(prompt)
        return {'answer': answer, 'sources': [r['metadata'] for r in results]}

# 使用示例
if __name__ == "__main__":
    # 初始化 RAG 管道
    rag = DoclingRAGPipeline(
        embedding_model="nvidia/nv-embed-v1",
        device="cuda",
        use_gpu_ocr=True  # 使用 GPU 加速 OCR，4x 性能提升
    )
    # 构建索引
    rag.build_index(
        documents_dir="/path/to/documents",
        use_gpu_index=True  # 使用 GPU 加速的 FAISS 索引
    )
    # 搜索示例
    query = "What are the key features of NVIDIA RTX GPUs?"
    results = rag.search(query, top_k=5)
    print("\n搜索结果:")
    for result in results:
        print(f"\n排名 {result['rank']} (得分：{result['score']:.4f})")
        print(f"文件：{result['metadata']['file']}")
        print(f"内容：{result['text'][:200]}...")
    # RAG 生成答案示例
    # 这里需要一个 LLM 模型，如 Nemotron Nano
    # answer_result = rag.generate_answer(query, llm_model=your_llm_model)
    # print(f"\n答案：{answer_result['answer']}")
    # print(f"来源：{answer_result['sources']}")
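
The chunking step in the pipeline above is a sliding window over token IDs: each chunk holds up to `chunk_size` tokens and shares `overlap` tokens with its predecessor. The sketch below isolates that logic on a plain list so it can be checked without a tokenizer. `chunk_tokens` is a hypothetical standalone helper, and it deliberately differs from the pipeline's method in two ways: it rejects an overlap that is not smaller than the chunk size, and it stops as soon as a window reaches the end of the input, which avoids a short trailing chunk made up only of overlap tokens.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already covers the tail of the input
        start += stride
    return chunks


print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With the defaults of 512 and 50, the stride is 462 tokens, so a 2,000-token document yields five chunks.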

import cv2
import numpy as np
import torch
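# The module name and the interface used below are as given in this example; check the
# NVIDIA Video Effects SDK documentation for the bindings available on your system
from nvidia_vfx import VideoEffects, Effect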

class NVIDIAVideoEffectsProcessor:
    """ NVIDIA 视频效果处理器
    支持 AI 驱动的背景替换、重新照明、眼神接触等效果
    """
    def __init__(self, device="cuda"):
        """ 初始化视频效果处理器
        参数:
            device: 计算设备
        """
        self.device = device
        print("初始化 NVIDIA 视频效果 SDK...")
        # 初始化视频效果引擎
        self.vfx = VideoEffects(device=device)
        # 加载可用的效果
        self.effects = {
            'background_blur': Effect.BACKGROUND_BLUR,  # 背景模糊
            'background_replace': Effect.BACKGROUND_REPLACE,  # 背景替换
            'virtual_background': Effect.VIRTUAL_BACKGROUND,  # 虚拟背景
            'video_relighting': Effect.VIDEO_RELIGHTING,  # 视频重新照明（新增强）
            'eye_contact': Effect.EYE_CONTACT,  # 眼神接触
            'face_tracking': Effect.FACE_TRACKING,  # 面部跟踪
            'super_resolution': Effect.SUPER_RESOLUTION  # 超分辨率
        }
        print("视频效果 SDK 初始化完成！")
        print(f"可用效果：{list(self.effects.keys())}")

    def setup_background_replacement(self, background_image_path=None, blur_strength=0.8):
        """ 配置背景替换效果
        参数:
            background_image_path: 背景图像路径（None 表示使用模糊）
            blur_strength: 模糊强度（0-1）
        """
        print("配置背景替换效果...")
        if background_image_path:
            # 加载背景图像
            background = cv2.imread(background_image_path)
            background = cv2.cvtColor(background, cv2.COLOR_BGR2RGB)
            # 配置虚拟背景
            self.vfx.set_effect(self.effects['virtual_background'], background_image=background)
            print(f" - 使用自定义背景：{background_image_path}")
        else:
            # 配置背景模糊
            self.vfx.set_effect(self.effects['background_blur'], blur_strength=blur_strength)
            print(f" - 使用背景模糊（强度：{blur_strength}）")

    def setup_video_relighting(self, light_direction=(0.0, -1.0, 0.0), light_intensity=1.0, ambient_intensity=0.3):
        """ 配置视频重新照明效果（CES 2026 增强版）
        性能提升 3 倍，模型大小减少 6 倍
        参数:
            light_direction: 光源方向（x, y, z）
            light_intensity: 光源强度
            ambient_intensity: 环境光强度
        """
        print("配置视频重新照明效果（增强版）...")
        # 配置重新照明参数
        self.vfx.set_effect(
            self.effects['video_relighting'],
            light_direction=light_direction,  # 光源方向
            light_intensity=light_intensity,  # 光源强度
            ambient_intensity=ambient_intensity,  # 环境光强度
            use_enhanced_model=True,  # 使用增强模型（CES 2026 新增）
            min_gpu="RTX_3060"  # 最低 GPU 要求：RTX 3060
        )
        print(f" - 光源方向：{light_direction}")
        print(f" - 光源强度：{light_intensity}")
        print(f" - 环境光强度：{ambient_intensity}")
        print(f" - 性能提升：3x")
        print(f" - 模型大小：减少 6x")

    def setup_eye_contact(self, strength=0.9):
        """ 配置眼神接触效果
        自动调整眼睛方向使其看向摄像头
        参数:
            strength: 效果强度（0-1）
        """
        print("配置眼神接触效果...")
        self.vfx.set_effect(self.effects['eye_contact'], strength=strength)
        print(f" - 效果强度：{strength}")

    def process_frame(self, frame):
        """ 处理单个视频帧
        应用所有已配置的效果
        参数:
            frame: 输入帧（BGR 格式）
        返回:
            processed_frame: 处理后的帧
        """
        # 转换为 RGB
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        # 转换为张量
        frame_tensor = torch.from_numpy(frame_rgb).permute(2, 0, 1).unsqueeze(0)
        frame_tensor = frame_tensor.float() / 255.0
        frame_tensor = frame_tensor.to(self.device)
        # 应用效果
        with torch.no_grad():
            processed_tensor = self.vfx.process(frame_tensor)
        # 转换回 numpy
        processed_frame = processed_tensor.squeeze(0).permute(1, 2, 0).cpu().numpy()
        processed_frame = (processed_frame * 255).astype(np.uint8)
        # 转换回 BGR
        processed_frame = cv2.cvtColor(processed_frame, cv2.COLOR_RGB2BGR)
        return processed_frame

    def process_video(self, input_video_path, output_video_path, show_preview=True):
        """ 处理整个视频
        参数:
            input_video_path: 输入视频路径
            output_video_path: 输出视频路径
            show_preview: 是否显示预览
        """
        print(f"处理视频：{input_video_path}")
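        # Open the input video
        cap = cv2.VideoCapture(input_video_path)
        if not cap.isOpened():
            raise IOError(f"Cannot open input video: {input_video_path}")
        # Read video properties; keep fps as a float so rates such as 29.97 are not truncated
        fps = cap.get(cv2.CAP_PROP_FPS)
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))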
        print(f"视频属性:")
        print(f" - 分辨率：{width}x{height}")
        print(f" - 帧率：{fps} fps")
        print(f" - 总帧数：{total_frames}")
        # 创建输出视频
        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
        out = cv2.VideoWriter(output_video_path, fourcc, fps, (width, height))
        # 处理每一帧
        frame_count = 0
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            # 处理帧
            processed_frame = self.process_frame(frame)
            # 写入输出
            out.write(processed_frame)
            # 显示预览
            if show_preview:
                # 创建并排对比
                comparison = np.hstack([frame, processed_frame])
                cv2.imshow('Original | Processed', comparison)
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break
            frame_count += 1
            if frame_count % 30 == 0 and total_frames > 0:  # some containers report a frame count of 0
                progress = (frame_count / total_frames) * 100
                print(f"Progress: {progress:.1f}% ({frame_count}/{total_frames})")
        # 清理
        cap.release()
        out.release()
        cv2.destroyAllWindows()
        print(f"视频处理完成！")
        print(f"输出保存到：{output_video_path}")

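    def process_webcam(self, camera_id=0):
        """ Process webcam input in real time
        Args:
            camera_id: index of the camera to open
        """
        print(f"Starting real-time processing (camera {camera_id})...")
        print("Press 'q' to quit")
        # Open the camera and fail early if it is unavailable
        cap = cv2.VideoCapture(camera_id)
        if not cap.isOpened():
            raise IOError(f"Cannot open camera {camera_id}")
        # Request 1280x720; the driver may fall back to another resolution
        cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
        cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            # Process the frame
            processed_frame = self.process_frame(frame)
            # Display the original and processed frames side by side
            comparison = np.hstack([frame, processed_frame])
            cv2.imshow('Webcam: Original | Processed', comparison)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
        cap.release()
        cv2.destroyAllWindows()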

# Usage example
if __name__ == "__main__":
    # Initialize the processor
    processor = NVIDIAVideoEffectsProcessor(device="cuda")
    # Example 1: background replacement
    print("\n=== Example 1: background replacement ===")
    processor.setup_background_replacement(background_image_path="/path/to/background.jpg")
    # Example 2: video relighting (CES 2026 enhanced version)
    print("\n=== Example 2: video relighting ===")
    processor.setup_video_relighting(
        light_direction=(0.5, -1.0, 0.3),  # light coming from the upper right
        light_intensity=1.2,  # fairly strong light source
        ambient_intensity=0.4  # moderate ambient light
    )
    # Example 3: eye contact
    print("\n=== Example 3: eye contact ===")
    processor.setup_eye_contact(strength=0.9)
    # Process a video file
    processor.process_video(
        input_video_path="input_video.mp4",
        output_video_path="output_video_enhanced.mp4",
        show_preview=True
    )
    # Or process the webcam feed in real time
    # processor.process_webcam(camera_id=0)
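
In the listing above, the release() calls at the end of process_video and process_webcam only run when the loop exits normally; if process_frame raises, the capture and writer handles stay open. One possible way to guarantee clean-up is a small context manager. The sketch below is an illustration only: `released` and `FakeCapture` are hypothetical names that do not belong to OpenCV or to any NVIDIA SDK, and the stand-in class is there so the example runs without OpenCV installed.

```python
from contextlib import contextmanager


@contextmanager
def released(*resources):
    """Yield the resources and call release() on each of them on exit.

    Hypothetical helper (an assumption, not an OpenCV API). It accepts any
    object exposing a release() method, such as cv2.VideoCapture or
    cv2.VideoWriter.
    """
    try:
        yield resources
    finally:
        # Release in reverse order of acquisition, even if the body raised
        for resource in reversed(resources):
            resource.release()


class FakeCapture:
    """Stand-in for cv2.VideoCapture so the sketch runs without OpenCV."""

    def __init__(self):
        self.released = False

    def release(self):
        self.released = True


if __name__ == "__main__":
    cap, out = FakeCapture(), FakeCapture()
    try:
        with released(cap, out):
            # Simulate process_frame failing in the middle of the loop
            raise RuntimeError("simulated failure while processing a frame")
    except RuntimeError:
        pass
    # Both handles were released despite the exception
    print(cap.released, out.released)
```

With real OpenCV objects the same pattern would read `with released(cap, out): ...` around the frame loop, replacing the explicit release() calls after it.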
