LLM Fine-Tuning in Practice: Training Code-Llama on a Custom Code Dataset | 极客日志



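Summary (AI-generated): A complete workflow for fine-tuning the Code-Llama model with PEFT and LoRA. It covers data format preparation, model loading and quantization, training-parameter setup, running the fine-tune, and inference testing. A worked example shows how to adapt a general-purpose code model to a specific business scenario and improve its code generation. The walkthrough runs from CSV cleaning and JSONL conversion, through LoRA parameter configuration and the Trainer training loop, to loading the adapter for inference. It is aimed at developers who need customized code capabilities.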

RustyLab, published 2025/2/6, updated 2026/6/1224 views


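Overview

Large language models (LLMs) perform well at code generation, but a general-purpose model often fails to match a particular project's coding conventions or business logic. Fine-tuning on private data can strengthen the model's code understanding and generation in that domain. This article uses Code-Llama-7b-Instruct-hf as the example and shows how to fine-tune it efficiently with the LoRA technique from the PEFT (Parameter-Efficient Fine-Tuning) library.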

Environment Setup

Make sure the required dependencies are installed. The main ones are Python, PyTorch, Transformers, and the PEFT library.

pip install torch transformers peft datasets accelerate bitsandbytes

If you load the model with 8-bit quantization, make sure the bitsandbytes version is compatible with your CUDA environment.
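Before debugging a CUDA mismatch, it can help to record which versions are actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the pip command above.

```python
from importlib import metadata

# Package names taken from the pip command in this section.
PACKAGES = ["torch", "transformers", "peft", "datasets", "accelerate", "bitsandbytes"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

The report only lists versions; whether a given bitsandbytes build supports your CUDA toolkit still has to be checked against that library's own release notes.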

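Data Preparation

Fine-tuning data is usually stored in JSONL format, one JSON object per line. For code tasks, each record should contain a context (context), a question (question), and an answer (answer).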
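To make the record layout concrete, here is a small sketch that serializes one record to a JSONL line and validates it on the way back in. The sample question and answer are invented for illustration, and `parse_jsonl_line` is a hypothetical helper, not part of any library.

```python
import json

# Field names taken from the description above.
REQUIRED_KEYS = ("context", "question", "answer")


def parse_jsonl_line(line):
    """Parse one JSONL line and check that the three expected string fields exist."""
    record = json.loads(line)
    for key in REQUIRED_KEYS:
        if not isinstance(record.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    return record


# Illustrative record; the content is made up for this example.
sample = {
    "context": "",
    "question": "Write a function that adds two numbers.",
    "answer": "def add(a, b):\n    return a + b",
}

# json.dumps escapes the newline inside the answer, so the record stays on one line.
line = json.dumps(sample, ensure_ascii=False)
print(line)
print(parse_jsonl_line(line) == sample)
```

Multi-line code in the answer field is safe because the JSON encoder escapes line breaks, which is what keeps the one-record-per-line property intact.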

The following script converts raw CSV data into JSONL files for the training and validation sets, shuffling the rows before splitting them.

import pandas as pd
import random
import json

# Read the raw data (the CSV is expected to have 'prompt' and 'Code' columns)
data = pd.read_csv('dataset.csv')
train_data = data[['prompt', 'Code']].values.tolist()
random.shuffle(train_data)

# Split into training and validation sets (8:2)
train_num = int(0.8 * len(train_data))

# Write the first 80% of the shuffled rows to the training file
with open('train_data.jsonl', 'w', encoding='utf-8') as f:
    for d in train_data[:train_num]:
        record = {
            'context': '',
            'question': d[0],
            'answer': d[1]
        }
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

# Write the remaining 20% to the validation file
with open('val_data.jsonl', 'w', encoding='utf-8') as f:
    for d in train_data[train_num:]:
        record = {
            'context': '',
            'question': d[0],
            'answer': d[1]
        }
        f.write(json.dumps(record, ensure_ascii=False) + '\n')
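The script above shuffles with the global random state, so every run produces a different split. If a reproducible split is wanted, one option is a seeded shuffle on a copy of the rows. This is a sketch under that assumption; `split_records` and the seed value are illustrative choices, not part of the original script.

```python
import random


def split_records(records, train_ratio=0.8, seed=42):
    """Shuffle a copy of records with a fixed seed and split it into (train, val)."""
    shuffled = list(records)
    # A private Random instance avoids touching the global random state.
    random.Random(seed).shuffle(shuffled)
    cut = int(train_ratio * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Toy rows standing in for the [prompt, Code] pairs read from the CSV.
rows = [[f"prompt {i}", f"code {i}"] for i in range(10)]
train, val = split_records(rows)
print(len(train), len(val))
```

Calling the function twice with the same seed yields the same split, which makes later comparisons between training runs easier to interpret.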


Model Initialization

from datetime import datetime
import os
import sys
import torch

from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    # Recent PEFT releases expose this helper as prepare_model_for_kbit_training;
    # the alias keeps the call name used later in the script.
    prepare_model_for_kbit_training as prepare_model_for_int8_training,
)
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
from datasets import load_dataset

# Load the datasets
train_dataset = load_dataset('json', data_files='train_data.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='val_data.jsonl', split='train')

# Base model: a local directory, or the Hub ID codellama/CodeLlama-7b-Instruct-hf
base_model = 'CodeLlama-7b-Instruct-hf'

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,  # load the weights in 8-bit via bitsandbytes
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
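The reason for loading in 8-bit is memory. A rough back-of-the-envelope estimate of the weight footprint is sketched below; the parameter count of about 6.7 billion is an assumption for a 7B Llama-family model, and the figure covers weights only (no activations, LoRA parameters, or optimizer state).

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: a 7B Llama-family model has roughly 6.7e9 parameters.
NUM_PARAMS = 6.7e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 2)  # 2 bytes per parameter
int8_gib = weight_memory_gib(NUM_PARAMS, 1)  # 1 byte per parameter

print(f"fp16 weights: ~{fp16_gib:.1f} GiB")
print(f"int8 weights: ~{int8_gib:.1f} GiB")
```

Actual usage will be higher than the int8 figure, since quantized loading typically keeps some modules in higher precision and training adds activations and optimizer state on top.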

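Pre-Fine-Tuning Baseline Test

# Batched generation needs a padding token; reuse EOS for this quick check
tokenizer.pad_token = tokenizer.eos_token
prompt = """You are programming coder.

Now answer the question:

{}"""
# Sample a few questions from the training set (assumes it has at least 68 rows)
prompts = [prompt.format(train_dataset[i]['question']) for i in [1, 20, 32, 45, 67]]

model_input = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

# Generate with the untouched base model to get a baseline for later comparison
model.eval()
with torch.no_grad():
    outputs = model.generate(**model_input, max_new_tokens=300)
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(outputs)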

Fine-Tuning Configuration

# Append EOS to every training example and pad on the left with token id 0
tokenizer.add_eos_token = True
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"

def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding=False,
        return_tensors=None,
    )
    # Self-supervised objective: the labels are a copy of the input ids
    result["labels"] = result["input_ids"].copy()
    return result

def generate_and_tokenize_prompt(data_point):
    # Fill the training template with one question/answer pair
    full_prompt = f"""You are a powerful programming model. Your job is to answer questions about a database. You are given a question.

You must output the code that answers the question.

### Input:
{data_point["question"]}

### Response:
{data_point["answer"]}
"""
    return tokenize(full_prompt)

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)
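The tokenization above copies the input ids into the labels, so the loss is computed on the instruction text as well as the answer. A common variant, not used in this article, masks the prompt tokens with -100 so that only the response contributes to the loss. The sketch below shows the idea on plain lists of token ids; the function name and the toy ids are hypothetical.

```python
# -100 is the label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def mask_prompt_labels(prompt_ids, response_ids):
    """Build input_ids and labels where only response tokens are supervised."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


# Toy token ids: three prompt tokens followed by three response tokens.
example = mask_prompt_labels([1, 10, 11], [20, 21, 2])
print(example["input_ids"])
print(example["labels"])
```

With a real tokenizer, tokenizing the prompt and the response separately can differ slightly from tokenizing the concatenated text at the boundary, so the prompt length should be measured carefully before masking.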

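# Switch to training mode and prepare the quantized model for training
model.train()
model = prepare_model_for_int8_training(model)

# LoRA adapters on the four attention projection matrices
config = LoraConfig(
    r=16,                # rank of the low-rank update
    lora_alpha=16,       # scaling numerator; the update is scaled by lora_alpha / r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

# With several GPUs, tell Trainer the model is already spread across devices
if torch.cuda.device_count() > 1:
    model.is_parallelizable = True
    model.model_parallel = True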
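To see how small the trainable part is, the LoRA parameter count can be worked out by hand: each adapted linear layer gains two matrices, A with shape r x in_features and B with shape out_features x r. The sketch below assumes a hidden size of 4096, 32 decoder layers, and square 4096 x 4096 attention projections for the 7B model; these dimensions are assumptions, so check them against the model's config file.

```python
def lora_param_count(r, in_features, out_features):
    """Parameters added by one LoRA adapter: A (r x in) plus B (out x r)."""
    return r * (in_features + out_features)


# Assumed dimensions for a 7B Llama-family model; verify against config.json.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

# Values taken from the LoraConfig above.
R = 16
LORA_ALPHA = 16
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]

per_module = lora_param_count(R, HIDDEN_SIZE, HIDDEN_SIZE)
total = per_module * len(TARGET_MODULES) * NUM_LAYERS
scaling = LORA_ALPHA / R

print(f"per adapted layer: {per_module:,}")
print(f"total trainable LoRA parameters: {total:,}")
print(f"update scaling (lora_alpha / r): {scaling}")
```

Under these assumptions the adapters add about 16.8 million parameters, a fraction of a percent of the base model. After `get_peft_model`, calling `model.print_trainable_parameters()` reports the actual figure.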

Training Parameters

# Target batch size per optimizer step, reached through gradient accumulation
batch_size = 128
per_device_train_batch_size = 32
gradient_accumulation_steps = batch_size // per_device_train_batch_size
output_dir = "code-llama-ft"
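The arithmetic behind these three values is simple but easy to get wrong when the numbers change. The sketch below recomputes it with a guard for batch sizes that do not divide evenly; the helper name is hypothetical, and the result describes a single data-parallel worker.

```python
def accumulation_steps(batch_size, per_device_batch_size):
    """Number of micro-batches to accumulate before each optimizer step."""
    if batch_size % per_device_batch_size != 0:
        raise ValueError("batch_size must be a multiple of per_device_batch_size")
    return batch_size // per_device_batch_size


steps = accumulation_steps(128, 32)
effective_batch = 32 * steps

print(steps)            # micro-batches per optimizer step
print(effective_batch)  # examples contributing to each optimizer step
```

If a per-device batch of 32 does not fit in GPU memory, lowering it while keeping `batch_size` fixed raises the accumulation count and leaves the effective batch size unchanged.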

training_args = TrainingArguments(
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    warmup_steps=100,
    max_steps=400,
    learning_rate=3e-4,
    fp16=True,
    logging_steps=10,
    optim="adamw_torch",
    evaluation_strategy="steps",  # newer transformers releases name this argument eval_strategy
    save_strategy="steps",
    eval_steps=20,   # evaluate every 20 optimizer steps
    save_steps=20,   # save a checkpoint every 20 optimizer steps
    output_dir=output_dir,
    load_best_model_at_end=False,
    group_by_length=True,  # batch examples of similar length to reduce padding
    report_to="none",
    run_name=f"codellama-{datetime.now().strftime('%Y-%m-%d-%H-%M')}",
)
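With `warmup_steps=100`, `max_steps=400`, and no scheduler specified, the run falls back to the Trainer's default linear schedule: the learning rate climbs linearly to its peak during warmup and then decays linearly to zero. The following plain-Python sketch reproduces that shape for the values above; it is an approximation written for illustration, not the library's own code.

```python
def linear_schedule_lr(step, base_lr=3e-4, warmup_steps=100, max_steps=400):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 50, 100, 250, 400):
    print(step, linear_schedule_lr(step))

# Evaluation and checkpointing happen every 20 steps up to max_steps.
eval_points = list(range(20, 401, 20))
print(len(eval_points))
```

Because no `save_total_limit` is set, all of those checkpoints stay on disk, so it is worth checking free space before a long run.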

trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=training_args,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)
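The collator is what turns variable-length examples into rectangular tensors. The sketch below imitates its padding behavior in plain Python for the settings used here (pad to a multiple of 8, pad token id 0, labels padded with -100, left-side padding). It is a simplified approximation for illustration; `pad_batch` is a hypothetical helper.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100, multiple_of=8, side="left"):
    """Pad input_ids, attention_mask and labels to a shared length."""
    longest = max(len(f["input_ids"]) for f in features)
    # Round the longest length up to the next multiple of `multiple_of`.
    target = -(-longest // multiple_of) * multiple_of
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = target - len(f["input_ids"])
        ids, labels = list(f["input_ids"]), list(f["labels"])
        mask = [1] * len(ids)
        if side == "left":
            ids = [pad_token_id] * n_pad + ids
            mask = [0] * n_pad + mask
            labels = [label_pad_id] * n_pad + labels
        else:
            ids = ids + [pad_token_id] * n_pad
            mask = mask + [0] * n_pad
            labels = labels + [label_pad_id] * n_pad
        batch["input_ids"].append(ids)
        batch["attention_mask"].append(mask)
        batch["labels"].append(labels)
    return batch


features = [
    {"input_ids": [5, 6, 7], "labels": [5, 6, 7]},
    {"input_ids": list(range(1, 11)), "labels": list(range(1, 11))},
]
batch = pad_batch(features)
print(len(batch["input_ids"][0]), len(batch["input_ids"][1]))
```

Padding the labels with -100 rather than the pad token id matters: it keeps padded positions out of the loss.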

Start Training

# The generation KV cache is not needed during training
model.config.use_cache = False

trainer.train()

# Trainer only writes checkpoint-<step> subdirectories, so save the final
# LoRA adapter to output_dir, where the inference script loads it from
model.save_pretrained(output_dir)
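During training, Trainer writes each checkpoint to a `checkpoint-<step>` subdirectory of the output directory. To load an intermediate adapter, the most recent one has to be picked by step number rather than by name, since `checkpoint-100` sorts before `checkpoint-20` alphabetically. Here is a standard-library sketch; `latest_checkpoint` is a hypothetical helper, and the demo uses a temporary directory with empty folders.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demo on a throwaway directory that mimics the Trainer layout.
with tempfile.TemporaryDirectory() as tmp:
    for step in (20, 100, 40):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    print(os.path.basename(latest_checkpoint(tmp)))
```

Transformers also ships a similar utility, `get_last_checkpoint` in `transformers.trainer_utils`, which may be preferable when the library is already imported.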

Inference Test

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the base model in 8-bit, exactly as during training
base_model = 'CodeLlama-7b-Instruct-hf'
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Attach the fine-tuned LoRA adapter saved in the training output directory
output_dir = "code-llama-ft"
model = PeftModel.from_pretrained(model, output_dir)

# Use the same template as in training, leaving the response empty
eval_prompt = """You are a powerful programming model. Your job is to answer questions about a database. You are given a question.

You must output the code that answers the question.

### Input:
Write a function in Java that takes an array and returns the sum of the numbers in the array, or 0 if the array is empty. Except the number 13 is very unlucky, so it does not count any 13, or any number that immediately follows a 13.

### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    outputs = model.generate(**model_input, max_new_tokens=100)[0]
print(tokenizer.decode(outputs, skip_special_tokens=True))
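The decoded output contains the whole prompt followed by the model's answer, because `generate` returns the input tokens as well. To keep only the generated code, the text can be cut at the `### Response:` marker from the template. This is a minimal sketch; `extract_response` is a hypothetical helper and the decoded string is a made-up stand-in for real model output.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_text, marker=RESPONSE_MARKER):
    """Return the text after the first marker, or the whole text if it is absent."""
    _, found, tail = decoded_text.partition(marker)
    return tail.strip() if found else decoded_text.strip()


# Stand-in for a decoded generation; the answer here is invented.
decoded = (
    "You must output the code that answers the question.\n\n"
    "### Input:\nAdd two numbers.\n\n"
    "### Response:\nint add(int a, int b) { return a + b; }\n"
)
print(extract_response(decoded))
```

An alternative is to slice the generated token ids by the prompt length before decoding, which avoids relying on the marker text surviving tokenization unchanged.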
