GLM4 Fine-Tuning in Practice: A Guide to Named Entity Recognition (NER)

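GLM4 is a large language model open-sourced by the Tsinghua-affiliated Zhipu AI team. Using GLM4 as the base model and instruction-tuning it for high-accuracy named entity recognition (NER) is a very good task for getting started with LLM fine-tuning and building intuition about large models.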

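GPU memory requirements are relatively high. The training script in this article loads the model in bfloat16, so the roughly 9 billion weights alone occupy about 18 GB; choose a GPU with more memory than that for LoRA fine-tuning.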

In this article we instruction-tune the GLM4-9b-Chat model on a Chinese NER dataset, and use SwanLab to monitor training and evaluate the resulting model.

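Key concept 1: What is instruction tuning?

Instruction tuning is a fine-tuning technique for large pretrained language models. Its core goal is to strengthen the model's ability to understand and carry out specific instructions, so that given a natural-language instruction from the user, the model produces the appropriate output or performs the requested task accurately.

Instruction tuning focuses on making the model follow instructions consistently and accurately, which in turn broadens its generalization and usefulness across application scenarios.

In practice, my understanding is that instruction tuning treats the LLM as a smarter, more powerful version of a traditional NLP model (such as BERT), used to reach higher accuracy on NLP tasks. Its applications therefore cover the scenarios once served by earlier NLP models, and many teams even use it to label internet data.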
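To make the idea concrete, an instruction-tuning sample can be thought of as an (instruction, input, output) triple that is rendered into a prompt plus a target completion. The sketch below is illustrative only: the field names and the `render_prompt` / `build_training_text` helpers are assumptions made for this example, and the prompt layout is a generic one, not GLM4's chat template.

```python
# Illustrative sketch: field names and helpers are assumptions, not a standard.
sample = {
    "instruction": "Extract every person name from the sentence and return a JSON list.",
    "input": "Tang Jiaxuan and Shi Guangsheng attended the meeting.",
    "output": '["Tang Jiaxuan", "Shi Guangsheng"]',
}


def render_prompt(record):
    """Join instruction and input into the text the model is asked to complete."""
    return f"{record['instruction']}\n\n{record['input']}\n\n"


def build_training_text(record):
    """Return (prompt, prompt + target); only the target part should be learned."""
    prompt = render_prompt(record)
    return prompt, prompt + record["output"]


prompt, full_text = build_training_text(sample)
print(full_text)
```

During fine-tuning, the loss is typically computed only on the part of `full_text` that follows `prompt`, so the model learns to answer the instruction instead of reproducing it.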

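Key concept 2: What is named entity recognition?

Named entity recognition (NER) is an NLP technique that identifies and classifies the important pieces of information (key terms) mentioned in a text. These entities can be person names, place names, organization names, dates, times, monetary values, and so on. The goal of NER is to turn the unstructured information in text into structured information that a computer can understand and process more easily.

NER is also a very practical technique, widely used in internet data labeling, search engines, recommender systems, knowledge graphs, healthcare, and many other fields.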
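As a small illustration of "unstructured to structured", the sketch below turns a sentence plus a hand-written list of entity mentions into records with character offsets. The mention list, the label names, and the `to_structured` helper are assumptions for this example; a real NER model would predict the mentions instead of receiving them.

```python
# Illustrative sketch: mentions are hand-labeled here, not model predictions.
text = "今天亚太经合组织第十二届部长级会议在这里开幕，中国外交部部长唐家璇、外经贸部部长石广生出席了会议。"
mentions = [("亚太经合组织", "ORG"), ("唐家璇", "PER"), ("石广生", "PER")]


def to_structured(text, mentions):
    """Locate each mention in the text and return records with character spans."""
    records = []
    for surface, label in mentions:
        start = text.find(surface)  # first occurrence only, enough for a demo
        if start == -1:
            continue  # mention not present in the text
        records.append(
            {"text": surface, "label": label, "start": start, "end": start + len(surface)}
        )
    return records


records = to_structured(text, mentions)
for record in records:
    print(record)
```

Each record can be checked against the source with `text[record["start"]:record["end"]]`, which is the property downstream systems such as search indexes or knowledge graphs rely on.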

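1. Environment setup

This tutorial requires Python >= 3.8 and an NVIDIA GPU. Memory needs are not small: in bfloat16 the 9B model's weights alone take about 18 GB, before counting activations and LoRA optimizer state.

Install the following Python libraries. Before doing so, make sure PyTorch and CUDA are already set up in your environment: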
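As a rough sanity check on GPU memory, weight storage can be estimated as parameter count times bytes per parameter. The sketch below assumes a nominal 9e9 parameters (taken from the "9b" in the model name; the exact count differs) and ignores activations, gradients, optimizer state, and framework overhead, so the numbers are a lower bound, not a requirement.

```python
# Back-of-the-envelope estimate; NUM_PARAMS is a nominal assumption.
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


NUM_PARAMS = 9e9  # nominal value implied by the model name

bf16_gib = weight_memory_gib(NUM_PARAMS, 2)    # bfloat16: 2 bytes per weight
int4_gib = weight_memory_gib(NUM_PARAMS, 0.5)  # 4-bit quantized: half a byte

print(f"bfloat16 weights: {bf16_gib:.1f} GiB")
print(f"4-bit weights:    {int4_gib:.1f} GiB")
```

The gap between the two figures is why quantized loading is a common option on smaller cards; this article's script does not use it.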

pip install swanlab modelscope transformers datasets peft pandas accelerate tiktoken

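This tutorial was tested with modelscope 1.14.0, transformers 4.41.2, datasets 2.18.0, peft 0.11.1, accelerate 0.30.1, swanlab 0.3.11, and tiktoken 0.7.0.

Virtual environment recommendation: to avoid dependency conflicts, create an isolated environment with conda or venv (the last command installs a CUDA 11.8 build of PyTorch):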
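Before training, it can help to confirm what is actually installed. The sketch below uses only the standard library's `importlib.metadata` to report package versions; the `installed_versions` and `cuda_available` helper names are made up for this example, and the torch check is skipped gracefully if torch is missing.

```python
from importlib import metadata

# Package names as used in the pip command above.
REQUIRED = [
    "swanlab", "modelscope", "transformers", "datasets",
    "peft", "pandas", "accelerate", "tiktoken",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def cuda_available():
    """Return True only if torch is importable and sees a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


report = installed_versions(REQUIRED)
for name, version in report.items():
    print(f"{name:14s} {version or 'NOT INSTALLED'}")
```

Calling `cuda_available()` afterwards tells you whether the training script will find a GPU at all.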

conda create -n glm4_ner python=3.9
conda activate glm4_ner
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

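2. Preparing the dataset

This tutorial uses the chinese_ner_sft dataset from HuggingFace, which is mainly used to train named entity recognition models.

chinese_ner_sft combines several hundred thousand samples of different sources and types; it is probably the most comprehensive Chinese NER dataset I have seen.

We do not need all of it for this training run. We use only the CCFBDCI subset (a robustness evaluation dataset for Chinese NER algorithms), which is annotated with four entity types: LOC (location), GPE (geographic entity), ORG (organization), and PER (person). A sample record looks like this: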

{
  "text": "今天亚太经合组织第十二届部长级会议在这里开幕，中国外交部部长唐家璇、外经贸部部长石广生出席了会议。",
  "entities": [
    {
        "start_idx":

import json
import os

import pandas as pd
import swanlab
import torch
from datasets import Dataset
from modelscope import snapshot_download, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model
from swanlab.integration.huggingface import SwanLabCallback
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq

# One shared prompt, so training and inference see exactly the same instruction.
NER_PROMPT = """你是一个文本实体识别领域的专家，你需要从给定的句子中提取地点; 人名; 地理实体; 组织实体。以 json 格式输出，如 {"entity_text": "南京", "entity_label": "地理实体"} 注意：1. 输出的每一行都必须是正确的 json 字符串。2. 找不到任何实体时，输出"没有找到任何实体"。"""


def dataset_jsonl_transfer(origin_path, new_path):
    """Convert the raw dataset into the instruction format used for fine-tuning."""
    messages = []
    match_names = ["地点", "人名", "地理实体", "组织"]

    # Read the original JSONL file
    with open(origin_path, "r", encoding="utf-8") as file:
        for line in file:
            # Parse one JSON record per line
            data = json.loads(line)
            input_text = data["text"]
            entities = data["entities"]

            entity_sentence = ""
            for entity in entities:
                entity_json = dict(entity)
                entity_text = entity_json["entity_text"]
                entity_names = entity_json["entity_names"]
                # Pick the first name that is one of the four target labels
                entity_label = None
                for name in entity_names:
                    if name in match_names:
                        entity_label = name
                        break
                if entity_label is None:
                    # No target label matched: skip instead of reusing a stale label
                    continue
                entity_sentence += f"{{\"entity_text\": \"{entity_text}\", \"entity_label\": \"{entity_label}\"}}"

            if entity_sentence == "":
                entity_sentence = "没有找到任何实体"

            message = {
                "instruction": NER_PROMPT,
                "input": f"文本:{input_text}",
                "output": entity_sentence,
            }
            messages.append(message)

    # Save the restructured JSONL file
    with open(new_path, "w", encoding="utf-8") as file:
        for message in messages:
            file.write(json.dumps(message, ensure_ascii=False) + "\n")


def process_func(example):
    """Tokenize one sample; meant to be called through dataset.map."""
    MAX_LENGTH = 384

    instruction = tokenizer(
        f"<|system|>\n{NER_PROMPT}<|endoftext|>\n<|user|>\n{example['input']}<|endoftext|>\n<|assistant|>\n",
        add_special_tokens=False,
    )
    response = tokenizer(f"{example['output']}", add_special_tokens=False)

    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]
    # Prompt tokens are set to -100 so the loss covers only the response
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]

    if len(input_ids) > MAX_LENGTH:  # truncate over-long samples
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]

    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


def predict(messages, model, tokenizer):
    """Run inference on one test conversation and return the model's reply."""
    device = "cuda"
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(
        model_inputs.input_ids, max_new_tokens=512
    )
    # Drop the prompt tokens and keep only the newly generated ones
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)
    return response


model_id = "ZhipuAI/glm-4-9b-chat"
model_dir = "./ZhipuAI/glm-4-9b-chat/"

# Download the GLM4 model from modelscope into a local directory
model_dir = snapshot_download(model_id, cache_dir="./", revision="master")

# Load the tokenizer and model weights with Transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.enable_input_require_grads()  # required when gradient checkpointing is enabled

# Load and preprocess the training and test data
train_dataset_path = "ccfbdci.jsonl"
train_jsonl_new_path = "ccf_train.jsonl"

if not os.path.exists(train_jsonl_new_path):
    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)

# Training set: everything after the first 10% of the data
total_df = pd.read_json(train_jsonl_new_path, lines=True)
train_df = total_df[int(len(total_df) * 0.1):]
train_ds = Dataset.from_pandas(train_df)
train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)

# Configure LoRA
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "activation_func", "dense_4h_to_h"],
    inference_mode=False,  # training mode
    r=8,  # LoRA rank
    lora_alpha=32,  # LoRA alpha; see the LoRA paper for its role
    lora_dropout=0.1,  # dropout ratio
)

# Wrap the model with peft
model = get_peft_model(model, config)

# Configure the Transformers training arguments
args = TrainingArguments(
    output_dir="./output/GLM4-NER",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    logging_steps=10,
    num_train_epochs=2,
    save_steps=100,
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True,
    report_to="none",
)

# Set up the SwanLab callback for Transformers
swanlab_callback = SwanLabCallback(
    project="GLM4-NER-fintune",
    experiment_name="GLM4-9B-Chat",
    description="Fine-tune Zhipu GLM4-9B-Chat on an NER dataset for key entity recognition.",
    config={
        "model": model_id,
        "model_dir": model_dir,
        "dataset": "qgyd2021/chinese_ner_sft",
    },
)

# Set up the Transformers Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
    callbacks=[swanlab_callback],
)

# Start training
trainer.train()

# Test the model on 20 random samples from the held-out first 10%
test_df = total_df[:int(len(total_df) * 0.1)].sample(n=20)

test_text_list = []
for index, row in test_df.iterrows():
    instruction = row['instruction']
    input_value = row['input']

    messages = [
        {"role": "system", "content": f"{instruction}"},
        {"role": "user", "content": f"{input_value}"}
    ]

    response = predict(messages, model, tokenizer)
    messages.append({"role": "assistant", "content": f"{response}"})
    result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"
    test_text_list.append(swanlab.Text(result_text, caption=response))

# Log the test results
swanlab.log({"Prediction": test_text_list})

# Finish the SwanLab run
swanlab.finish()
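The script above only logs raw predictions. One possible way to turn them into a number is to parse each output back into (entity_text, entity_label) pairs and compare them with the reference answer. The sketch below assumes the output format produced by `dataset_jsonl_transfer` (JSON objects written back to back, or the "no entity" sentence); `parse_entities` and `precision_recall_f1` are hypothetical helpers for this example, not part of the original code.

```python
import json

NO_ENTITY = "没有找到任何实体"


def parse_entities(output):
    """Parse back-to-back JSON objects into a set of (text, label) pairs."""
    output = output.strip()
    if not output or output == NO_ENTITY:
        return set()
    decoder = json.JSONDecoder()
    entities, pos = set(), 0
    while pos < len(output):
        if output[pos].isspace():  # raw_decode does not skip leading whitespace
            pos += 1
            continue
        try:
            obj, pos = decoder.raw_decode(output, pos)
        except json.JSONDecodeError:
            break  # stop at the first malformed fragment
        if isinstance(obj, dict):
            text, label = obj.get("entity_text"), obj.get("entity_label")
            if isinstance(text, str) and isinstance(label, str):
                entities.add((text, label))
    return entities


def precision_recall_f1(pred, gold):
    """Exact-match scores for one sample; both arguments are sets of pairs."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = parse_entities(
    '{"entity_text": "唐家璇", "entity_label": "人名"}'
    '{"entity_text": "石广生", "entity_label": "人名"}'
    '{"entity_text": "亚太经合组织", "entity_label": "组织"}'
)
pred = parse_entities(
    '{"entity_text": "唐家璇", "entity_label": "人名"}'
    '{"entity_text": "外交部", "entity_label": "组织"}'
)
precision, recall, f1 = precision_recall_f1(pred, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For a whole test set, summing true positives, predictions, and gold entities across samples before dividing gives a micro-averaged score, which is usually more stable than averaging per-sample F1.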
