基于大语言模型的 Aspect-Based Sentiment Analysis 数据标注实践

探讨了使用大语言模型（LLM）进行 Aspect-Based Sentiment Analysis（ABSA）数据标注的实践。文章对比了传统人工标注与基于 LLM 的流程，分析了 GPT-4、GPT-3.5 及 GPT-4o 在成本、速度和性能上的差异。实验显示 GPT-4 标注质量高但成本高，而监督微调（SFT）虽成本低速度快但性能稍逊。文中提供了详细的提示词设计方案、Python API 调用示例以及针对一致性、幻觉和 Token 限制的实施建议，为技术团队选择合适的标注方案提供参考。

云间漫步发布于 2025/2/7更新于 2026/7/2134 浏览

基于大语言模型的数据标注实践：以 ABSA 任务为例

摘要

数据标注是大语言模型（LLMs）的重要应用场景之一。本文分享了使用 ChatGPT（API 版本 3.5 和 4）进行 Aspect-Based Sentiment Analysis（ABSA，即基于方面的情感分析）任务的实践经验与见解。选择 ABSA 作为示例，是因为它是一项具有挑战性的 NLP 任务，且团队此前已基于 BERT 处理过类似项目。

关键要点

LLMs 可以有效执行数据标注任务，表现水平接近人类，无需人工标注和传统模型训练，从而节省时间和成本。
使用 GPT-4 标注 200 万条评论，配合长 Few-Shot 提示词，成本约为 3 万美元。
监督微调（SFT）更便宜且更快，但性能远不如 GPT-4（注：这可能与微调数据质量有关）。

主要议题

不同的标注过程：人工 vs 基于 LLM
不同的方法（提示词工程 vs 小批量处理 vs 微调）及其时间/成本影响
不同方法的性能比较

ABSA 任务和数据集

基于方面的情感分析（ABSA）旨在识别和提取产品或服务特定方面的情感。该领域有许多论文、数据集和比赛。

例如，以下是一条餐厅评论，包含四个方面的情感：

这个地方非常酷，装饰也很棒。饮料还可以，但有点贵。

氛围：正面
食物：中性
价格：负面
服务：未提及

我们为一个研究项目创建了一个酒店评论的 ABSA 数据集，包含约 200 万条酒店评论，涉及三个方面：

员工服务
服务机器人的服务
人机互动

我们雇用了数据标注员手动标注了约 2.5 万条评论，并训练了一个模型来标注其余的数据集。

传统标注流程

一个 ABSA 任务通常包括以下步骤：

手动标注一部分数据：分两步进行，首先定义方向，然后标注相应的情感。检查标注的一致性，如果人手不够则需要更多的标注员。
模型训练与评估：使用标注的数据训练一个模型并检查性能（如有需要则标注更多数据）。
预测：使用训练好的模型预测其余的数据。

这个过程非常耗费人力和成本。例如，我们手动标注了约 2.5 万条评论，几位标注员花了几周时间完成。

参考资源：BERT-LSTM-based-ABSA

基于 LLM 的标注流程

使用 LLM 进行 ABSA 的过程如下：

构建测试集：手动标注一个小的测试数据集，例如几百条评论。
编写提示词：编写一个 Few-Shot 标注提示词并标注 100 条评论。
审查与迭代：审查初步的标注结果，将任何错误标注的示例，特别是有挑战性的案例，作为补充示例加入 Few-Shot 提示词。根据需要重复此过程。
全量标注：使用最终提示词标注其余的数据集。

提示词工程设计

提示词按照上述步骤构建，最终结构如下所示。设计时需注意明确维度定义、情感类别及输出格式。

You are an experienced data labeling engineer with extensive experience in labeling hotel reviews. Your task is to classify a review based on three dimensions, with four categories: positive, negative, neutral, and not mentioned.

The definitions and examples of the three dimensions are as follows:

Dimension 1: Quality of hotel staff service
Definition: customer perceptions directly related to staff behavior or attitude, such as timely service, skilled, knowledgeable, professional, polite, caring, understanding, sincere, helpful, etc.
Examples:
Review: The cleaning lady cleans in a timely manner.
Sentiment: Positive
Review: Staff were testing robots in the hallway, the noise was very loud and annoying, and the front desk did nothing about it!
Sentiment: Negative
[more examples]...

Dimension 2: Quality of robot service
Definition: customer perceptions of robot functionality or perceptions of the service result after using the robot
Examples:
Review: The robot is very convenient
Sentiment: Positive
Review: The robot delivers too slowly
Sentiment: Negative
[more examples]...

Dimension 3: Human-robot interaction perception
Definition: customer perceptions other than robot functionality, such as robot social intelligence (communication understanding ability), robot social existence (making one feel it has human characteristics or experiences a human can bring), robot design and novelty (voice, and posture freshness, curiosity, advanced, coolness).
Examples:
Review: The little robot speaks adorably, too cute
Sentiment: Positive
Review: The robot's voice is too loud and noisy;
Sentiment: Negative
[more examples]...

Now, classify the sentiment of the following review into three dimensions using a JSON object as the output method, with "employee_service", "robot_service", "human_robot_interaction" as the keys and the value is one of positive/negative/neutral/unknown.

Here is the hotel review:

模型	输入 Token 成本估算	单条评论成本
GPT-4	$0.015	$0.015
GPT-3.5	$0.00075	$0.00075
GPT-4o	$0.0075	$0.0075

基于大语言模型的 Aspect-Based Sentiment Analysis 数据标注实践

基于大语言模型的数据标注实践：以 ABSA 任务为例

摘要

关键要点