from deepeval.test_case import LLMTestCase
# A single LLMTestCase pairing a source document (the input) with the
# candidate summary to be evaluated (the actual output).
original_text = """In the rapidly evolving digital landscape, the proliferation of artificial intelligence (AI) technologies has been a game-changer in various industries, ranging from healthcare to finance. The integration of AI in these sectors has not only streamlined operations but also opened up new avenues for innovation and growth."""
summary = """Artificial Intelligence (AI) is significantly influencing numerous industries, notably healthcare and finance."""

test_case = LLMTestCase(input=original_text, actual_output=summary)
from deepeval.test_case import LLMTestCase
# Hypothetical test data from your test dataset, containing the original
# text and the summary to evaluate for a summarization task.
test_data = [
    {
        "original_text": "...",
        "summary": "..."
    },
    {
        "original_text": "...",
        "summary": "..."
    }
]

# Convert each raw record into an LLMTestCase: the source document becomes
# the input and its candidate summary the actual_output under evaluation.
test_cases = []
for data in test_data:
    test_case = LLMTestCase(
        input=data.get("original_text", None),
        # BUG FIX: previously read data.get("input", None) — the records have
        # no "input" key, so actual_output was silently always None. The
        # "summary" field is what the metric must evaluate.
        actual_output=data.get("summary", None)
    )
    test_cases.append(test_case)
最后,批量遍历单元测试用例,使用 DeepEval 与 Pytest 集成,并执行测试文件:
import pytest
from deepeval.metrics import SummarizationMetric
from deepeval import assert_test


# SYNTAX FIX: the decorator's closing parenthesis and the function header
# were fused onto one line as `)deftest_summarization(...)` (also missing
# the space after `def`), which is invalid Python.
@pytest.mark.parametrize(
    "test_case",
    test_cases,
)
def test_summarization(test_case: LLMTestCase):
    """Assert that each parametrized test case passes the summarization metric."""
    metric = SummarizationMetric()
    assert_test(test_case, [metric])
然而,在 LLMs 的准确性测试方面可能需要更加微妙的处理,因为目标标签可能并非非黑即白,对或错。当然,对于像 MMLU 这样的基准测试,目标标签实际上是多项选择题的答案,可以通过精确匹配来轻松量化性能,但在其他情况下我们需要采用更好的方法。例如,考虑期望输出为 'The quick brown fox jumps over the lazy dog.',而模型实际输出为 'A quick brown fox leaps over a lazy dog.' 的情况:两者语义等价,但精确匹配会将其判定为错误。
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams, LLMTestCase

# The judge compares these two fields of the test case.
_correctness_params = [
    LLMTestCaseParams.ACTUAL_OUTPUT,
    LLMTestCaseParams.EXPECTED_OUTPUT,
]

# G-Eval "Correctness": an LLM judge scores the actual output against the
# expected output; strict_mode makes the verdict a binary pass/fail.
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine if the actual output is correct with regard to the expected output.",
    evaluation_params=_correctness_params,
    strict_mode=True,
)

# A short QA example whose answer is right in substance but not an exact
# string match against the expected output.
test_case = LLMTestCase(
    input="The dog chased the cat up the tree. Who went up the tree?",
    actual_output="Cat",
    expected_output="The cat",
)

correctness_metric.measure(test_case)
print(correctness_metric.is_successful())
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams, LLMTestCase

# The judge compares these two fields of the test case.
_similarity_params = [
    LLMTestCaseParams.ACTUAL_OUTPUT,
    LLMTestCaseParams.EXPECTED_OUTPUT,
]

# G-Eval "Similarity": unlike the strict correctness metric, this judges
# whether the two outputs are semantically equivalent.
similarity_metric = GEval(
    name="Similarity",
    criteria="Determine if the actual output is semantically similar to the expected output.",
    evaluation_params=_similarity_params,
)

# Same QA example: "Cat" vs "The cat" — different strings, same meaning.
test_case = LLMTestCase(
    input="The dog chased the cat up the tree. Who went up the tree?",
    actual_output="Cat",
    expected_output="The cat",
)

similarity_metric.measure(test_case)
print(similarity_metric.is_successful())