Code Review in Practice with Large Models and a Knowledge Base | 极客日志

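AI-generated summary: A Code Review practice built on open-source large models and a private knowledge base. To address code security and compliance concerns and the low efficiency of manual review, it deploys open-source large models (such as ChatGLM2 and Llama2) privately and pairs them with a vector database that holds an internal knowledge base, so that code changes are reviewed automatically. The system plugs into Gitlab CI, supports custom knowledge bases built from Feishu documents, and uses text embeddings and similarity search to give the model more context. Comments are attached precisely to the changed lines, data never leaves the intranet, and there is no limit on the number of calls. The write-up covers model selection, knowledge base design, prompt design, and diff parsing logic, with the aim of improving code quality and review efficiency.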

修罗 · published 2025/2/7 · updated 2026/6/1334 views

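Background

💡 The idea came up during a Code Review, when I asked Claude which of two ways of writing a piece of code was more elegant. It made me wonder: could AI assist us with Code Review?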

Code Review Idea

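Pain points

Information security and compliance: sending company code directly to ChatGPT / Claude raises security and compliance problems. To use ChatGPT / Claude at all, the code has to be desensitized first so that only the abstract logic is shared, and that often costs more time than it saves.

Less than 20 days after Samsung introduced ChatGPT, it was reported to have suffered 3 leaks of confidential chip data.

Low-quality code eats up time: the creator business line has at least 10 to 20 MRs a day that need CR. Unit tests and Lint filter out some low-level mistakes at submission time, but other issues (whether the code is reasonable, experience-based judgment, the business logic an MR touches, and so on) still take a lot of reviewer time. Running an automated CR first and a manual CR afterwards can greatly improve CR efficiency.

Team Code Review standards are not enforced: in most teams the Code Review standard exists only on paper and is passed along by word of mouth; no tool strictly enforces it.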

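Introduction

In one sentence: a Code Review practice based on an open-source large model plus a knowledge base, something like a code review assistant (CR Copilot).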

System Overview

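Features

It complies with company security rules: no code leaves the intranet, and all inference runs inside the intranet.

🌈 Works out of the box: built on Gitlab CI; a dozen or so lines of configuration are enough to hook it up and run CR on MRs.

🔒 Data security: the open-source large model is deployed privately and isolated from the public internet, so the CR process runs entirely inside the intranet.

♾ No call limits: it is deployed on an internal platform, so the only cost is GPU rental.

📚 Custom knowledge base: the CR assistant learns from the Feishu documents you provide, uses the matching passages as context, and reviews the code changes against them. This greatly improves CR accuracy and keeps the review in line with the team's own CR standards.

🎯 Comments on changed lines: the CR assistant posts its findings as comments on the changed lines of code, and Gitlab CI notifications let you see those comments sooner.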

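Glossary

Term	Definition
CR / Code Review	More and more companies require their engineering teams to do Code Review (CR for short) during development. It safeguards code quality while encouraging communication among team members and raising everyone's coding skills.
LLM / large language model	Large Language Models (LLMs) are neural network models in natural language processing trained on large amounts of text data; they can generate high-quality text and understand language. Examples include GPT and BERT.
AIGC	Short for Artificial Intelligence Generated Content: using NLP, NLG, computer vision, speech technology, and so on to generate text, images, video, and other content. It is the mode of content production that follows UGC and PGC, in which AI generates content automatically. Advances in the underlying AIGC technology are driving a rapid emergence of applications across content types (modalities) and vertical domains.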
LLaMA

Vector Stores

Vector database	URL	GitHub Stars	Language	Cloud
chroma	https://github.com/chroma-core/chroma	8.5K	Python	❌
milvus	https://github.com/milvus-io/milvus	22.8K	Go/Python/C++	✅
pinecone	https://www.pinecone.io/	❌	❌	✅
qdrant	https://github.com/qdrant/qdrant	12.7K	Rust	✅
typesense	https://github.com/typesense/typesense	14.4K	C++	❌
weaviate	https://github.com/weaviate/weaviate	7.4K	Go	✅
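
Whichever store from the table is used, the question it answers is the same: given the vector for a query, which stored vectors are closest? Here is a minimal sketch of that similarity search in plain Python. It assumes cosine similarity as the distance measure and uses tiny hand-made vectors in place of real embeddings; the function names and sample texts are illustrative, not taken from the original system.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Return the k stored texts most similar to the query vector, best first."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in store]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real ones would come from an embedding model.
store = [
    ("React hooks must be called at the top level", [0.9, 0.1, 0.0]),
    ("Prefer unknown over any in TypeScript", [0.1, 0.9, 0.0]),
    ("Go errors should be wrapped with context", [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.2, 0.0]  # stands in for the embedding of a React-related query
best = top_k(query_vec, store, k=2)
```

A real vector database does the same ranking with an index, so it does not have to score every stored vector on each query.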

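Official documentation knowledge base (built-in)

Content	Data source
React official documentation	https://react.dev/learn
TypeScript official documentation	https://www.typescriptlang.org/docs/
Rspack official documentation	https://www.rspack.dev/zh/guide/introduction.html
Garfish	https://github.com/web-infra-dev/garfish
Company-internal coding standards for Go / Python / Rust, etc.	…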
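
Before sources like these can be embedded and stored, long documents are usually cut into smaller pieces so that a search returns a focused passage and not a whole page. The sketch below shows one simple way to do that, assuming a fixed character window with some overlap between neighbours. The function name, chunk size, and overlap are illustrative assumptions, not values from the original system.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    chunks = []
    step = chunk_size - overlap  # how far each window advances
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip windows that contain only whitespace
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Example: a 500-character document split into 200-character windows.
document = "".join(str(i % 10) for i in range(500))
chunks = split_into_chunks(document, chunk_size=200, overlap=40)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.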

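Code summary prompt

def build_summary_prompt(model, language, source_code):
    """Build the prompt that asks the model to list the libraries and modules the code uses."""
    # ChatGLM2 and Llama2 expect different conversation markers.
    prefix = "user: " if model == "chatglm2" else "<s>Human: "
    suffix = (
        "assistant(用中文): let's think step by step."
        if model == "chatglm2"
        else "\n</s><s>Assistant(用中文): let's think step by step."
    )
    # The prompt text stays in Chinese because the model is asked to answer in Chinese.
    # In English: "Based on this {language} code, list the tool libraries and module
    # packages it uses. Notes: no similar or duplicate items; items must relate closely
    # to the code; list at least 3 and at most 6; each item must be specific; give the
    # list only, without explaining the libraries or modules; output in Chinese."
    return f"""{prefix}根据这段 {language} 代码，列出关于这段 {language} 代码用到的工具库、模块包。
{language} 代码:
```{language}
{source_code}
```
请注意：
- 知识列表中的每一项都不要有类似或者重复的内容
- 列出的内容要和代码密切相关
- 最少列出 3 个，最多不要超过 6 个
- 知识列表中的每一项要具体
- 列出列表，不要对工具库、模块做解释
- 输出中文
{suffix}"""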
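
The reply to this prompt is a short list of libraries and modules. One plausible use for it, and this is my assumption since the article's own handling is not shown, is to turn each item into a query against the knowledge base. The sketch below parses a numbered or bulleted reply into clean items; the function name and the accepted bullet styles are hypothetical.

```python
import re
from typing import List

# Matches "- item", "* item", "• item", "1. item", "2) item", or "3、item".
_BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)、])\s*(.+?)\s*$")


def parse_knowledge_list(reply: str, max_items: int = 6) -> List[str]:
    """Extract list items from a model reply, dropping case-insensitive duplicates."""
    items = []
    seen = set()
    for line in reply.splitlines():
        match = _BULLET.match(line)
        if not match:
            continue  # ignore prose lines around the list
        item = match.group(1)
        key = item.lower()
        if key in seen:
            continue
        seen.add(key)
        items.append(item)
        if len(items) == max_items:
            break  # the prompt asks for at most 6 items
    return items


reply = "这段代码用到了：\n1. React\n2. axios\n- lodash\n3. react\n"
queries = parse_knowledge_list(reply)
```

Small models do not always follow the formatting rules in a prompt, so a tolerant parser like this is safer than assuming one exact list style.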

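CR prompt

def build_cr_prompt(model, language, context, diff_code):
    """Build the code review prompt from the retrieved context and the diff under review."""
    if model == "chatglm2":
        # chatglm2. In English: "[Instruction] Using the provided context, briefly review
        # the changed {language} code and give short review feedback and suggestions;
        # point out any bugs and possible improvements (no more than 6).
        # [Known information]: ... [Changed code]: ..."
        return f"""user: 【指令】请根据所提供的上下文信息来简要审查{language} 变更代码，进行简短的代码审查和建议，变更代码有任何 bug 缺陷和改进建议请指出（不超过 6 条）。
【已知信息】：{context}

【变更代码】：{diff_code}

assistant: """

    # llama2
    return f"""Human: please briefly review the {language} code changes by learning the provided context to do a brief code review feedback and suggestions. if any bug risk and improvement suggestion are welcome(no more than six)
<context>
{context}
</context>

<code_changes>
{diff_code}
</code_changes>\n</s><s>Assistant: """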
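
Both prompts take a `context` string built from the knowledge base matches, and that string has to fit in the model's context window along with the diff. Here is a minimal sketch of one way to assemble it, assuming the matches arrive best first and that a simple character budget is good enough. The function name, separator, and the 1500-character default are illustrative assumptions, not settings from the original system.

```python
from typing import Iterable


def build_context(snippets: Iterable[str],
                  max_chars: int = 1500,
                  separator: str = "\n---\n") -> str:
    """Join snippets in order until adding the next one would exceed max_chars."""
    parts = []
    used = 0
    for snippet in snippets:
        snippet = snippet.strip()
        if not snippet:
            continue  # skip empty matches
        # The separator only costs characters between two snippets.
        cost = len(snippet) + (len(separator) if parts else 0)
        if used + cost > max_chars:
            break  # keep the best matches, drop the rest
        parts.append(snippet)
        used += cost
    return separator.join(parts)


context = build_context(["aaaa", "bbbb", "cccc"], max_chars=10, separator="|")
```

Stopping at the first snippet that does not fit keeps the ranking intact; a token-based budget would be more exact but depends on the model's tokenizer.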

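Commenting on changed code lines

import re

def parse_diff(diff_text):
    """Parse one file's unified diff into chunks of added, deleted, and context lines."""
    # Reject empty, non-string, or whitespace-only input.
    if not isinstance(diff_text, str) or not diff_text.strip():
        return []

    lines = diff_text.split("\n")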

    result = []
    current_file = None
    current_chunk = None
    deleted_line_counter = 0
    added_line_counter = 0
    current_file_changes = None

    def normal(line):
        nonlocal deleted_line_counter, added_line_counter
        current_chunk["changes"].append({
            "type": "normal",
            "normal": True,
            "ln1": deleted_line_counter,
            "ln2": added_line_counter,
            "content": line
        })
        deleted_line_counter += 1
        added_line_counter += 1
        current_file_changes["old_lines"] -= 1
        current_file_changes["new_lines"] -= 1

    def start(line):
        nonlocal current_file, result
        current_file = {
            "chunks": [],
            "deletions": 0,
            "additions": 0
        }
        result.append(current_file)

    def to_num_of_lines(number):
        return int(number) if number else 1

    def chunk(line, match):
        nonlocal current_file, current_chunk, deleted_line_counter, added_line_counter, current_file_changes
        if not current_file:
            start(line)
        old_start, old_num_lines, new_start, new_num_lines = match.group(1), match.group(2), match.group(
            3), match.group(4)

        deleted_line_counter = int(old_start)
        added_line_counter = int(new_start)
        current_chunk = {
            "content": line,
            "changes": [],
            "old_start": int(old_start),
            "old_lines": to_num_of_lines(old_num_lines),
            "new_start": int(new_start),
            "new_lines": to_num_of_lines(new_num_lines),
        }
        current_file_changes = {
            "old_lines": to_num_of_lines(old_num_lines),
            "new_lines": to_num_of_lines(new_num_lines),
        }
        current_file["chunks"].append(current_chunk)

    def delete(line):
        nonlocal deleted_line_counter
        if not current_chunk:
            return

        current_chunk["changes"].append({
            "type": "del",
            "del": True,
            "ln": deleted_line_counter,
            "content": line
        })
        deleted_line_counter += 1
        current_file["deletions"] += 1
        current_file_changes["old_lines"] -= 1

    def add(line):
        nonlocal added_line_counter
        if not current_chunk:
            return
        current_chunk["changes"].append({
            "type": "add",
            "add": True,
            "ln": added_line_counter,
            "content": line
        })
        added_line_counter += 1
        current_file["additions"] += 1
        current_file_changes["new_lines"] -= 1

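    def eof(line):
        # "\ No newline at end of file" marker: attach it to the preceding change.
        if not current_chunk or not current_chunk["changes"]:
            return
        most_recent_change = current_chunk["changes"][-1]
        marker = {
            "type": most_recent_change["type"],
            most_recent_change["type"]: True,
            "content": line
        }
        # Copy only the line-number keys the previous change actually has:
        # "ln1"/"ln2" for context lines, "ln" for added or deleted lines.
        for key in ("ln1", "ln2", "ln"):
            if key in most_recent_change:
                marker[key] = most_recent_change[key]
        current_chunk["changes"].append(marker)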

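    header_patterns = [
        # Hunk header, e.g. "@@ -12,5 +12,7 @@"; the line counts are optional.
        (re.compile(r"^@@\s+-(\d+),?(\d+)?\s+\+(\d+),?(\d+)?\s@@"), chunk)
    ]

    # Order matters: the end-of-file marker is checked before "-", "+", and context lines.
    content_patterns = [
        (re.compile(r"^\\ No newline at end of file$"), eof),
        (re.compile(r"^-"), delete),
        (re.compile(r"^\+"), add),
        (re.compile(r"^\s+"), normal)
    ]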

    def parse_content_line(line):
        nonlocal current_file_changes
        for pattern, handler in content_patterns:
            match = re.search(pattern, line)
            if match:
                handler(line)
                break
        if current_file_changes["old_lines"] == 0 and current_file_changes["new_lines"] == 0:
            current_file_changes = None

    def parse_header_line(line):
        for pattern, handler in header_patterns:
            match = re.search(pattern, line)
            if match:
                handler(line, match)
                break

    def parse_line(line):
        if current_file_changes:
            parse_content_line(line)
        else:
            parse_header_line(line)

    for line in lines:
        parse_line(line)

    return result
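
The structure returned by `parse_diff` carries the new-file line number of every added line, which is what a line-level comment needs. The sketch below walks that structure and builds the form fields for a GitLab merge request discussion. It is self-contained, so the parsed input is written out by hand in the shape `parse_diff` produces. The `position[...]` field names follow GitLab's merge request discussions API as I understand it; check them against the documentation for your GitLab version. The helper names, file path, and SHA values are placeholders.

```python
from typing import Dict, List


def added_line_numbers(parsed_files: List[dict]) -> List[int]:
    """Collect the new-file line number of every added line."""
    numbers = []
    for file_entry in parsed_files:
        for chunk in file_entry["chunks"]:
            for change in chunk["changes"]:
                if change["type"] == "add":
                    numbers.append(change["ln"])
    return numbers


def discussion_payload(body: str, path: str, new_line: int,
                       base_sha: str, start_sha: str, head_sha: str) -> Dict[str, object]:
    """Form fields for a comment anchored to one added line of a merge request."""
    return {
        "body": body,
        "position[position_type]": "text",
        "position[base_sha]": base_sha,
        "position[start_sha]": start_sha,
        "position[head_sha]": head_sha,
        "position[old_path]": path,
        "position[new_path]": path,
        "position[new_line]": new_line,
    }


# Hand-written stand-in for parse_diff output on "@@ -1,2 +1,3 @@".
parsed = [{
    "chunks": [{
        "content": "@@ -1,2 +1,3 @@",
        "changes": [
            {"type": "normal", "normal": True, "ln1": 1, "ln2": 1, "content": " import os"},
            {"type": "add", "add": True, "ln": 2, "content": "+import sys"},
            {"type": "normal", "normal": True, "ln1": 2, "ln2": 3, "content": " print(os.getcwd())"},
        ],
        "old_start": 1, "old_lines": 2, "new_start": 1, "new_lines": 3,
    }],
    "deletions": 0,
    "additions": 1,
}]

lines_to_comment = added_line_numbers(parsed)
payload = discussion_payload(
    body="Unused import?",
    path="app/main.py",           # placeholder path
    new_line=lines_to_comment[0],
    base_sha="<base sha>",        # placeholders; real values come from the MR's diff refs
    start_sha="<start sha>",
    head_sha="<head sha>",
)
```

Sending the payload is left out on purpose: it needs a project ID, an MR IID, and a token, all of which are specific to each deployment.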

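Code Review in Practice with Large Models and a Knowledge Base

Background

Pain points

Introduction

Features

Glossary

Implementation approach

Flowchart

System architecture

LLMs / choosing an open-source large model

Knowledge base design

Why is a knowledge base needed?

How do we find highly relevant knowledge?

Text Embeddings

Vector Stores

Similarity Search

Loading the knowledge base

Official documentation knowledge base (built-in)

Custom knowledge base: Feishu documents (custom)

Prompt design

Code summary prompt

CR prompt

Commenting on changed code lines

Some reflections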
