Code Review in Practice with LLMs and a Knowledge Base | 极客日志

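This article presents an approach to automated code review (Code Review) built on an open-source large language model combined with a private knowledge base. To address enterprise code security and compliance concerns and the low efficiency of manual CR, the approach deploys the LLM on the internal network, builds a vector database to retrieve internal guideline documents, and uses the LangChain framework to integrate with the Gitlab CI pipeline. The system supports custom knowledge bases that learn a team's conventions and places comments precisely on the changed lines, improving CR quality and efficiency while keeping data secure. It also covers the key technical details (model selection, knowledge base construction, prompt design, and diff parsing) and offers deployment and maintenance advice.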

监控大屏, published 2025/2/7, updated 2026/7/2645 views

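Background

💡 The idea came up during a Code Review, when I asked a large language model which way of writing a piece of code was more elegant. That made me wonder: could AI assist us with Code Review?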

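Pain points

Information security and compliance: sending company code directly to ChatGPT / Claude raises security and compliance problems. To use an external large model, the code has to be desensitized so that only the abstract logic is shared, which often takes even more time.

Less than 20 days after Samsung introduced ChatGPT, it was reported to have had three leaks of confidential chip information.

Low-quality code costs time: the business has at least 10 to 20 MRs a day that need CR. Although MRs go through unit tests and Lint at submission, which filters out some low-level mistakes, other problems (whether the code is reasonable, experience-based judgment, MR-related business logic, and so on) still take a lot of time. Running an automated CR first and a manual CR afterwards can greatly improve CR efficiency.

Team Code Review guidelines are not enforced: in most teams the Code Review guidelines exist only on paper and are passed along by word of mouth, with no tool that strictly applies them.

Introduction

In one sentence: a Code Review practice based on an open-source large model plus a knowledge base, something like a code review assistant (CR Copilot).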

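Features

Complies with company security policy: no code data leaves the internal network, and all inference runs inside it.

🌈 Works out of the box: built on Gitlab CI, it takes only a dozen or so lines of configuration to start running CR on MRs.

🔒 Data security: the open-source large model is deployed privately and isolated from the public internet, so the whole CR process stays on the internal network.

♾ No call limits: it is deployed on an internal platform, so the only cost is GPU rental.

📚 Custom knowledge base: the CR assistant learns from the Feishu documents you provide and uses the matching passages as context, combined with the code changes, when reviewing. This greatly improves CR accuracy and fits the team's own CR guidelines better.

🎯 Comments on changed lines: the CR assistant posts its results as comments on the changed lines of code, and Gitlab CI notifications let you see those comments sooner.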

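Glossary

Term	Definition
CR / Code Review	More and more companies require their engineering teams to perform CodeReview (CR for short) during development. It safeguards code quality while encouraging communication among team members and raising coding standards.
LLM / Large language model	Large Language Models (LLMs) are neural network models in natural language processing that are trained on large amounts of text data; they can generate high-quality text and understand language. Examples include GPT and BERT.
AIGC	Content such as text, images, and video generated with NLP, NLG, computer vision, speech technology, and so on. The full name is Artificial Intelligence Generated Content. It is a mode of production that, following UGC and PGC, uses artificial intelligence to generate content automatically. Advances in the underlying AIGC technology are accelerating the emergence of applications across content types (modalities) and vertical domains.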

Vector Stores

Vector database	URL	GitHub Stars	Language	Cloud
chroma	https://github.com/chroma-core/chroma	8.5K	Python	❌
milvus	https://github.com/milvus-io/milvus	22.8K	Go/Python/C++	✅
pinecone	https://www.pinecone.io/	❌	❌	✅
qdrant	https://github.com/qdrant/qdrant	12.7K	Rust	✅
typesense	https://github.com/typesense/typesense	14.4K	C++	❌
weaviate	https://github.com/weaviate/weaviate	7.4K	Go	✅
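
All of the databases in the table do essentially the same job: they store embedding vectors and return the stored items closest to a query vector. The sketch below illustrates that idea with a linear scan and cosine similarity. `ToyVectorStore` is a hypothetical stand-in, not the API of any database listed above, and the three-dimensional vectors are hand-made placeholders for real embedding-model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (vector, text) pairs and scans them linearly."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query_vector, k=2):
        """Return the k stored texts most similar to query_vector, best first."""
        scored = [(cosine_similarity(query_vector, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = ToyVectorStore()
# Hand-made vectors; a real setup would get these from an embedding model.
store.add([1.0, 0.0, 0.0], "React hooks guideline")
store.add([0.0, 1.0, 0.0], "TypeScript naming convention")
store.add([0.9, 0.1, 0.0], "React component structure")
top_hits = store.search([1.0, 0.05, 0.0], k=2)
```

A real vector database replaces the linear scan with an approximate nearest-neighbor index, which is what makes search practical on large document sets.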

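Official documentation knowledge base (built-in)

Content	Data source
React official documentation	https://react.dev/learn
TypeScript official documentation	https://www.typescriptlang.org/docs/
Rspack official documentation	https://www.rspack.dev/zh/guide/introduction.html
Garfish	https://github.com/web-infra-dev/garfish
Company-internal coding guidelines for Go / Python / Rust and others	…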

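Code summary prompt

def build_summary_prompt(model, language, source_code):
    # The enclosing function's signature was lost in the source; this name and its
    # parameters are reconstructed from the variables the body uses.
    # The prompt text stays in Chinese, as in the original. In English it asks the model to
    # list the libraries and modules the code uses: 3 to 6 specific, non-overlapping items
    # closely related to the code, as a plain list without explanations, answered in Chinese.
    prefix = "user: " if model == "chatglm2" else "<s>Human: "
    suffix = (
        "assistant(用中文): let's think step by step."
        if model == "chatglm2"
        else "\n</s><s>Assistant(用中文): let's think step by step."
    )
    return f"""{prefix}根据这段 {language} 代码，列出关于这段 {language} 代码用到的工具库、模块包。
    {language} 代码:
    ```{language}
    {source_code}
    ```
    请注意：
    - 知识列表中的每一项都不要有类似或者重复的内容
    - 列出的内容要和代码密切相关
    - 最少列出 3 个，最多不要超过 6 个
    - 知识列表中的每一项要具体
    - 列出列表，不要对工具库、模块做解释
    - 输出中文
    {suffix}"""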
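
The summary prompt asks for a short list of libraries and modules, so the reply has to be turned into individual search terms before it can be matched against the knowledge base. The sketch below is one way to do that, assuming the model answers with a bulleted or numbered list, possibly preceded by some reasoning text. `parse_knowledge_list` and the sample reply are hypothetical; the original post does not show this step.

```python
import re


def parse_knowledge_list(reply, max_items=6):
    """Extract list items from a model reply, dropping bullets, numbering, and duplicates."""
    items = []
    for raw_line in reply.splitlines():
        # Accept lines starting with "-", "*", "•", or a number such as "1." / "2)" / "3、".
        match = re.match(r"^\s*(?:[-*•]|\d+[.)、])\s*(.+?)\s*$", raw_line)
        if not match:
            continue  # skip reasoning text that is not a list item
        item = match.group(1)
        if item not in items:
            items.append(item)
    # The prompt caps the list at 6 items; enforce the same cap here.
    return items[:max_items]


# Invented reply, used only to exercise the parser.
sample_reply = "let's think step by step.\n1. React\n2. lodash\n- React\n3、axios\n"
queries = parse_knowledge_list(sample_reply)
```

Each resulting term can then serve as a query against the vector store.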

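CR prompt

def build_cr_prompt(model, language, context, diff_code):
    # Reconstructed wrapper: the source shows only the two bare templates below,
    # labeled "llama2" and "chatglm2".
    if model == "chatglm2":
        # Chinese template, kept as in the original. In English: "[Instruction] Using the
        # provided context, briefly review the {language} code changes and give short review
        # feedback and suggestions; point out any bugs and improvements (no more than 6).
        # [Known information]: {context} [Changed code]: {diff_code}"
        return f"""user: 【指令】请根据所提供的上下文信息来简要审查{language} 变更代码，进行简短的代码审查和建议，变更代码有任何 bug 缺陷和改进建议请指出（不超过 6 条）。
    【已知信息】：{context}

    【变更代码】：{diff_code}

    assistant: """
    # llama2
    return f"""Human: please briefly review the {language} code changes by learning the provided context to do a brief code review feedback and suggestions. if any bug risk and improvement suggestion are welcome(no more than six)
    <context>
    {context}
    </context>

    <code_changes>
    {diff_code}
    </code_changes>\n</s><s>Assistant: """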
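
Both CR templates have a `{context}` slot for the passages retrieved from the knowledge base. Since the model's input window is limited, the retrieved passages usually need to be trimmed before they are inserted. The sketch below joins passages in relevance order until a character budget is reached. `build_context` and the `max_chars` value are assumptions for illustration; the original post does not say how the context is assembled.

```python
def build_context(snippets, max_chars=1500):
    """Join retrieved snippets (most relevant first) without exceeding max_chars."""
    selected = []
    used = 0
    for snippet in snippets:
        text = snippet.strip()
        if not text:
            continue  # ignore empty or whitespace-only snippets
        # The extra 2 characters are the blank-line separator between snippets.
        cost = len(text) + (2 if selected else 0)
        if used + cost > max_chars:
            break  # keep the most relevant snippets and drop the rest
        selected.append(text)
        used += cost
    return "\n\n".join(selected)


# Invented snippets; the second one is blank and the third does not fit the budget.
context = build_context(
    ["Prefer const over let.", "  ", "Avoid any in TypeScript."],
    max_chars=40,
)
```

A character budget is only a rough proxy for tokens; counting with the model's own tokenizer would be more precise.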

Commenting on changed lines

import re


def parse_diff(diff_text):
    """Parse unified-diff text into a list of files, each with chunks of line changes."""
    if not isinstance(diff_text, str) or not diff_text.strip():
        return []

    lines = diff_text.split("\n")
    result = []
    current_file = None
    current_chunk = None
    deleted_line_counter = 0
    added_line_counter = 0
    # Lines still expected in the current chunk; None means a chunk header comes next.
    current_file_changes = None

    def normal(line):
        # Context line: present on both sides, so both counters advance.
        nonlocal deleted_line_counter, added_line_counter
        current_chunk["changes"].append({
            "type": "normal",
            "normal": True,
            "ln1": deleted_line_counter,
            "ln2": added_line_counter,
            "content": line
        })
        deleted_line_counter += 1
        added_line_counter += 1
        current_file_changes["old_lines"] -= 1
        current_file_changes["new_lines"] -= 1

    def start(line):
        nonlocal current_file
        current_file = {
            "chunks": [],
            "deletions": 0,
            "additions": 0
        }
        result.append(current_file)

    def to_num_of_lines(number):
        # A header such as "@@ -3 +3 @@" omits the count, which means one line.
        return int(number) if number else 1

    def chunk(line, match):
        nonlocal current_chunk, deleted_line_counter, added_line_counter, current_file_changes
        if not current_file:
            start(line)
        old_start, old_num_lines, new_start, new_num_lines = match.groups()
        deleted_line_counter = int(old_start)
        added_line_counter = int(new_start)
        current_chunk = {
            "content": line,
            "changes": [],
            "old_start": int(old_start),
            "old_lines": to_num_of_lines(old_num_lines),
            "new_start": int(new_start),
            "new_lines": to_num_of_lines(new_num_lines),
        }
        current_file_changes = {
            "old_lines": to_num_of_lines(old_num_lines),
            "new_lines": to_num_of_lines(new_num_lines),
        }
        current_file["chunks"].append(current_chunk)

    def delete(line):
        nonlocal deleted_line_counter
        if not current_chunk:
            return
        current_chunk["changes"].append({
            "type": "del",
            "del": True,
            "ln": deleted_line_counter,
            "content": line
        })
        deleted_line_counter += 1
        current_file["deletions"] += 1
        current_file_changes["old_lines"] -= 1

    def add(line):
        nonlocal added_line_counter
        if not current_chunk:
            return
        current_chunk["changes"].append({
            "type": "add",
            "add": True,
            "ln": added_line_counter,
            "content": line
        })
        added_line_counter += 1
        current_file["additions"] += 1
        current_file_changes["new_lines"] -= 1

    def eof(line):
        # "\ No newline at end of file" inherits type and line numbers from the previous change.
        if not current_chunk or not current_chunk["changes"]:
            return
        most_recent_change = current_chunk["changes"][-1]
        # .get() because "normal" changes carry ln1/ln2 while "add"/"del" carry only ln.
        current_chunk["changes"].append({
            "type": most_recent_change["type"],
            most_recent_change["type"]: True,
            "ln1": most_recent_change.get("ln1"),
            "ln2": most_recent_change.get("ln2"),
            "ln": most_recent_change.get("ln"),
            "content": line
        })

    # The literal "+" must be escaped: "\s++" is a possessive quantifier in Python 3.11+.
    header_patterns = [
        (re.compile(r"^@@\s+-(\d+),?(\d+)?\s+\+(\d+),?(\d+)?\s@@"), chunk)
    ]

    content_patterns = [
        (re.compile(r"^\\ No newline at end of file$"), eof),
        (re.compile(r"^-"), delete),
        (re.compile(r"^\+"), add),
        (re.compile(r"^\s+"), normal)
    ]

    def parse_content_line(line):
        nonlocal current_file_changes
        for pattern, handler in content_patterns:
            if pattern.search(line):
                handler(line)
                break
        # Once the chunk's announced line counts are used up, expect a header again.
        if current_file_changes["old_lines"] == 0 and current_file_changes["new_lines"] == 0:
            current_file_changes = None

    def parse_header_line(line):
        for pattern, handler in header_patterns:
            match = pattern.search(line)
            if match:
                handler(line, match)
                break

    def parse_line(line):
        if current_file_changes:
            parse_content_line(line)
        else:
            parse_header_line(line)

    for line in lines:
        parse_line(line)

    return result
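
The output of `parse_diff` carries what is needed to place a comment on a specific changed line: each `add` change records its line number in the new file under `ln`. The sketch below collects those line numbers and builds the form fields for a GitLab merge request discussion. It assumes GitLab's merge request discussions endpoint with its `position[...]` fields; the helper names, file path, and SHA strings are placeholders, and no request is actually sent. In a real pipeline the three SHAs would come from the merge request's `diff_refs`.

```python
def added_line_numbers(parsed_files):
    """Collect new-file line numbers of added lines from parse_diff-style output."""
    numbers = []
    for parsed_file in parsed_files:
        for chunk in parsed_file["chunks"]:
            for change in chunk["changes"]:
                if change["type"] == "add":
                    numbers.append(change["ln"])
    return numbers


def build_discussion_payload(comment, file_path, new_line, base_sha, start_sha, head_sha):
    """Form fields for creating a merge request discussion on one added line."""
    return {
        "body": comment,
        "position[position_type]": "text",
        "position[base_sha]": base_sha,
        "position[start_sha]": start_sha,
        "position[head_sha]": head_sha,
        "position[new_path]": file_path,
        "position[old_path]": file_path,
        "position[new_line]": new_line,
    }


# Hand-written sample in the shape parse_diff returns.
sample = [{
    "chunks": [{"changes": [
        {"type": "normal", "ln1": 10, "ln2": 10, "content": " a = 1"},
        {"type": "del", "ln": 11, "content": "-b = 2"},
        {"type": "add", "ln": 11, "content": "+b = 3"},
        {"type": "add", "ln": 12, "content": "+c = 4"},
    ]}],
    "deletions": 1,
    "additions": 2,
}]

lines = added_line_numbers(sample)
# Placeholder path and SHAs, for illustration only.
payload = build_discussion_payload(
    "Consider naming this constant.", "src/app.py", lines[0],
    "BASE_SHA", "START_SHA", "HEAD_SHA",
)
```

Anchoring the comment to `new_line` only is the form used for added lines; a comment on a deleted line would use the old-file line number instead.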
