Python OCR Text Recognition: Installing, Configuring, and Using pytesseract | 极客日志


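pytesseract is a widely used OCR library for Python that depends on the Tesseract engine. This article walks through manual installation on Windows: downloading the engine, configuring environment variables, adding language packs, and specifying the executable path. It also covers troubleshooting common errors, image preprocessing techniques, and practical Chinese and English recognition examples (ID cards, screenshots, CAPTCHAs, PDF conversion), plus advanced usage such as retrieving text positions and checking confidence scores. Finally, it compares the strengths and weaknesses of different OCR options to help developers pick one that fits their scenario.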

林间仙子 · Published 2026/3/26 · Updated 2026/7/20 · 30 views


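pytesseract is a commonly used optical character recognition (OCR) library for Python; it wraps the Tesseract OCR engine. Getting it to work on Windows comes down to installing the Tesseract engine correctly and making sure Python can find it, either through the PATH environment variable or through an explicit path in code.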

Version Requirements

pytesseract is only a Python wrapper; it cannot work without the underlying Tesseract OCR engine.

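Component	Recommended version	Python version	Notes
pytesseract	0.3.10+	3.7+	Python wrapper library
Tesseract-OCR	5.x	-	OCR engine
Chinese language pack	chi_sim	-	Simplified Chinese recognition (optional)
English language pack	eng	-	English recognition (included by default)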

Note: installing pytesseract alone is not enough; the Tesseract OCR engine must be installed first.
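To check the installed engine against the recommended 5.x line, a script can run the engine and parse its version banner. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the banner format (a first line such as "tesseract 5.3.3" or "tesseract v5.3.3...") is an assumption that may differ between builds.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(output):
    """Return (major, minor) parsed from a version banner, or None if no match."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_tesseract_version(executable="tesseract"):
    """Run the engine with --version; return None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some builds print the banner to stderr, so check both streams.
    return parse_tesseract_version(result.stdout + result.stderr)


version = installed_tesseract_version()
if version is None:
    print("Tesseract not found on PATH")
elif version[0] < 5:
    print(f"Tesseract {version[0]}.{version[1]} found; 5.x is recommended")
else:
    print(f"Tesseract {version[0]}.{version[1]} OK")
```

pytesseract also offers get_tesseract_version(), which reports the version of whichever executable the library is configured to call.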

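Troubleshooting Common Installation Problems

Newcomers tend to run into the following kinds of errors; knowing them in advance saves a good deal of debugging time.

1. Tesseract engine not installed

The error message usually contains "TesseractNotFoundError: tesseract is not installed". It means either that the Tesseract program is not on the system at all, or that Python cannot find it.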

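2. Path not configured

Even when Tesseract is installed, Python cannot call it unless its directory is on the system PATH. In that case, set tesseract_cmd to the full path of the executable by hand.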
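The first two problems both come down to whether the executable can be located. The sketch below checks PATH first and then falls back to a list of candidate locations. It assumes the default Windows install path; find_tesseract is a hypothetical helper, not part of pytesseract.

```python
import os
import shutil

# Assumed default install location on Windows
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(candidates=(DEFAULT_WINDOWS_PATH,), name="tesseract"):
    """Return a path to the Tesseract executable, or None if none is found."""
    on_path = shutil.which(name)
    if on_path:
        return on_path
    # Not on PATH: try each known location in order
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found; install it or supply the correct path")
else:
    print(f"Use this value for tesseract_cmd: {tesseract_path}")
```

The returned path can then be assigned to pytesseract.pytesseract.tesseract_cmd before the first OCR call.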

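3. Garbled Chinese output

If an image contains Chinese but the output is garbled or blank, the chi_sim.traineddata language pack is usually missing. Download it separately and place it in the tessdata directory. Also confirm that the call passes lang='chi_sim', since the default language is English.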
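A quick way to rule out a missing language pack is to look for the .traineddata files directly. This is a sketch under the assumption that the tessdata directory sits in the default install location; missing_language_packs is a hypothetical helper name.

```python
import os


def missing_language_packs(lang, tessdata_dir):
    """Return the codes in a lang string (e.g. 'chi_sim+eng') that have no .traineddata file."""
    codes = [code for code in lang.split("+") if code]
    return [
        code
        for code in codes
        if not os.path.isfile(os.path.join(tessdata_dir, f"{code}.traineddata"))
    ]


# Assumed default tessdata location on Windows
tessdata = r"C:\Program Files\Tesseract-OCR\tessdata"
missing = missing_language_packs("chi_sim+eng", tessdata)
if missing:
    print(f"Missing language packs: {', '.join(missing)}")
else:
    print("All requested language packs are present")
```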

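4. Low recognition accuracy

When results are full of errors, the cause is usually not the library but a poor-quality or low-contrast image, or a lack of preprocessing. Convert the image to grayscale and binarize it before running OCR.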

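Manual Installation Guide

One-click installers exist, but working through the manual installation makes the underlying pieces easier to understand and simplifies later maintenance.

Step 1: Install the Tesseract OCR engine

Download the latest installer from the official GitHub repository (for example, tesseract-ocr-w64-setup-5.3.3.exe).

During installation, check "Additional language data" and select "Chinese (Simplified)" in the list. This step is essential: without it, Chinese cannot be recognized until the language pack is added by hand. Note the installation path, which defaults to C:\Program Files\Tesseract-OCR.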

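Step 2: Configure the environment variable (optional)

To call Tesseract directly from the command line, add its installation directory to the system PATH environment variable.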
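For a script that cannot rely on the system-wide setting, a process-local alternative is to prepend the directory to PATH at startup. This is a minimal sketch, assuming the default install directory from Step 1; add_to_path is a hypothetical helper, and the change affects only the current Python process and its children.

```python
import os


def add_to_path(directory, environ=None):
    """Prepend directory to PATH in the given environment mapping unless already present."""
    env = os.environ if environ is None else environ
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    # Normalize before comparing so that case and slash differences do not cause duplicates
    existing = {os.path.normcase(os.path.normpath(p)) for p in parts}
    target = os.path.normcase(os.path.normpath(directory))
    if target not in existing:
        parts.insert(0, directory)
        env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


# Assumed default install directory
add_to_path(r"C:\Program Files\Tesseract-OCR")
```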


Step 3: Install the Python dependencies

pip install pytesseract pillow

Step 4: Specify the path in code

import pytesseract

# Point pytesseract at the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

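Step 5: Add the Chinese language pack

If the Chinese language pack was not selected during installation, download chi_sim.traineddata and place it in the tessdata directory:

C:\Program Files\Tesseract-OCR\tessdata\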

Verifying the Installation

import pytesseract
from PIL import Image

# If Tesseract is not on PATH, uncomment the next line
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Test English recognition
img = Image.open('test_eng.jpg')
text = pytesseract.image_to_string(img)
print(f"English result:\n{text}")

# Test Chinese recognition
img_cn = Image.open('test_chn.jpg')
text_cn = pytesseract.image_to_string(img_cn, lang='chi_sim')
print(f"Chinese result:\n{text_cn}")

import pytesseract

# List the language packs Tesseract can see
print(pytesseract.get_languages())
# Example output: ['chi_sim', 'eng', ...]

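Practical Examples

Example 1: ID card information extraction

import pytesseract
import cv2

# Read the ID card image (cv2.imread can fail on non-ASCII file names on Windows,
# so an ASCII name is used here)
img = cv2.imread('id_card.jpg')

# Preprocess the image to improve recognition
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (5, 5), 0)
_, binary = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Run OCR
text = pytesseract.image_to_string(binary, lang='chi_sim')
print(f"ID card text:\n{text}")

# Simple extraction of specific fields by their printed labels
for line in text.split('\n'):
    if '姓名' in line:
        print(f"Name: {line.split('姓名')[-1].strip()}")
    # Cards print the label 公民身份号码; 身份证号 covers other forms
    for label in ('公民身份号码', '身份证号'):
        if label in line:
            print(f"ID number: {line.split(label)[-1].strip()}")
            break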
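Label-based splitting breaks when OCR misreads the label or inserts spaces between digits. A pattern match on the number itself is often more forgiving. The sketch below assumes the 18-character format (17 digits followed by a digit or X); extract_id_number is a hypothetical helper and the sample text is made up for illustration. It does not validate the check digit.

```python
import re

# 17 digits followed by a digit or X, not embedded in a longer digit run
ID_NUMBER_PATTERN = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def extract_id_number(text):
    """Return the first 18-character ID number found in OCR text, or None."""
    # OCR often inserts spaces between characters, so strip whitespace first
    compact = re.sub(r"\s+", "", text)
    match = ID_NUMBER_PATTERN.search(compact)
    return match.group(0).upper() if match else None


sample = "姓名 张三\n公民身份号码 11010519491231002X"
print(extract_id_number(sample))
```

Because whitespace is removed before matching, digits from neighboring lines can run together; on noisy scans it may be safer to apply the function line by line.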

Example 2: Extracting text from a screenshot

import pytesseract
from PIL import ImageGrab

# Capture the screen
screenshot = ImageGrab.grab()
screenshot.save('screenshot.png')

# Recognize the text (mixed Chinese and English)
text = pytesseract.image_to_string(screenshot, lang='chi_sim+eng')
print(f"Screenshot text:\n{text}")

# Save the result to a file
with open('extracted_text.txt', 'w', encoding='utf-8') as f:
    f.write(text)

Example 3: CAPTCHA recognition

import pytesseract
import cv2
import numpy as np

# Read the CAPTCHA image
img = cv2.imread('captcha.jpg')

# Preprocess: grayscale, then fixed-threshold binarization
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

# Remove small noise with a morphological opening
kernel = np.ones((2, 2), np.uint8)
opening = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

# Recognize, allowing only digits and letters
custom_config = r'--oem 3 --psm 6 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
text = pytesseract.image_to_string(opening, config=custom_config)
print(f"CAPTCHA: {text.strip()}")

Example 4: Batch PDF to text

import pytesseract
from pdf2image import convert_from_path  # requires the Poppler utilities

def pdf_to_text(pdf_path, output_txt):
    # Render each PDF page to an image
    images = convert_from_path(pdf_path, dpi=300)
    pages = []

    for i, img in enumerate(images):
        print(f"Processing page {i+1}...")
        text = pytesseract.image_to_string(img, lang='chi_sim+eng')
        pages.append(f"\n========== Page {i+1} ==========\n{text}")

    # Save the result
    with open(output_txt, 'w', encoding='utf-8') as f:
        f.write("".join(pages))
    print(f"Done. Text saved to {output_txt}")

# Usage
# pdf_to_text('scanned_document.pdf', 'extracted_text.txt')

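Improving Recognition Accuracy

1. Image preprocessing

import cv2

img = cv2.imread('test.jpg')
# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Binarize with a fixed threshold
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
# Denoise with non-local means
denoised = cv2.fastNlMeansDenoising(gray, None, 10, 7, 21)
# Upscale 2x, which helps with small text
resized = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)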
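A fixed threshold such as 127 only suits images with even lighting, which is why the ID card example uses cv2.THRESH_OTSU instead. To show what that flag computes, here is a pure-Python sketch of Otsu's method on a flat list of 8-bit pixel values. It is for illustration only; OpenCV's implementation is much faster and may pick a different value when several thresholds tie.

```python
def otsu_threshold(pixels):
    """Return the threshold (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        weight_bg += hist[t]           # pixels with value <= t
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg  # pixels with value > t
        if weight_fg == 0:
            break
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map values above the threshold to 255 and the rest to 0."""
    return [255 if v > threshold else 0 for v in pixels]


# Two clearly separated brightness groups
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
print(t, binarize(pixels, t)[:3], binarize(pixels, t)[-3:])
```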

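2. Configure OCR parameters

import pytesseract
from PIL import Image

img = Image.open('test.jpg')

# --psm: Page Segmentation Mode
# 3: fully automatic page segmentation (default)
# 6: a single uniform block of text
# 7: a single line of text
# 8: a single word
# 11: sparse text

# --oem: OCR Engine Mode
# 0: Legacy engine only
# 1: Neural-network LSTM engine only
# 3: default (based on what is available)

# Example: recognize a single line of text
text = pytesseract.image_to_string(img, lang='chi_sim', config='--psm 7')

# Example: recognize digits only
text = pytesseract.image_to_string(img, config='--psm 6 -c tessedit_char_whitelist=0123456789')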
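Hand-written config strings are easy to mistype. One option is a small helper that assembles and validates them. This is a sketch only; build_config is a hypothetical name, not a pytesseract function, and it assumes the whitelist contains no spaces.

```python
def build_config(psm=None, oem=None, whitelist=None):
    """Assemble a Tesseract config string from optional settings."""
    parts = []
    if oem is not None:
        if oem not in range(4):
            raise ValueError("oem must be between 0 and 3")
        parts.append(f"--oem {oem}")
    if psm is not None:
        if psm not in range(14):
            raise ValueError("psm must be between 0 and 13")
        parts.append(f"--psm {psm}")
    if whitelist:
        # Restrict recognition to the listed characters
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


print(build_config(psm=7))
print(build_config(psm=6, oem=3, whitelist="0123456789"))
```

The resulting string is passed as the config argument of image_to_string, exactly like the literal strings in the examples.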

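3. Choose the right language

import pytesseract
from PIL import Image

img = Image.open('test.jpg')

# English only
text = pytesseract.image_to_string(img, lang='eng')

# Chinese only
text = pytesseract.image_to_string(img, lang='chi_sim')

# Mixed Chinese and English
text = pytesseract.image_to_string(img, lang='chi_sim+eng')

# Traditional Chinese (requires chi_tra.traineddata)
text = pytesseract.image_to_string(img, lang='chi_tra')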

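Common Questions (FAQ)

How pytesseract compares with other OCR options:

Option	Pros	Cons	Best suited for
pytesseract	Free, offline, lightweight	Mediocre accuracy; poor on handwriting	Printed text
Baidu OCR API	High accuracy; supports handwriting	Paid; requires an internet connection	Commercial projects
PaddleOCR	High accuracy; free	Large models; complex setup	High-accuracy needs
EasyOCR	Multilingual; easy to use	Relatively slow	Multilingual scenarios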

Common Feature Extensions

Getting text positions

import pytesseract
from PIL import Image

img = Image.open('test.jpg')
data = pytesseract.image_to_data(img, lang='chi_sim', output_type=pytesseract.Output.DICT)

# Each index describes one detected element; skip the empty ones
for i, text in enumerate(data['text']):
    if text.strip():
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        print(f"Text: {text}, position: ({x}, {y}), size: {w}x{h}")
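The same dictionary also carries page_num, block_num, par_num, and line_num for every entry, which makes it possible to reassemble words into lines. The sketch below works on any dictionary shaped like the image_to_data output; group_words_into_lines is a hypothetical helper and the sample data is made up so the example runs without an image.

```python
from collections import defaultdict


def group_words_into_lines(data, separator=" "):
    """Join recognized words that share the same page, block, paragraph, and line."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # structural rows carry empty text
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append(word)
    # Sorting the keys restores reading order as reported by Tesseract
    return [separator.join(words) for _, words in sorted(lines.items())]


# Made-up data in the shape returned with output_type=Output.DICT
sample_data = {
    "text": ["", "Hello", "world", "", "Second", "line"],
    "page_num": [1, 1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 2, 2, 2],
}
print(group_words_into_lines(sample_data))
```

For Chinese text, passing separator="" avoids inserting spaces between characters.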

Saving as PDF

import pytesseract
from PIL import Image

img = Image.open('scan.jpg')
# Returns the bytes of a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr(img, lang='chi_sim', extension='pdf')

with open('output.pdf', 'wb') as f:
    f.write(pdf)

Confidence checking

import pytesseract
from PIL import Image

img = Image.open('test.jpg')
data = pytesseract.image_to_data(img, lang='chi_sim', output_type=pytesseract.Output.DICT)

for i, text in enumerate(data['text']):
    # conf may arrive as int, float, or str; -1 marks rows that are not words
    confidence = float(data['conf'][i])
    if confidence >= 0 and text.strip():
        print(f"Text: {text}, confidence: {confidence}%")
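Confidence scores become useful once they drive a decision, for example dropping doubtful words or flagging a page for manual review. The sketch below shows both. The helper names and the cutoff of 60 are assumptions for illustration, and the sample data is made up.

```python
def filter_by_confidence(data, min_conf=60.0):
    """Return (word, confidence) pairs whose confidence is at least min_conf."""
    kept = []
    for word, conf in zip(data["text"], data["conf"]):
        score = float(conf)  # tolerate int, float, or str values
        if word.strip() and score >= min_conf:
            kept.append((word, score))
    return kept


def mean_confidence(data):
    """Average confidence over recognized words, or None if there are none."""
    scores = [
        float(conf)
        for word, conf in zip(data["text"], data["conf"])
        if word.strip() and float(conf) >= 0
    ]
    return sum(scores) / len(scores) if scores else None


# Made-up data in the shape returned with output_type=Output.DICT
sample_conf = {"text": ["", "Hello", "wor1d"], "conf": ["-1", 96, 41.5]}
print(filter_by_confidence(sample_conf))
print(mean_confidence(sample_conf))
```

A low page-level average usually suggests revisiting preprocessing or the chosen language rather than trusting individual words.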

