Python OCR Text Recognition: pytesseract Installation and Configuration Guide | 极客日志


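An installation, configuration, and usage guide for pytesseract, the Python OCR library. It covers version requirements, manual installation of the Tesseract engine, Chinese language pack setup, and fixes for common errors (such as an unconfigured path or garbled output). Practical examples include ID card recognition, text extraction from screenshots, CAPTCHA recognition, and PDF-to-text conversion. It also covers image preprocessing, OCR parameter tuning (PSM/OEM), multi-language support, and a comparison with other OCR options. pytesseract suits printed-text recognition and works offline.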

Published 2026/3/29 · Updated 2026/5/27


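pytesseract is a Python OCR (optical character recognition) library that extracts text from images. It is a wrapper around the Tesseract OCR engine, so the engine must be installed first. This guide uses Windows for the installation steps.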

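Version requirements

pytesseract depends on the Tesseract OCR engine:

Component	Recommended version	Python version	Notes
pytesseract	0.3.10	3.7+	Python wrapper library
Tesseract-OCR	5.x	-	OCR engine
Chinese language pack	chi_sim	-	Simplified Chinese recognition (optional)
English language pack	eng	-	English recognition (bundled by default)

Note: pytesseract is only a wrapper. The Tesseract OCR engine must be installed before pytesseract can be used.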

Problems you may hit during installation

Problem 1: The Tesseract engine is not installed

import pytesseract
pytesseract.image_to_string('test.jpg')
# TesseractNotFoundError: tesseract is not installed or it's not in your PATH

Only pytesseract was installed; the Tesseract OCR engine itself is missing.

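Problem 2: The path is not configured

import pytesseract
pytesseract.image_to_string('test.jpg')
# pytesseract.pytesseract.TesseractNotFoundError

Tesseract is installed, but its executable is not on PATH, so Python cannot find it. The path has to be set manually.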
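Problems 1 and 2 raise the same exception, so it helps to check where the executable actually lives before calling pytesseract. The following is a minimal sketch using only the standard library. The helper name `find_tesseract` and the candidate install paths are assumptions for illustration, not part of pytesseract.

```python
import os
import shutil

# Typical Windows install locations. These are assumptions; adjust to your setup.
DEFAULT_CANDIDATES = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
)


def find_tesseract(candidates=DEFAULT_CANDIDATES):
    """Return the path of the tesseract executable, or None if it is not found."""
    # First look on PATH, which is what pytesseract relies on by default.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the candidate locations.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


path = find_tesseract()
if path is None:
    print("Tesseract not found: install the engine or add it to PATH")
else:
    print(f"Tesseract found at {path}")
```

If a path is returned, it can be assigned to `pytesseract.pytesseract.tesseract_cmd`.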

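Problem 3: Chinese text comes out garbled

text = pytesseract.image_to_string('chinese_text.jpg')
print(text)
# Output: garbled text or blank

The call falls back to the default English model. Install the Chinese language pack chi_sim.traineddata and pass lang='chi_sim'.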

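Problem 4: Low recognition accuracy

If the output contains many errors, the likely causes are poor image quality or missing preprocessing.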

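Manual installation

Step 1: Install the Tesseract OCR engine

Download: Tesseract OCR engine download page

Pick the latest version (for example tesseract-ocr-w64-setup-5.3.3.20231005.exe), download it, and run the installer.

During installation:

Check "Additional language data" → select "Chinese - Simplified".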


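Step 2: Configure environment variables (optional)

Add the Tesseract install directory (for example C:\Program Files\Tesseract-OCR) to the PATH environment variable so that Python can find the executable without extra configuration.

Step 3: Install pytesseract

pip install pytesseract pillow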

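Step 4: Configure the Tesseract path

import pytesseract
# Point pytesseract at the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

Step 5: Download the Chinese language pack (if not installed)

Place chi_sim.traineddata in the tessdata directory:

C:\Program Files\Tesseract-OCR\tessdata\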
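To confirm that a language pack file is present in the tessdata directory, a quick file check is enough. This is a minimal sketch; the helper name `has_language_pack` is hypothetical, and the directory shown in the usage line is the default Windows location used in this guide.

```python
from pathlib import Path


def has_language_pack(tessdata_dir, lang="chi_sim"):
    """Return True if <lang>.traineddata exists in the given tessdata directory."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


# Example usage with the install location used in this guide
tessdata = r"C:\Program Files\Tesseract-OCR\tessdata"
print("chi_sim installed:", has_language_pack(tessdata, "chi_sim"))
```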

Verify the installation

Basic test

import pytesseract
from PIL import Image

# Set the path (only needed if Tesseract is not on PATH)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Test English recognition
img = Image.open('test_eng.jpg')
text = pytesseract.image_to_string(img)
print(f"English result:\n{text}")

# Test Chinese recognition
img_cn = Image.open('test_chn.jpg')
text_cn = pytesseract.image_to_string(img_cn, lang='chi_sim')
print(f"Chinese result:\n{text_cn}")

Check supported languages

import pytesseract
# List the installed language packs
print(pytesseract.get_languages())
# ['chi_sim', 'eng', ...]
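The language list can be compared against a `lang` string such as `'chi_sim+eng'` before running OCR, which turns a confusing failure into a clear message. This is a sketch; `missing_languages` is a hypothetical helper, and the installed list is passed in so the function does not depend on Tesseract being present.

```python
def missing_languages(lang, installed):
    """Return the language codes in a 'a+b+c' lang string that are not installed."""
    requested = [code for code in lang.split("+") if code]
    return [code for code in requested if code not in installed]


# In real use, `installed` would come from pytesseract.get_languages()
installed = ["eng", "osd"]
missing = missing_languages("chi_sim+eng", installed)
if missing:
    print(f"Missing language packs: {', '.join(missing)}")
```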

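Practical examples

Case 1: ID card recognition

import cv2
import pytesseract

# Read the ID card image; use an ASCII file name, because cv2.imread
# returns None when it cannot read the path
img = cv2.imread('id_card.jpg')
if img is None:
    raise FileNotFoundError('id_card.jpg could not be read')

# Preprocess the image to improve recognition
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # grayscale
blur = cv2.GaussianBlur(gray, (5, 5), 0)  # denoise
_, binary = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarize (Otsu)

# Run OCR
text = pytesseract.image_to_string(binary, lang='chi_sim')
print(f"ID card text:\n{text}")

# Extract specific fields (example); the labels are the Chinese strings printed on the card
for line in text.split('\n'):
    line = line.replace(' ', '')  # Tesseract often inserts spaces between Chinese characters
    if '姓名' in line:
        print(f"Name: {line.split('姓名')[-1].strip()}")
    if '公民身份号码' in line:
        print(f"ID number: {line.split('公民身份号码')[-1].strip()}")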
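Label matching is fragile when OCR misreads the label itself. As a complement, the 18-character ID number can be located by pattern and validated with its check digit (the ISO 7064 MOD 11-2 scheme used by resident ID numbers). This is a sketch that is not part of the original example; the function names are hypothetical.

```python
import re

# Position weights and check codes of the MOD 11-2 scheme for 18-character ID numbers
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CODES = "10X98765432"


def is_valid_id_number(number):
    """Check the format (17 digits plus a digit or X) and the check character."""
    if not re.fullmatch(r"\d{17}[\dXx]", number):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(number[:17], WEIGHTS))
    return CHECK_CODES[total % 11] == number[17].upper()


def extract_id_number(ocr_text):
    """Return the first candidate in the OCR text that passes validation, else None."""
    compact = re.sub(r"\s+", "", ocr_text)  # OCR may split the number with spaces
    # A lookahead lets candidates overlap, so stray leading digits do not hide a match.
    for match in re.finditer(r"(?=(\d{17}[\dXx]))", compact):
        candidate = match.group(1)
        if is_valid_id_number(candidate):
            return candidate.upper()
    return None
```

Calling `extract_id_number(text)` on the OCR output returns the number or `None`. A failed check usually means a digit was misread, which is a useful signal to retry with different preprocessing.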

Case 2: Extracting text from a screenshot

import pytesseract
from PIL import ImageGrab

# Capture the screen
screenshot = ImageGrab.grab()
screenshot.save('screenshot.png')

# Recognize the text
text = pytesseract.image_to_string(screenshot, lang='chi_sim+eng')
print(f"Screenshot text:\n{text}")

# Save to a file
with open('extracted_text.txt', 'w', encoding='utf-8') as f:
    f.write(text)
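Raw OCR output usually carries trailing whitespace, runs of blank lines, and a form feed character that Tesseract typically appends as a page separator. A small cleanup step before saving makes the text file easier to read. This is a sketch; `clean_ocr_text` is a hypothetical helper.

```python
import re


def clean_ocr_text(text):
    """Tidy OCR output: drop form feeds, trailing spaces, and extra blank lines."""
    text = text.replace("\x0c", "")  # page separator emitted by Tesseract
    lines = [line.rstrip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)  # keep at most one blank line in a row
    return cleaned.strip()


print(clean_ocr_text("Hello  \n\n\n\nWorld\n\x0c"))
```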

Case 3: CAPTCHA recognition

import cv2
import numpy as np
import pytesseract

# Read the CAPTCHA image
img = cv2.imread('captcha.jpg')

# Preprocess (the key step for CAPTCHA recognition)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

# Remove small noise with a morphological opening
kernel = np.ones((2, 2), np.uint8)
opening = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

# Recognize, allowing only digits and letters
custom_config = r'--oem 3 --psm 6 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
text = pytesseract.image_to_string(opening, config=custom_config)
print(f"CAPTCHA: {text.strip()}")

Case 4: Batch PDF to text

import pytesseract
from pdf2image import convert_from_path  # requires Poppler to be installed

def pdf_to_text(pdf_path, output_txt):
    # Convert each PDF page to an image
    images = convert_from_path(pdf_path, dpi=300)
    full_text = ""
    for i, img in enumerate(images):
        print(f"Processing page {i+1}...")
        text = pytesseract.image_to_string(img, lang='chi_sim+eng')
        full_text += f"\n========== Page {i+1} ==========\n"
        full_text += text
    # Save the result
    with open(output_txt, 'w', encoding='utf-8') as f:
        f.write(full_text)
    print(f"Done! Text saved to {output_txt}")

# Usage
# pdf_to_text('scanned_document.pdf', 'extracted_text.txt')
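The function above handles one PDF at a time. For a real batch run, a thin loop over a directory is enough. The sketch below takes the converter as a parameter, so it works with `pdf_to_text` or any function with the same `(pdf_path, output_txt)` signature; `batch_convert` and the directory names are hypothetical.

```python
from pathlib import Path


def batch_convert(input_dir, output_dir, convert):
    """Run convert(pdf_path, txt_path) for every .pdf in input_dir.

    Returns a dict mapping each PDF file name to "ok" or a failure message,
    so one bad file does not stop the whole batch.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        txt_path = output_dir / (pdf_path.stem + ".txt")
        try:
            convert(str(pdf_path), str(txt_path))
            results[pdf_path.name] = "ok"
        except Exception as exc:  # record the failure and continue
            results[pdf_path.name] = f"failed: {exc}"
    return results


# Usage (assuming pdf_to_text from the example above is defined):
# print(batch_convert('scans', 'texts', pdf_to_text))
```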

Improving recognition accuracy

1. Image preprocessing

import cv2

# Read the image
img = cv2.imread('test.jpg')

# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Binarize with a fixed threshold
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# Denoise
denoised = cv2.fastNlMeansDenoising(gray, None, 10, 7, 21)

# Resize (enlarging small text may improve recognition)
resized = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
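The fixed threshold of 127 above works only when lighting is even, which is why Case 1 uses `cv2.THRESH_OTSU` instead. To show what Otsu's method does, here is a pure-Python sketch that picks the threshold maximizing the between-class variance of a grayscale histogram. It is an illustration of the idea, not OpenCV's implementation, and `otsu_threshold` is a hypothetical name.

```python
def otsu_threshold(histogram):
    """Return the threshold t that best separates the histogram into two classes.

    histogram: list of pixel counts indexed by gray level (typically 256 entries).
    Pixels with level <= t form one class, pixels with level > t the other.
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    best_t, best_variance = 0, -1.0
    weight0, sum0 = 0, 0
    for t, count in enumerate(histogram):
        weight0 += count          # number of pixels at or below t
        sum0 += t * count         # sum of their gray levels
        weight1 = total - weight0
        if weight0 == 0:
            continue              # no pixels in the lower class yet
        if weight1 == 0:
            break                 # upper class is empty from here on
        mean0 = sum0 / weight0
        mean1 = (sum_all - sum0) / weight1
        variance = weight0 * weight1 * (mean0 - mean1) ** 2
        if variance > best_variance:
            best_t, best_variance = t, variance
    return best_t


# A histogram with dark text pixels around 50 and bright background around 200
hist = [0] * 256
hist[50], hist[200] = 100, 300
print(otsu_threshold(hist))
```

Any threshold between the two peaks separates them equally well; this sketch returns the lowest such value.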

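2. Configure OCR parameters

import pytesseract
from PIL import Image

img = Image.open('test.jpg')

# Common options
# --psm: page segmentation mode
#   3: fully automatic page segmentation (default)
#   6: a single uniform block of text
#   7: a single line of text
#   8: a single word
#   11: sparse text
# --oem: OCR engine mode
#   0: legacy engine only
#   1: neural-network LSTM engine only
#   3: default (based on what is available)

# Example: recognize a single line of text
text = pytesseract.image_to_string(img, lang='chi_sim', config='--psm 7')

# Example: recognize digits only
text = pytesseract.image_to_string(img, config='--psm 6 -c tessedit_char_whitelist=0123456789')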
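Config strings are easy to mistype, so it can be convenient to assemble them in one place. The following is a sketch; `build_config` is a hypothetical helper that only formats the string, and the accepted ranges (PSM 0 to 13, OEM 0 to 3) reflect Tesseract 4 and 5.

```python
def build_config(psm=3, oem=3, whitelist=None):
    """Build a Tesseract config string from PSM, OEM, and an optional whitelist."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if whitelist:
        # The whitelist must not contain spaces, since options are space-separated.
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


print(build_config(psm=7))
print(build_config(psm=6, whitelist="0123456789"))
```

The result is passed as the `config` argument, for example `pytesseract.image_to_string(img, config=build_config(psm=7))`.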

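3. Choose the right language

import pytesseract
from PIL import Image

img = Image.open('test.jpg')

# English only
text = pytesseract.image_to_string(img, lang='eng')

# Simplified Chinese only
text = pytesseract.image_to_string(img, lang='chi_sim')

# Mixed Chinese and English
text = pytesseract.image_to_string(img, lang='chi_sim+eng')

# Traditional Chinese
text = pytesseract.image_to_string(img, lang='chi_tra')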

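Common problems

If TesseractNotFoundError still appears after installation, set the executable path explicitly:

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'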

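How pytesseract compares with other OCR options:

Option	Pros	Cons	Best for
pytesseract	Free, offline, lightweight	Average accuracy, poor on handwriting	Printed text
Baidu OCR API	High accuracy, supports handwriting	Paid, requires network access, call limits	Commercial projects
PaddleOCR	High accuracy, free	Large models, complex setup	High-accuracy needs
EasyOCR	Multi-language support, easy to use	Slower	Multi-language scenarios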

Common features

Getting text positions

import pytesseract
from PIL import Image

img = Image.open('test.jpg')

# Get each piece of text together with its bounding box
data = pytesseract.image_to_data(img, lang='chi_sim', output_type=pytesseract.Output.DICT)
for i, text in enumerate(data['text']):
    if text.strip():
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        print(f"Text: {text}, position: ({x}, {y}), size: {w}x{h}")
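`image_to_data` reports one entry per word, along with `page_num`, `block_num`, `par_num`, and `line_num` columns that say which line each word belongs to. The sketch below regroups words into lines using those columns. `group_words_into_lines` is a hypothetical helper, and it works on the dictionary alone, so it can be tried without running OCR.

```python
from collections import defaultdict


def group_words_into_lines(data, separator=" "):
    """Join the words of an image_to_data dict into lines, in reading order."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip the empty entries that describe blocks and lines
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append(word)
    return [separator.join(words) for _, words in sorted(lines.items())]


# A hand-written dict in the same shape as Output.DICT, for illustration
sample = {
    "text": ["", "Hello", "world", "", "Next", "line"],
    "page_num": [1, 1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 0, 2, 2],
}
print(group_words_into_lines(sample))
```

For Chinese text, passing `separator=""` avoids inserting spaces between characters.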

Saving as PDF

import pytesseract
from PIL import Image

img = Image.open('scan.jpg')
# Produce a searchable PDF as bytes
pdf = pytesseract.image_to_pdf_or_hocr(img, lang='chi_sim', extension='pdf')
with open('output.pdf', 'wb') as f:
    f.write(pdf)

Confidence check

import pytesseract
from PIL import Image

img = Image.open('test.jpg')
data = pytesseract.image_to_data(img, lang='chi_sim', output_type=pytesseract.Output.DICT)
for i, text in enumerate(data['text']):
    # conf may be an int, float, or string depending on the version; -1 marks non-text entries
    confidence = float(data['conf'][i])
    if confidence >= 0 and text.strip():
        print(f"Text: {text}, confidence: {confidence}%")
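Confidence values are most useful as a filter or as a single quality score for the whole image. The sketch below does both on an `image_to_data` dictionary. The helper names and the cutoff of 60 are assumptions chosen for illustration; a suitable cutoff depends on the images.

```python
def confident_words(data, min_conf=60.0):
    """Return (text, confidence) pairs whose confidence is at least min_conf."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # conf may arrive as int, float, or string
        if conf < 0 or not text.strip():
            continue  # -1 marks entries that are not recognized words
        if conf >= min_conf:
            words.append((text, conf))
    return words


def mean_confidence(data):
    """Return the average confidence over recognized words, or None if there are none."""
    scores = [
        float(conf)
        for text, conf in zip(data["text"], data["conf"])
        if float(conf) >= 0 and text.strip()
    ]
    return sum(scores) / len(scores) if scores else None


# A hand-written dict in the same shape as Output.DICT, for illustration
sample = {"text": ["", "Hello", "wor1d"], "conf": ["-1", 96, 41.5]}
print(confident_words(sample))
print(mean_confidence(sample))
```

A low mean confidence is a hint to revisit preprocessing or the PSM setting before trusting the text.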
