The Python library unstructured: efficiently converting PDF, Word, and other unstructured data | 极客日志


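The unstructured library converts unstructured data such as PDF and Word files into structured elements, with support for partitioning, cleaning, and formatting. This article covers installation and configuration, core API usage, and integration with LangChain, and is relevant to RAG systems and machine learning data preprocessing.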

内存管理 · Published 2026/3/21 · Updated 2026/7/20 · 40 views


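unstructured is an open-source Python library for processing and preprocessing unstructured data (PDF, Word documents, HTML, images, and so on) and converting it into structured formats, a step that matters for downstream machine learning (ML) and large language model (LLM) tasks. It provides modular components (called "bricks") for partitioning, cleaning, and formatting documents, and is widely used in data pipelines, RAG (retrieval-augmented generation) systems, and document analysis.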

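Installation and environment setup

Basic dependencies

Make sure your Python version is 3.8 or later (3.9+ recommended); newer releases of the library may require a more recent interpreter. The core dependencies include beautifulsoup4 (HTML parsing), lxml (XML processing), and nltk (text processing).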
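Before installing, it can help to check the interpreter version and see which of the dependencies named above are already importable. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical, and the minimum version is simply the one stated above, so adjust it to the release you install.

```python
import sys
from importlib.util import find_spec

# Minimum version taken from the text above; adjust to the release you install.
MIN_PYTHON = (3, 8)
# Import names of the packages mentioned above (bs4 is beautifulsoup4's import name).
CORE_MODULES = ["bs4", "lxml", "nltk", "unstructured"]


def check_environment(min_python=MIN_PYTHON, modules=CORE_MODULES):
    """Report whether the interpreter is new enough and which modules are importable."""
    report = {"python_ok": sys.version_info[:2] >= tuple(min_python)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```

Anything reported as missing should be pulled in by the pip command that follows.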

pip install unstructured

System-level dependencies

To process PDFs or images, the local environment needs a few additional tools:

Tesseract: OCR.
Poppler: PDF processing.
Pandoc: EPUB, RTF, and similar formats.
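A quick way to see which of these tools are already available is to look for their executables on PATH. This is a sketch under assumptions: the executable names below are the common ones (Poppler ships several command-line tools, and `pdftoppm` is used here as a representative), and `find_system_tools` is a hypothetical helper, not part of unstructured.

```python
import shutil

# Executable names are assumptions; Poppler has no single binary, so pdftoppm stands in for it.
SYSTEM_TOOLS = {
    "Tesseract (OCR)": "tesseract",
    "Poppler (PDF)": "pdftoppm",
    "Pandoc (EPUB, RTF)": "pandoc",
}


def find_system_tools(tools=SYSTEM_TOOLS):
    """Map each tool label to its resolved path, or None if it is not on PATH."""
    return {label: shutil.which(exe) for label, exe in tools.items()}


if __name__ == "__main__":
    for label, path in find_system_tools().items():
        print(f"{label}: {path or 'not found'}")
```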

libmagic is also needed for file type detection:

# Ubuntu/Debian
sudo apt-get install libmagic1
# macOS
brew install libmagic

For specific document types, you can install optional extras to keep the footprint small:

pip install "unstructured[docx]"
pip install "unstructured[local-inference]"  # full dependencies for PDF and image processing

Docker deployment

If you would rather not clutter your local environment, you can use the official Docker image:

docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest
docker exec -it unstructured bash

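Core features and usage

unstructured processes documents through "bricks", which fall into three groups: partitioning, cleaning, and staging (formatting).

1. Partitioning

This is the most important step: it splits a document into structured elements such as titles, paragraphs, lists, and tables. The partition function detects the file type automatically and dispatches to the matching handler.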

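from unstructured.partition.auto import partition

# Detect the file type automatically and parse the PDF
elements = partition(filename="example.pdf")

# Print the category and text of each element
for element in elements:
    print(f"{element.category}: {element.text}")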
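Once a document is partitioned, a common next step is to group the flat element list into sections. The sketch below works on plain `{"type", "text"}` records rather than on library objects, so it runs without unstructured installed; with real elements, such records could be built as `{"type": el.category, "text": el.text}`. The function name `group_by_title` and the sample data are illustrative only.

```python
def group_by_title(records):
    """Group element texts under the most recent Title element."""
    sections = []
    current = {"title": None, "body": []}
    for record in records:
        if record["type"] == "Title":
            # A new title closes the previous section, if it holds anything.
            if current["title"] is not None or current["body"]:
                sections.append(current)
            current = {"title": record["text"], "body": []}
        else:
            current["body"].append(record["text"])
    if current["title"] is not None or current["body"]:
        sections.append(current)
    return sections


# Hand-written sample; real input would come from partition().
sample = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "This report covers Q1."},
    {"type": "ListItem", "text": "Revenue grew."},
    {"type": "Title", "text": "Outlook"},
    {"type": "NarrativeText", "text": "We expect growth to continue."},
]
sections = group_by_title(sample)
```

Text that appears before the first title ends up in a section whose title is None, so nothing is dropped.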

For PDFs, partition_pdf can also be called directly:

from unstructured.partition.pdf import partition_pdf

# strategy="hi_res" uses computer vision and OCR to extract tables
elements = partition_pdf(filename="example.pdf", strategy="hi_res")
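Because hi_res is the slowest option, it can be worth choosing the strategy per document. The helper below is a hypothetical heuristic, not library behavior: it assumes you already know whether tables matter and whether the PDF has an embedded text layer, and it returns one of the strategy names that partition_pdf accepts ("fast", "hi_res", "ocr_only").

```python
def choose_pdf_strategy(needs_tables: bool, has_text_layer: bool) -> str:
    """Pick a partition_pdf strategy name from two coarse facts about the document."""
    if needs_tables:
        # Layout analysis is what recovers table structure.
        return "hi_res"
    if not has_text_layer:
        # Scanned pages carry no extractable text, so OCR is the only source.
        return "ocr_only"
    # Embedded text can be read directly, which is the cheapest path.
    return "fast"
```

The result would then be passed along as `partition_pdf(filename="example.pdf", strategy=choose_pdf_strategy(...))`.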

2. Cleaning

Cleaning bricks normalize the extracted text, for example by lowercasing it and stripping punctuation:

from unstructured.cleaners.core import clean, remove_punctuation

text = "Hello, World!!! This is a test..."
cleaned_text = clean(text, lowercase=True)
cleaned_text = remove_punctuation(cleaned_text)
print(cleaned_text)  # Output: hello world this is a test
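Cleaners are plain text-to-text functions, so they compose naturally into a pipeline. The sketch below shows that idea with standard-library stand-ins; these functions are simplified illustrations and not the library's implementations (in particular, only ASCII punctuation is handled here), and `apply_cleaners` is a hypothetical helper.

```python
import re
import string


def lowercase(text: str) -> str:
    return text.lower()


def strip_ascii_punctuation(text: str) -> str:
    # Only ASCII punctuation is removed in this simplified stand-in.
    return text.translate(str.maketrans("", "", string.punctuation))


def collapse_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def apply_cleaners(text: str, cleaners) -> str:
    """Apply each cleaner in order, feeding the output of one into the next."""
    for cleaner in cleaners:
        text = cleaner(text)
    return text


result = apply_cleaners(
    "Hello, World!!!  This is a test...",
    [lowercase, strip_ascii_punctuation, collapse_whitespace],
)
print(result)  # hello world this is a test
```

Since each step has the same signature, the real cleaners from unstructured.cleaners.core could be dropped into the same list.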

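3. Staging

Staging bricks convert elements into formats for downstream use, such as a list of dictionaries that can be serialized as JSON:

from unstructured.staging.base import convert_to_dict

json_data = convert_to_dict(elements)
print(json_data[:2])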
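For larger corpora, a list of element dictionaries is often stored as JSON Lines (one record per line) so it can be streamed or appended to. The following is a minimal standard-library sketch; the helper names `to_jsonl` and `from_jsonl` are hypothetical, and the sample records only mimic the "type" and "text" keys of the dictionaries produced above.

```python
import json


def to_jsonl(records) -> str:
    """Serialize each record as one JSON document per line."""
    # ensure_ascii=False keeps non-ASCII text (for example Chinese) readable in the output.
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


def from_jsonl(text: str):
    """Parse JSON Lines text back into a list of records."""
    # Split on "\n" only: json.dumps escapes newlines inside strings, so this is safe.
    return [json.loads(line) for line in text.split("\n") if line.strip()]


records = [
    {"type": "Title", "text": "标题"},
    {"type": "NarrativeText", "text": "First line\nsecond line"},
]
round_tripped = from_jsonl(to_jsonl(records))
```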

Integration and use cases

LangChain integration

from langchain_unstructured import UnstructuredLoader

# Load and partition locally
loader = UnstructuredLoader(file_path="example.pdf")
docs = loader.load()

# Load through the Serverless API (requires unstructured-client and an API key)
# loader = UnstructuredLoader(
#     file_path="example.pdf",
#     api_key="your_api_key",
#     partition_via_api=True,
#     strategy="hi_res"
# )
# docs = loader.load()

print(docs[0].page_content[:100])
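In a RAG pipeline, the loaded texts are usually packed into chunks of bounded size before embedding. The sketch below is one simple greedy approach using only the standard library; `pack_into_chunks` is a hypothetical helper, and it assumes the input is a list of strings such as `[d.page_content for d in docs]`.

```python
def pack_into_chunks(texts, max_chars=500, separator="\n\n"):
    """Greedily join consecutive texts into chunks of at most max_chars characters.

    A single text longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for text in texts:
        text = text.strip()
        if not text:
            continue
        candidate = text if not current else current + separator + text
        if len(candidate) <= max_chars or not current:
            # Either the text fits, or it is an oversized text starting a new chunk.
            current = candidate
        else:
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks


chunks = pack_into_chunks(["aaaa", "bbbb", "cccc"], max_chars=10, separator=" ")
```

Keeping oversized texts whole avoids cutting a sentence or table in half; splitting them further is a separate design decision.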

A practical workflow example

from unstructured.partition.auto import partition
from unstructured.cleaners.core import clean, remove_punctuation
import json
from loguru import logger

logger.add("app.log", rotation="1 MB", level="INFO")

try:
    logger.info("Starting PDF processing")
    # Parse the PDF; the hi_res strategy is used for more accurate tables
    elements = partition(filename="sample.pdf", strategy="hi_res")

    cleaned_elements = []
    for element in elements:
        text = clean(element.text, lowercase=True)
        text = remove_punctuation(text)
        cleaned_elements.append({"type": element.category, "text": text})

    logger.info("Text cleaning completed")

    # cleaned_elements is already a list of plain dicts, so dump it directly
    # (convert_to_dict expects element objects, not dicts)
    with open("output.json", "w", encoding="utf-8") as f:
        json.dump(cleaned_elements, f, indent=2, ensure_ascii=False)

    logger.info("JSON output saved")
except Exception:
    # logger.exception records the traceback before the error is re-raised
    logger.exception("Failed to process PDF")
    raise
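When the same workflow runs over many files, one bad document should not stop the whole batch. The sketch below isolates failures per file; `process_many` is a hypothetical helper, and `process_one` stands for any callable that handles a single path, such as a function wrapping the steps above.

```python
def process_many(paths, process_one):
    """Run process_one on every path, collecting results and failures separately."""
    results = {}
    failures = {}
    for path in paths:
        try:
            results[path] = process_one(path)
        except Exception as exc:
            # Record the error and continue with the remaining files.
            failures[path] = repr(exc)
    return results, failures


def fake_process(path):
    # Stand-in for a real per-file pipeline, used only to demonstrate the control flow.
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return len(path)


results, failures = process_many(["a.pdf", "b.bad", "c.docx"], fake_process)
```

The failures mapping can then be logged or retried with a different strategy.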
