The Python unstructured Library: A Guide to Parsing and Structuring Unstructured Data | 极客日志


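unstructured is an open-source Python library that parses unstructured documents such as PDF, Word, and HTML files into structured elements. It supports partitioning, cleaning, and staging, and can produce LLM-friendly JSON. It suits RAG systems, data preprocessing, and document analysis. It offers both local processing and a Serverless API, and it integrates with LangChain. Watch out for OCR dependency setup and the limits of parsing for certain formats.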

灰度发布 · Published 2026/3/22 · Updated 2026/5/2520 views

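Why unstructured

Machine learning and large language model (LLM) work often starts from unstructured documents: PDF, Word, HTML, and the like. unstructured is an open-source Python library built to split these documents into structured elements such as titles, paragraphs, lists, and tables, and to convert them to JSON for downstream RAG systems or model training.

It has a modular design (the building blocks are called "bricks") covering partitioning, cleaning, and staging, and it supports 25+ file formats, including TXT, PDF, DOCX, PPTX, HTML, and images. The core library is open source (Apache 2 license), and a Serverless API is available for production use.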

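Installation and environment setup

Basic dependencies

Make sure your Python version is 3.8 or later (3.9+ recommended; recent releases of the library may no longer support 3.8, so check the release notes for the version you install).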

pip install unstructured

If you only need support for specific formats, install the matching extra, for example DOCX:

pip install "unstructured[docx]"

The full install includes the local inference dependencies (for example, PDF and image processing):

pip install "unstructured[local-inference]"

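System-level dependencies

Processing PDFs and images locally usually needs extra tools, and this is the step where things most often go wrong:

Tesseract: OCR. On Mac, run brew install tesseract; on Linux, install it from your package manager (for example, sudo apt-get install tesseract-ocr on Ubuntu) or build it from source.
Poppler: PDF parsing. Follow the installation notes in the pdf2image documentation.
Pandoc: handles EPUB, RTF, and similar formats; version 2.14.2+ is recommended.
libmagic: file type detection. On Mac, run brew install libmagic; on Ubuntu, run sudo apt-get install libmagic1.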
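
Because missing system tools are the most common cause of failures, a quick pre-flight check can help. The sketch below is not part of unstructured; it only uses the standard library to see whether the usual executables are on PATH. The executable names (tesseract, pdftoppm for Poppler, pandoc) are the common ones on Mac and Linux and are an assumption that may not hold on your platform. libmagic is a shared library rather than an executable, so it is not covered here.

```python
import shutil

# Executable names are assumptions based on common Mac/Linux installs.
REQUIRED_TOOLS = {
    "Tesseract (OCR)": "tesseract",
    "Poppler (PDF rendering)": "pdftoppm",
    "Pandoc (EPUB, RTF)": "pandoc",
}


def check_tools(tools=REQUIRED_TOOLS):
    """Return a dict mapping each tool label to its resolved path, or None if missing."""
    return {label: shutil.which(binary) for label, binary in tools.items()}


if __name__ == "__main__":
    for label, path in check_tools().items():
        status = path if path else "NOT FOUND"
        print(f"{label}: {status}")
```

Any entry reported as NOT FOUND points at a tool to install before trying local PDF or image parsing.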

Verify that the installation succeeded:

from unstructured.__version__ import __version__
print(__version__)
# Example output: 0.16.17

If you use the Serverless API, install the client separately:

pip install unstructured-client

Docker users can pull the official image to simplify environment setup:

docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest
docker exec -it unstructured bash

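Core features in practice

The core workflow in unstructured is: Partitioning → Cleaning → Staging.

1. Partitioning: breaking a document into elements

The partition function detects the file type automatically and calls the matching parser. It returns a list of Element objects, each carrying a category (such as Title or NarrativeText) and its text.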

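from unstructured.partition.auto import partition

# Replace with the path to your own document
elements = partition(filename="example.pdf")

for element in elements[:5]:
    print(f"{element.category}: {element.text}")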


Sample output:

Title: Introduction
NarrativeText: This is the first paragraph...
ListItem: - Item 1
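
Once a document is partitioned, a tally of categories is a cheap sanity check on the result. This is a minimal sketch, assuming only that each element exposes category and text attributes as in the output above. The SimpleNamespace objects are stand-ins so the snippet runs without any document, and summarize_categories is a hypothetical helper name.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_categories(elements):
    """Count how many elements fall into each category."""
    return Counter(el.category for el in elements)


# Stand-in elements mirroring the sample output; in practice pass the
# list returned by partition().
sample = [
    SimpleNamespace(category="Title", text="Introduction"),
    SimpleNamespace(category="NarrativeText", text="This is the first paragraph..."),
    SimpleNamespace(category="ListItem", text="- Item 1"),
    SimpleNamespace(category="ListItem", text="- Item 2"),
]

counts = summarize_categories(sample)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

An unexpectedly high Title count, for instance, can hint that list items or short lines are being misclassified.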

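For PDFs, you can call partition_pdf directly and pick a strategy:

from unstructured.partition.pdf import partition_pdf

# The hi_res strategy uses computer vision (layout detection) to extract tables
elements = partition_pdf(filename="example.pdf", strategy="hi_res")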
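
Which strategy to pass is a judgment call. The hypothetical helper below encodes one possible rule of thumb: use hi_res only when tables or layout matter, ocr_only for scanned pages, and fast otherwise. The strategy names are ones partition_pdf accepts; the decision rule and the name choose_strategy are assumptions for illustration, not a recommendation from the library.

```python
def choose_strategy(needs_tables: bool, is_scanned: bool) -> str:
    """Pick a partition_pdf strategy from two coarse document traits."""
    if needs_tables:
        return "hi_res"      # layout model; slowest, best suited to tables
    if is_scanned:
        return "ocr_only"    # no text layer, so OCR is required
    return "fast"            # extract the embedded text directly


print(choose_strategy(needs_tables=True, is_scanned=False))
print(choose_strategy(needs_tables=False, is_scanned=True))
print(choose_strategy(needs_tables=False, is_scanned=False))
```

The returned string can be passed as the strategy argument of partition_pdf.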

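2. Cleaning: sanitizing text data

The functions in unstructured.cleaners.core normalize element text before it is passed downstream.

from unstructured.cleaners.core import clean, remove_punctuation

text = "Hello, World!!! This is a test..."
cleaned_text = clean(text, lowercase=True)  # convert to lowercase
cleaned_text = remove_punctuation(cleaned_text)  # strip punctuation

print(cleaned_text)
# Output: hello world this is a test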
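
When several cleaning steps are needed, it can be convenient to chain them. The sketch below shows the pattern with two stand-in cleaners written with the standard library so that it runs anywhere. The names lowercase, strip_punctuation, and apply_cleaners are hypothetical, and the stand-ins only approximate clean(..., lowercase=True) and remove_punctuation for ASCII punctuation. In a real pipeline you would pass the unstructured functions instead.

```python
import string
from functools import reduce


def lowercase(text: str) -> str:
    """Stand-in cleaner: lowercase and trim surrounding whitespace."""
    return text.lower().strip()


def strip_punctuation(text: str) -> str:
    """Stand-in cleaner: drop ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def apply_cleaners(text: str, cleaners) -> str:
    """Apply each cleaner in order, feeding the output of one into the next."""
    return reduce(lambda acc, fn: fn(acc), cleaners, text)


result = apply_cleaners("Hello, World!!! This is a test...", [lowercase, strip_punctuation])
print(result)
```

Keeping the cleaners in a list makes the order explicit and easy to change per document type.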

3. Staging: producing input for downstream tools

from unstructured.staging.base import convert_to_dict

json_data = convert_to_dict(elements)
print(json_data[:2])

Sample output (abridged):

[
  {"type": "Title", "text": "Introduction", "metadata": {...}},
  {"type": "NarrativeText", "text": "This is the first paragraph...", "metadata": {...}}
]
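
For ingestion into a vector store or a training set, one record per line (JSON Lines) is often easier to stream than a single array. This is a minimal sketch that takes dictionaries shaped like the output above; to_jsonl is a hypothetical helper, and the metadata values are made up for the demo.

```python
import json


def to_jsonl(element_dicts) -> str:
    """Serialize element dicts to JSON Lines, skipping elements with empty text."""
    lines = [
        json.dumps(d, ensure_ascii=False)
        for d in element_dicts
        if d.get("text", "").strip()
    ]
    return "\n".join(lines)


records = [
    {"type": "Title", "text": "Introduction", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "   ", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "This is the first paragraph...", "metadata": {"page_number": 1}},
]

jsonl = to_jsonl(records)
print(jsonl)
```

The whitespace-only record is dropped, so the demo prints two lines.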

Integrating with LangChain

from langchain_unstructured import UnstructuredLoader

# Load locally
loader = UnstructuredLoader(file_path="example.pdf")
docs = loader.load()
print(docs[0].page_content[:100])

# Or use the Serverless API (requires an API key)
loader_api = UnstructuredLoader(
    file_path="example.pdf",
    api_key="your_api_key",
    partition_via_api=True,  # without this flag the file is partitioned locally
    strategy="hi_res"
)
docs_api = loader_api.load()

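Production deployment

Serverless API

from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

client = UnstructuredClient(api_key_auth="your_api_key")
with open("example.pdf", "rb") as f:
    files = shared.Files(content=f.read(), file_name="example.pdf")
# Recent unstructured-client releases wrap the parameters in a PartitionRequest;
# older releases may expect a different call shape, so check your installed version.
params = shared.PartitionParameters(files=files, strategy=shared.Strategy.HI_RES)
request = operations.PartitionRequest(partition_parameters=params)
response = client.general.partition(request=request)
print(response.elements[:2])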
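
Hard-coding the key as in the snippet above is fine for a demo but risky once the code lives in a repository. Below is a sketch of reading it from the environment instead. The variable name UNSTRUCTURED_API_KEY and the helper get_api_key are assumptions, so use whatever name your deployment defines.

```python
import os


def get_api_key(env_var: str = "UNSTRUCTURED_API_KEY") -> str:
    """Return the API key from the environment or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API")
    return key


if __name__ == "__main__":
    try:
        key = get_api_key()
        print(f"Key loaded ({len(key)} characters)")
    except RuntimeError as err:
        print(err)
```

The returned value can then be passed wherever the examples use "your_api_key".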

Docker deployment

from unstructured.partition.auto import partition

# After mounting a data volume, read the file from /data/example.pdf
elements = partition(filename="/data/example.pdf")
print([str(el) for el in elements[:5]])
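
Inside a container you will usually process a whole mounted directory rather than a single file. The sketch below collects candidate files by extension. The /data mount point follows the snippet above, while the suffix set and the helper name find_documents are assumptions; adjust them to the formats you have installed support for.

```python
from pathlib import Path

# Assumed subset of formats; extend to match the extras you installed.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".html", ".txt"}


def find_documents(root, suffixes=SUPPORTED_SUFFIXES):
    """Return sorted paths under root whose suffix is in the supported set."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


if __name__ == "__main__":
    for path in find_documents("/data"):
        # Each path can then be passed to partition(filename=str(path)).
        print(path)
```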

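Notes and optimization

Dependency management: local PDF and image processing requires tesseract and poppler; without them, parsing fails. If you only handle a single format, install the format-specific extra to keep the footprint small.
Performance trade-offs: strategy="hi_res" is more accurate but computationally expensive. For text-based PDFs, the default strategy (or strategy="fast") is noticeably quicker.
Parsing limitations: DOCX list items are sometimes misidentified as titles, and large documents may lack parent-child relationship annotations, which hurts the context an LLM sees. Validate the output in a post-processing step.
Privacy settings: the library includes a lightweight analytics ping, which can be disabled with environment variables: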
```
export DO_NOT_TRACK=true
export SCARF_NO_ANALYTICS=true
```
Alternatives: if raw speed is the priority, look at extractous; if you are focused on layout analysis, LayoutParser is also a good choice.
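
For the missing parent-child relationships noted above, a simple post-processing pass can attach each element to the most recent title. This is a minimal sketch on dictionaries with type and text keys. group_by_title is a hypothetical helper, not the library's own chunking, and it treats every Title as the start of a new section, so it will over-split if list items are misclassified as titles.

```python
def group_by_title(element_dicts):
    """Group a flat element list into sections, starting a new one at each Title."""
    sections = []
    current = {"title": None, "body": []}
    for el in element_dicts:
        if el["type"] == "Title":
            # Close the previous section if it holds anything.
            if current["title"] is not None or current["body"]:
                sections.append(current)
            current = {"title": el["text"], "body": []}
        else:
            current["body"].append(el["text"])
    if current["title"] is not None or current["body"]:
        sections.append(current)
    return sections


flat = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "This is the first paragraph..."},
    {"type": "ListItem", "text": "Item 1"},
    {"type": "Title", "text": "Methods"},
    {"type": "NarrativeText", "text": "Details follow."},
]

sections = group_by_title(flat)
for section in sections:
    print(section["title"], "->", len(section["body"]), "elements")
```

Elements that appear before the first title end up in a section whose title is None, which makes them easy to spot and review.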

Comprehensive example

from unstructured.partition.auto import partition
from unstructured.cleaners.core import clean, remove_punctuation
from langchain_unstructured import UnstructuredLoader
import json
from loguru import logger

logger.add("app.log", rotation="1 MB", level="INFO")

try:
    logger.info("Starting PDF processing")
    # Parse the PDF, using the hi_res strategy to extract tables
    elements = partition(filename="sample.pdf", strategy="hi_res")
except Exception:
    logger.exception("Failed to process PDF")
    raise

# Clean the text of each element
cleaned_elements = []
for element in elements:
    text = clean(element.text, lowercase=True)
    text = remove_punctuation(text)
    cleaned_elements.append({"type": element.category, "text": text})

logger.info("Text cleaning completed")

# Save as JSON. cleaned_elements is already a list of plain dicts, so it is
# dumped directly; convert_to_dict expects Element objects, not dicts.
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(cleaned_elements, f, indent=2, ensure_ascii=False)

logger.info("JSON output saved")

# LangChain integration
loader = UnstructuredLoader(file_path="sample.pdf", strategy="hi_res")
docs = loader.load()
logger.info(f"Loaded {len(docs)} documents")
print(docs[0].page_content[:100])
