from unstructured.partition.auto import partition
# Parse a PDF file
elements = partition(filename="example.pdf")
for element in elements[:5]:
    print(f"{element.category}: {element.text}")
Sample output:
Title: Introduction
NarrativeText: This is the first paragraph...
ListItem: - Item 1
partition returns a list of Element objects, each with a category (Title, NarrativeText, ListItem, Table, etc.) and metadata.
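Because every Element carries a category, a common follow-up step is grouping the parsed elements by type. A stdlib-only sketch of that pattern (the Element objects are faked here as (category, text) tuples; real code would read element.category and element.text):

```python
from collections import defaultdict

def group_by_category(elements):
    # Collect element texts under their category label
    groups = defaultdict(list)
    for category, text in elements:
        groups[category].append(text)
    return dict(groups)

# Faked stand-ins for the Element objects returned by partition()
fake_elements = [
    ("Title", "Introduction"),
    ("NarrativeText", "This is the first paragraph..."),
    ("ListItem", "- Item 1"),
    ("ListItem", "- Item 2"),
]
print(group_by_category(fake_elements)["ListItem"])  # → ['- Item 1', '- Item 2']
```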
Partitioning specific file types:
from unstructured.partition.pdf import partition_pdf
# High-resolution parsing (including tables)
elements = partition_pdf(filename="example.pdf", strategy="hi_res")
Notes:
strategy="hi_res" uses computer vision and OCR to extract tables, and is suited to complex PDFs.
Requires tesseract and poppler to be installed.
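A typical way to install these dependencies (package names assume pip plus a Debian/Ubuntu system; adjust for your platform):

```shell
# The pdf extra pulls in the hi_res parsing dependencies
pip install "unstructured[pdf]"

# System packages for OCR and PDF rendering (Debian/Ubuntu)
sudo apt-get install -y tesseract-ocr poppler-utils
```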
3.2 Cleaning Bricks
Remove irrelevant content such as boilerplate text, punctuation, or sentence fragments.
from unstructured.cleaners.core import clean, remove_punctuation
text = "Hello, World!!! This is a test..."
cleaned_text = clean(text, lowercase=True)  # lowercase and clean the text
cleaned_text = remove_punctuation(cleaned_text)  # strip punctuation
print(cleaned_text)  # Output: hello world this is a test
Notes:
Supported cleaning operations include lowercasing, punctuation removal, boilerplate deletion, and more.
These can be combined with partitioning results to process the text of extracted Elements.
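To see what these cleaners do under the hood, here is a minimal pure-Python sketch of the two operations used above (stdlib only; this is an illustration, not the library's actual implementation):

```python
import re
import string

def clean_text(text, lowercase=False):
    # Collapse runs of whitespace and trim, roughly what clean() does
    cleaned = re.sub(r"\s+", " ", text).strip()
    return cleaned.lower() if lowercase else cleaned

def strip_punctuation(text):
    # Drop all ASCII punctuation, roughly what remove_punctuation() does
    return text.translate(str.maketrans("", "", string.punctuation))

raw = "Hello, World!!!  This is a test..."
print(strip_punctuation(clean_text(raw, lowercase=True)))  # → hello world this is a test
```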
3.3 Staging Bricks
Format data as input for downstream tasks, such as JSON or LLM training data.
from unstructured.staging.base import convert_to_dict
# Convert elements to a list of JSON-serializable dicts
elements = partition(filename="example.docx")
json_data = convert_to_dict(elements)
print(json_data[:2])  # print the first two elements
Sample output:
[{"type":"Title","text":"Introduction","metadata":{...}},{"type":"NarrativeText","text":"This is the first paragraph...","metadata":{...}}]
Notes:
convert_to_dict turns the element list into JSON-serializable dicts, suitable for LLMs or data analysis.
Other staging functions are available, such as stage_for_transformers (Hugging Face integration).
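The dict shape shown above is easy to reproduce; a stdlib-only sketch of what a staging step like convert_to_dict produces (FakeElement is a hypothetical stand-in for the library's Element class):

```python
import json
from dataclasses import dataclass, field

@dataclass
class FakeElement:
    # Hypothetical stand-in for unstructured's Element
    category: str
    text: str
    metadata: dict = field(default_factory=dict)

def to_dicts(elements):
    # Mirror the {"type", "text", "metadata"} shape seen in the sample output
    return [
        {"type": el.category, "text": el.text, "metadata": el.metadata}
        for el in elements
    ]

fake_elements = [FakeElement("Title", "Introduction", {"page_number": 1})]
print(json.dumps(to_dicts(fake_elements)))
```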
from unstructured_client import UnstructuredClient
client = UnstructuredClient(api_key_auth="your_api_key")
with open("example.pdf", "rb") as f:
    response = client.general.partition(file=f, strategy="hi_res")
print(response.elements[:2])
Notes:
Requires the unstructured-client package.
The Serverless API offers higher performance and is suited to production environments.
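Hosted APIs can return transient errors under load, so production callers often wrap the request in a retry loop. A generic stdlib sketch (with_retries is our own helper, not part of unstructured-client):

```python
import time

def with_retries(fn, attempts=3, backoff=1.0):
    # Call fn(), retrying with exponential backoff on any exception;
    # re-raise after the final attempt fails
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(backoff * 2 ** i)

# Usage sketch (assuming `client` and `f` from the snippet above):
# response = with_retries(lambda: client.general.partition(file=f, strategy="hi_res"))
```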
3.6 Docker Deployment
Run unstructured inside a Docker container.
# Run a Python script inside the container
from unstructured.partition.auto import partition
elements = partition(filename="/data/example.pdf")
print([str(el) for el in elements[:5]])
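A typical invocation mounts a local directory as /data and runs a script in the container (the image name follows the project's published image; verify the current tag against the official docs, and process.py is a hypothetical script of your own):

```shell
# Pull the published image (verify the current name/tag in the docs)
docker pull downloads.unstructured.io/unstructured-io/unstructured:latest

# Mount the current directory as /data and run a script inside the container
docker run -it --rm -v "$(pwd)":/data \
  downloads.unstructured.io/unstructured-io/unstructured:latest \
  python3 /data/process.py
```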
from unstructured.partition.auto import partition
from unstructured.cleaners.core import clean, remove_punctuation
from unstructured.staging.base import convert_to_dict
from langchain_unstructured import UnstructuredLoader
import json
# Configure logging (using loguru)
from loguru import logger
logger.add("app.log", rotation="1 MB", level="INFO")
# 解析 PDF
logger.info("Starting PDF processing")
try:
    elements = partition(filename="sample.pdf", strategy="hi_res")
except Exception:
    logger.exception("Failed to process PDF")
    raise

# Clean the text
cleaned_elements = []
for element in elements:
    text = clean(element.text, lowercase=True)
    text = remove_punctuation(text)
    cleaned_elements.append({"type": element.category, "text": text})
logger.info("Text cleaning completed")
# Write the cleaned elements to JSON (they are already plain dicts,
# so no staging conversion is needed here)
with open("output.json", "w") as f:
    json.dump(cleaned_elements, f, indent=2)
logger.info("JSON output saved")
# LangChain 集成
loader = UnstructuredLoader(file_path="sample.pdf", strategy="hi_res")
docs = loader.load()
logger.info(f"Loaded {len(docs)} documents")
# Print the first 100 characters
print(docs[0].page_content[:100])
Sample output (app.log):
2025-05-09T01:33:56.123 | INFO | Starting PDF processing
2025-05-09T01:33:57.124 | INFO | Text cleaning completed
2025-05-09T01:33:57.125 | INFO | JSON output saved
2025-05-09T01:33:57.126 | INFO | Loaded 1 documents