from unstructured.partition.auto import partition
# Parse a PDF file
elements = partition(filename="example.pdf")
for element in elements[:5]:
    print(f"{element.category}: {element.text}")
Sample output:
Title: Introduction
NarrativeText: This is the first paragraph...
ListItem: - Item 1
partition returns a list of Element objects, each with a category (Title, NarrativeText, ListItem, Table, etc.) and metadata.
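Because every Element carries a category, a common follow-up step is grouping the parsed elements by type. A stdlib-only sketch of that pattern (the Element objects are faked here as (category, text) tuples; real code would read element.category and element.text):

```python
from collections import defaultdict

def group_by_category(elements):
    # Collect element texts under their category label
    groups = defaultdict(list)
    for category, text in elements:
        groups[category].append(text)
    return dict(groups)

# Faked stand-ins for the Element objects returned by partition()
fake_elements = [
    ("Title", "Introduction"),
    ("NarrativeText", "This is the first paragraph..."),
    ("ListItem", "- Item 1"),
    ("ListItem", "- Item 2"),
]
print(group_by_category(fake_elements)["ListItem"])  # → ['- Item 1', '- Item 2']
```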
Partitioning specific file types:
from unstructured.partition.pdf import partition_pdf
# High-resolution parsing (including tables)
elements = partition_pdf(filename="example.pdf", strategy="hi_res")
Notes:
strategy="hi_res" uses computer vision and OCR to extract tables, and is suited to complex PDFs.
Requires tesseract and poppler to be installed.
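A typical way to install these dependencies (package names assume pip plus a Debian/Ubuntu system; adjust for your platform):

```shell
# The pdf extra pulls in the hi_res parsing dependencies
pip install "unstructured[pdf]"

# System packages for OCR and PDF rendering (Debian/Ubuntu)
sudo apt-get install -y tesseract-ocr poppler-utils
```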
3.2 Cleaning Bricks
Remove irrelevant content such as boilerplate text, punctuation, or sentence fragments.
from unstructured.cleaners.core import clean, remove_punctuation
text = "Hello, World!!! This is a test..."
cleaned_text = clean(text, lowercase=True)  # lowercase and clean the text
cleaned_text = remove_punctuation(cleaned_text)  # strip punctuation
print(cleaned_text)  # Output: hello world this is a test
Notes:
Supported cleaning operations include lowercasing, punctuation removal, boilerplate deletion, and more.
These can be combined with partitioning results to process the text of extracted Elements.
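To see what these cleaners do under the hood, here is a minimal pure-Python sketch of the two operations used above (stdlib only; this is an illustration, not the library's actual implementation):

```python
import re
import string

def clean_text(text, lowercase=False):
    # Collapse runs of whitespace and trim, roughly what clean() does
    cleaned = re.sub(r"\s+", " ", text).strip()
    return cleaned.lower() if lowercase else cleaned

def strip_punctuation(text):
    # Drop all ASCII punctuation, roughly what remove_punctuation() does
    return text.translate(str.maketrans("", "", string.punctuation))

raw = "Hello, World!!!  This is a test..."
print(strip_punctuation(clean_text(raw, lowercase=True)))  # → hello world this is a test
```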
3.3 Staging Bricks
Format data as input for downstream tasks, such as JSON or LLM training data.
from unstructured.staging.base import convert_to_dict
# Convert elements to a list of JSON-serializable dicts
elements = partition(filename="example.docx")
json_data = convert_to_dict(elements)
print(json_data[:2])  # print the first two elements
Sample output:
[{"type":"Title","text":"Introduction","metadata":{...}},{"type":"NarrativeText","text":"This is the first paragraph...","metadata":{...}}]
Notes:
convert_to_dict turns the element list into JSON-serializable dicts, suitable for LLMs or data analysis.
Other staging functions are available, such as stage_for_transformers (Hugging Face integration).
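The dict shape shown above is easy to reproduce; a stdlib-only sketch of what a staging step like convert_to_dict produces (FakeElement is a hypothetical stand-in for the library's Element class):

```python
import json
from dataclasses import dataclass, field

@dataclass
class FakeElement:
    # Hypothetical stand-in for unstructured's Element
    category: str
    text: str
    metadata: dict = field(default_factory=dict)

def to_dicts(elements):
    # Mirror the {"type", "text", "metadata"} shape seen in the sample output
    return [
        {"type": el.category, "text": el.text, "metadata": el.metadata}
        for el in elements
    ]

fake_elements = [FakeElement("Title", "Introduction", {"page_number": 1})]
print(json.dumps(to_dicts(fake_elements)))
```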
from unstructured_client import UnstructuredClient
client = UnstructuredClient(api_key_auth="your_api_key")
with open("example.pdf", "rb") as f:
    response = client.general.partition(file=f, strategy="hi_res")
print(response.elements[:2])
Notes:
Requires the unstructured-client package.
The Serverless API offers higher performance and is suited to production environments.
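Hosted APIs can return transient errors under load, so production callers often wrap the request in a retry loop. A generic stdlib sketch (with_retries is our own helper, not part of unstructured-client):

```python
import time

def with_retries(fn, attempts=3, backoff=1.0):
    # Call fn(), retrying with exponential backoff on any exception;
    # re-raise after the final attempt fails
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(backoff * 2 ** i)

# Usage sketch (assuming `client` and `f` from the snippet above):
# response = with_retries(lambda: client.general.partition(file=f, strategy="hi_res"))
```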
3.6 Docker Deployment
Run unstructured inside a Docker container.
# Run a Python script inside the container
from unstructured.partition.auto import partition
elements = partition(filename="/data/example.pdf")
print([str(el) for el in elements[:5]])
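A typical invocation mounts a local directory as /data and runs a script in the container (the image name follows the project's published image; verify the current tag against the official docs, and process.py is a hypothetical script of your own):

```shell
# Pull the published image (verify the current name/tag in the docs)
docker pull downloads.unstructured.io/unstructured-io/unstructured:latest

# Mount the current directory as /data and run a script inside the container
docker run -it --rm -v "$(pwd)":/data \
  downloads.unstructured.io/unstructured-io/unstructured:latest \
  python3 /data/process.py
```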
from unstructured.partition.auto import partition
from unstructured.cleaners.core import clean, remove_punctuation
from unstructured.staging.base import convert_to_dict
from langchain_unstructured import UnstructuredLoader
import json
# Configure logging (using loguru)
from loguru import logger
logger.add("app.log", rotation="1 MB", level="INFO")
# 解析 PDF
logger.info("Starting PDF processing")
try:
    elements = partition(filename="sample.pdf", strategy="hi_res")
except Exception:
    logger.exception("Failed to process PDF")
    raise

# Clean the text
cleaned_elements = []
for element in elements:
    text = clean(element.text, lowercase=True)
    text = remove_punctuation(text)
    cleaned_elements.append({"type": element.category, "text": text})
logger.info("Text cleaning completed")
# Write the cleaned elements to JSON (they are already plain dicts,
# so no staging conversion is needed here)
with open("output.json", "w") as f:
    json.dump(cleaned_elements, f, indent=2)
logger.info("JSON output saved")
# LangChain 集成
loader = UnstructuredLoader(file_path="sample.pdf", strategy="hi_res")
docs = loader.load()
logger.info(f"Loaded {len(docs)} documents")
# Print the first 100 characters
print(docs[0].page_content[:100])
Sample output (app.log):
2025-05-09T01:33:56.123 | INFO | Starting PDF processing
2025-05-09T01:33:57.124 | INFO | Text cleaning completed
2025-05-09T01:33:57.125 | INFO | JSON output saved
2025-05-09T01:33:57.126 | INFO | Loaded 1 documents