Extracting Word Document Data in Python with python-docx | 极客日志


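How to use Python's python-docx library to extract text, tables, images, hyperlinks, and other data from Word (.docx) documents. Covers installation, basic traversal, exporting tables to CSV, saving images, and reading headers and footers, with complete example code and error handling, for office automation and data mining.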

By 晚风叙旧 · Published 2026/3/25 · Updated 2026/5/2329 views

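A detailed guide to extracting data from Word documents with Python

This article shows how to extract data from Word documents (.docx format) efficiently with Python. Word documents often hold structured information such as text, tables, images, and lists, and automating the extraction speeds up later analysis and processing. Using the python-docx library, it walks through installation, basic operations, and more advanced techniques.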

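1. Setup: installing and importing the library

First, install python-docx with pip:

pip install python-docx

Import the library and load a Word document:

from docx import Document
# Load the Word document, assuming the file is named "example.docx"
doc = Document("example.docx")

Both relative and absolute paths work. Make sure the file exists; otherwise Document() raises an exception.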
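
Since a bad path only fails once Document() is called, it can help to check the file first. The following is a minimal sketch using only the standard library. The helper name check_docx_path is hypothetical and not part of python-docx; it relies on the fact that a .docx file is a ZIP package.

```python
import zipfile
from pathlib import Path


def check_docx_path(path):
    """Return a Path if it looks like a loadable .docx file, else raise.

    Hypothetical helper for illustration; not part of python-docx.
    """
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"No such file: {p}")
    if p.suffix.lower() != ".docx":
        raise ValueError(f"Expected a .docx file, got: {p.name}")
    # A .docx file is a ZIP archive, so a quick signature check
    # catches renamed .doc files and truncated downloads early.
    if not zipfile.is_zipfile(p):
        raise ValueError(f"Not a valid .docx (ZIP) package: {p.name}")
    return p
```

The returned path can then be passed on, for example Document(str(check_docx_path("example.docx"))).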

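2. Extracting text

Text is the core of a Word document. Headings and body text are all stored as paragraphs, and python-docx exposes the document body as a sequence of paragraphs.

Extract the text of every paragraph:

# Walk through all paragraphs and collect their text
all_text = []
for paragraph in doc.paragraphs:
    all_text.append(paragraph.text)
# Print the result
print("Full document text:")
for text in all_text:
    print(text)

Note: the paragraphs property returns a list in which each element represents one paragraph, and paragraph.text returns its plain text. Text inside table cells is not included in doc.paragraphs.
Use case: extracting the body of a report, an article, and similar content.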

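Extracting headings: Word marks headings with styles (for example 'Heading 1' and 'Heading 2', shown as '标题 1' and '标题 2' in a Chinese-language Word UI). To extract every heading:

headings = []
for paragraph in doc.paragraphs:
    # Keep paragraphs whose style name starts with "Heading"
    if paragraph.style.name.startswith('Heading'):
        headings.append(paragraph.text)
print("Document headings:")
for heading in headings:
    print(heading)

Tip: checking style.name covers every heading level. To target a single level, compare against the full style name, for example 'Heading 2'.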
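
The heading level can also be recovered from the style name, which makes it possible to print an indented outline. This is a minimal sketch that assumes the built-in style names "Heading 1" to "Heading 9" (python-docx reports built-in styles by their English names). The function names heading_outline and format_outline are hypothetical.

```python
import re

# Built-in heading styles are named "Heading 1" ... "Heading 9"
_HEADING_RE = re.compile(r"^Heading (\d)$")


def heading_outline(paragraphs):
    """Return (level, text) pairs for paragraphs styled as headings."""
    outline = []
    for paragraph in paragraphs:
        match = _HEADING_RE.match(paragraph.style.name)
        if match:
            outline.append((int(match.group(1)), paragraph.text))
    return outline


def format_outline(outline):
    """Indent each heading by two spaces per level below level 1."""
    return "\n".join("  " * (level - 1) + text for level, text in outline)
```

With a loaded document this would be called as print(format_outline(heading_outline(doc.paragraphs))).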


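3. Extracting table data

Extract the first table:

# Assumes the document contains at least one table
table = doc.tables[0]  # first table in the document
table_data = []
for row in table.rows:
    row_data = []
    for cell in row.cells:
        # Collect the text of each cell
        row_data.append(cell.text)
    table_data.append(row_data)
print("Table data:")
for row in table_data:
    print(row)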
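
When the first row of a table holds column names, the extracted rows are often easier to work with as dictionaries. A minimal sketch, assuming the first row is a header row (this is not guaranteed for every table); rows_as_dicts is a hypothetical helper name.

```python
def rows_as_dicts(table_data):
    """Turn [[header...], [row...], ...] into a list of dicts keyed by header.

    Assumes the first row contains the column names.
    """
    if not table_data:
        return []
    header = [name.strip() for name in table_data[0]]
    return [dict(zip(header, row)) for row in table_data[1:]]
```

It takes the table_data list built above, for example records = rows_as_dicts(table_data).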

Export every table to its own CSV file:

import csv
for i, table in enumerate(doc.tables):
    table_data = []
    for row in table.rows:
        row_data = [cell.text for cell in row.cells]
        table_data.append(row_data)
    # Save this table to a CSV file
    with open(f"table_{i}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows(table_data)
print(f"Extracted and saved {len(doc.tables)} tables to CSV files.")
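
If the CSV files will be opened in Excel, non-ASCII text such as Chinese can appear garbled when the file is plain UTF-8. One common workaround is to write a byte order mark. A minimal sketch; write_rows_csv is a hypothetical helper name.

```python
import csv


def write_rows_csv(rows, path):
    """Write rows to a CSV file that Excel opens with the correct encoding."""
    # "utf-8-sig" prepends a byte order mark, which Excel uses to detect UTF-8
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerows(rows)
```

Other tools that read the file should then also open it with encoding="utf-8-sig" so the mark is stripped.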

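4. Extracting images

Save every embedded image to a directory. The bytes are written unchanged, so each file keeps its original format:

import os

# Create the output directory
os.makedirs("extracted_images", exist_ok=True)
image_count = 0
for rel in doc.part.rels.values():
    # Keep image relationships only; externally linked images carry no embedded data
    if "image" in rel.reltype and not rel.is_external:
        image_part = rel.target_part
        # Reuse the original extension (.png, .jpeg, .emf, ...) from the part name
        ext = os.path.splitext(image_part.partname)[1]
        with open(f"extracted_images/image_{rel.rId}{ext}", "wb") as f:
            f.write(image_part.blob)  # raw image bytes
        image_count += 1
print(f"Extracted and saved {image_count} images.")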

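5. Extracting other elements

List items:

lists = []
for paragraph in doc.paragraphs:
    # Built-in list styles include "List Paragraph", "List Bullet", and "List Number"
    if paragraph.style.name.startswith("List"):
        lists.append(paragraph.text)
print("List items:")
for item in lists:
    print(f"- {item}")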

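Hyperlinks (requires python-docx 1.0 or later):

hyperlinks = []
for paragraph in doc.paragraphs:
    # Paragraph.hyperlinks lists the hyperlinks contained in this paragraph
    for link in paragraph.hyperlinks:
        # Store the link text and its target URL
        hyperlinks.append((link.text, link.address))
print("Hyperlinks:")
for text, url in hyperlinks:
    print(f"Text: '{text}', URL: {url}")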

Headers and footers:

header_text = []
footer_text = []
for section in doc.sections:
    header = section.header
    for paragraph in header.paragraphs:
        header_text.append(paragraph.text)
    footer = section.footer
    for paragraph in footer.paragraphs:
        footer_text.append(paragraph.text)
print("Header text:", header_text)
print("Footer text:", footer_text)
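
In documents with several sections, a section often reuses the header of the section before it, so the loop above can collect the same text more than once. The sketch below skips such inherited headers by checking the is_linked_to_previous property that python-docx provides on headers and footers. The function name collect_header_text is hypothetical.

```python
def collect_header_text(sections):
    """Collect header paragraph text, skipping headers inherited from the previous section."""
    texts = []
    for section in sections:
        header = section.header
        # A linked header has no content of its own; it repeats the earlier one
        if header.is_linked_to_previous:
            continue
        texts.extend(p.text for p in header.paragraphs if p.text.strip())
    return texts
```

It would be called as collect_header_text(doc.sections). The same pattern applies to section.footer.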

6. Advanced tips and caveats

Complete example that combines the steps above in one function:

from docx import Document

def extract_word_data(file_path):
    doc = Document(file_path)
    results = {
        "text": [p.text for p in doc.paragraphs],
        "tables": [[[cell.text for cell in row.cells] for row in table.rows] for table in doc.tables],
        # Number of image relationships in the main document part
        "images": sum(1 for rel in doc.part.rels.values() if "image" in rel.reltype),
        # Paragraph.hyperlinks requires python-docx 1.0 or later
        "hyperlinks": [(link.text, link.address) for p in doc.paragraphs for link in p.hyperlinks],
    }
    return results

# Usage example
data = extract_word_data("example.docx")
print("Extraction result:", data)
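
The dictionary returned by extract_word_data contains only strings, lists, tuples, and integers, so it can be saved as JSON for later processing. A minimal sketch; save_results is a hypothetical helper name, and tuples are written as JSON arrays.

```python
import json


def save_results(results, path):
    """Write the extraction results to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable in the file
        json.dump(results, f, ensure_ascii=False, indent=2)
```

For example, save_results(data, "example.json") after the call shown above.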

Error handling when loading a document:

try:
    doc = Document("example.docx")
except Exception as e:
    print(f"Failed to load: {e}")

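Formatting of each run (font name, size, and color):

for paragraph in doc.paragraphs:
    for run in paragraph.runs:
        font = run.font
        # Each value is None when the run inherits that setting from its style
        print(f"Text: '{run.text}', font: {font.name}, size: {font.size}, color: {font.color.rgb}")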
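
The font.size value printed above is a length in English Metric Units (EMU), not points, and it is None when the size is inherited. python-docx length objects also offer a .pt property; the sketch below shows the underlying arithmetic with plain integers. The name size_in_points is hypothetical.

```python
EMU_PER_POINT = 12700  # 914400 EMU per inch / 72 points per inch


def size_in_points(size):
    """Convert a font size in EMU to points; None means inherited from the style."""
    if size is None:
        return None
    return int(size) / EMU_PER_POINT
```

For instance, a stored size of 152400 EMU corresponds to a 12 pt font.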

Storing the extracted text in a SQLite database:

import sqlite3

conn = sqlite3.connect("data.db")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS extracted_text (id INTEGER PRIMARY KEY, content TEXT)")
# all_text is the list of paragraph texts collected in section 2
for text in all_text:
    cursor.execute("INSERT INTO extracted_text (content) VALUES (?)", (text,))
conn.commit()
conn.close()
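
For larger documents, the inserts can be batched and wrapped in a transaction. This is a minimal sketch under the same table layout as above; the function name store_paragraphs and the choice to skip blank paragraphs are assumptions for illustration.

```python
import sqlite3


def store_paragraphs(conn, texts):
    """Insert non-empty paragraph texts into extracted_text and return the row count."""
    rows = [(t,) for t in texts if t.strip()]
    # Using the connection as a context manager commits on success
    # and rolls back if an exception is raised
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS extracted_text "
            "(id INTEGER PRIMARY KEY, content TEXT)"
        )
        conn.executemany("INSERT INTO extracted_text (content) VALUES (?)", rows)
    return len(rows)
```

A typical call would be store_paragraphs(sqlite3.connect("data.db"), all_text), followed by closing the connection.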
