Ten Practical Python Techniques: Web Scraping, Automation, and Data Processing | 极客日志

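Summary (AI-generated): Ten practical Python techniques for web scraping, data analysis, file management, and image processing. It covers scraping web pages, handling Excel tables, batch-downloading images, plotting charts, generating word clouds, renaming files, and watermarking, adjusting the saturation of, and resizing images. The article provides code examples, together with notes on environment setup, exception handling, and best practices, to help developers automate office work and data processing with Python.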

Published 2025/2/7. Updated June 2026.

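1. Introduction to Python

1.1 What is Python?

Python is a high-level, interpreted, interactive, object-oriented scripting language. Its design philosophy emphasizes readable code and a concise syntax.

Interpreted: there is no separate compile step during development; as with PHP and Perl, you run the source directly (CPython compiles it to bytecode behind the scenes).
Interactive: you can type code at the >>> prompt and run it immediately, which is handy for quick tests.
Object-oriented: supports encapsulation, inheritance, and polymorphism.
Beginner-friendly: the syntax reads close to English keywords, which suits new programmers, and the language is used for everything from text processing to web development and games.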

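1.2 History

Guido van Rossum began designing Python in late 1989 at the Dutch national research institute for mathematics and computer science (CWI). It draws on features of ABC, Modula-3, C, C++, Algol-68, Smalltalk, and other languages. Python is released under the Python Software Foundation License, an open-source license that is compatible with the GPL but is not the GPL itself, and it is maintained today by a core development team.

Python 2.0 (2000): added a cycle-detecting garbage collector and Unicode support.
Python 3.0 (2008): a major revision that is not fully backward compatible with earlier versions; often called Py3k.
Python 2.7: the last Python 2.x release; it backports some 3.x syntax and features.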

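1.3 Key features

Easy to learn: few keywords, a clear structure, and a well-defined syntax.
Easy to read and maintain: indentation is part of the syntax, so the layout of the code follows its logic.
Rich standard library: largely cross-platform (UNIX, Windows, macOS).
Extensible: performance-critical code can be written in C/C++ and called from Python.
Database support: interfaces exist for the major commercial databases.
GUI programming: supports building cross-platform graphical user interfaces.
Embeddable: Python can be embedded in other programs (for example C/C++ applications) to give their users scripting capabilities.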

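2. Typical Python use cases

Web scraping: collect site data, tables, and study materials.
Data analysis and visualization: generate charts and process Excel/CSV data.
Office automation: batch-rename files, process images, automate workflows.
Web development: with frameworks such as Flask, Django, and Bottle.
Scientific computing: numerical work with NumPy and SciPy.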

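3. Worked examples

3.1 Scraping documents and news

Scenario: collect article titles and links from a given site.
Target: fetch data from https://zkaoy.com/sions/exam.
Dependencies: urllib3, bs4

Step 1: Download the page and save it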

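import urllib3

def download_content(url):
    """Download a page and return its text, or None on failure."""
    http = urllib3.PoolManager()
    try:
        response = http.request("GET", url)
        return response.data.decode("utf-8")
    except Exception as e:
        print(f"Download failed: {e}")
        return None

def save_to_file(filename, content):
    """Write text to a file as UTF-8; do nothing if there is no content."""
    if not content:
        return
    with open(filename, "w", encoding="utf-8") as fo:
        fo.write(content)

url = "https://zkaoy.com/sions/exam"
result = download_content(url)
save_to_file("tips1.html", result)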
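
download_content decodes every response as UTF-8, which garbles or fails on pages served in another encoding such as GBK. Below is a minimal sketch of one way to handle that, assuming the server declares its charset in the Content-Type header. decode_body is a hypothetical helper, not part of the original script, and falling back to UTF-8 with replacement characters is an assumption about what is acceptable.

```python
from email.message import Message


def decode_body(data, content_type=None, default="utf-8"):
    """Decode response bytes using the charset named in a Content-Type header.

    Hypothetical helper: falls back to `default` when no charset is declared
    and replaces undecodable bytes instead of raising.
    """
    charset = None
    if content_type:
        # Message provides a header parser for values like "text/html; charset=gbk"
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
    try:
        return data.decode(charset or default, errors="replace")
    except LookupError:
        # The declared charset name is unknown to Python: use the default
        return data.decode(default, errors="replace")
```

With urllib3 the header value should be available as response.headers.get("Content-Type"), so the return line in download_content could become decode_body(response.data, response.headers.get("Content-Type")).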

Step 2: Parse the HTML and extract data

from bs4 import BeautifulSoup

def create_doc_from_filename(filename):
    """Read an HTML file and return a BeautifulSoup object."""
    with open(filename, "r", encoding="utf-8") as fo:
        html_content = fo.read()
    return BeautifulSoup(html_content, "html.parser")

doc = create_doc_from_filename("tips1.html")
# Each article sits in a <div class="post-info">; the code takes its second <a> as the title link
post_list = doc.find_all("div", class_="post-info")
for post in post_list:
    links = post.find_all("a")
    if len(links) > 1:
        link_tag = links[1]
        title = link_tag.text.strip()
        href = link_tag.get("href")
        print(f"Title: {title}, link: {href}")
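
The href values extracted above may be relative (for example /news/1.html), in which case they cannot be fetched as they stand. A small sketch that resolves them against the page URL with the standard library's urljoin; absolutize is a hypothetical helper and the sample paths are made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://zkaoy.com/sions/exam"  # the page the links were scraped from


def absolutize(href, base=PAGE_URL):
    """Return an absolute URL for href, or None if href is missing."""
    if not href:
        return None
    return urljoin(base, href)


# Illustrative inputs only; these paths are not taken from the real site.
for sample in ["/news/1.html", "detail?id=2", "https://example.com/a"]:
    print(absolutize(sample))
```

Links that are already absolute pass through unchanged, so the helper can be applied to every href without checking first.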

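3.2 Scraping tables for data analysis

from io import StringIO

import urllib3
import pandas as pd

def download_html(url):
    """Fetch a page and return its body decoded as UTF-8."""
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    return response.data.decode("utf-8")

url = "http://fx.cmbchina.com/Hq/"
html_content = download_html(url)

# Read every table on the page. Recent pandas versions expect a file-like
# object rather than a raw HTML string, hence StringIO.
table_list = pd.read_html(StringIO(html_content))

# Assume the table we want is the second one (index 1)
if len(table_list) > 1:
    df = table_list[1]
    # Save as Excel (needs the openpyxl package)
    df.to_excel("tips2.xlsx", index=False)
    print("Table saved to tips2.xlsx")
else:
    print("No suitable table found")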
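
To make clear what table extraction involves conceptually, here is a standard-library-only sketch that collects the cell text of every table row. It is not how pandas implements read_html, and it ignores nested tables, colspan, and rowspan. TableRows is a hypothetical class name and the sample HTML is invented.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of <td>/<th> cells, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented sample, not the real page.
sample = (
    "<table><tr><th>Currency</th><th>Rate</th></tr>"
    "<tr><td>USD</td><td> 7.1 </td></tr></table>"
)
parser = TableRows()
parser.feed(sample)
print(parser.rows)
```

For real pages pd.read_html remains the practical choice, since it also handles headers, spans, and type inference.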

3.3 Batch-downloading images

import os
from urllib.request import urlretrieve

# Reuses download_content, save_to_file, and create_doc_from_filename from section 3.1

def download_images(url, output_dir):
    """Download every <img> with an absolute URL on the page into output_dir."""
    os.makedirs(output_dir, exist_ok=True)

    html = download_content(url)
    if not html:
        return

    save_to_file("temp_page.html", html)
    doc = create_doc_from_filename("temp_page.html")
    images = doc.find_all("img")

    count = 0
    for img in images:
        src = img.get("src")
        if src and src.startswith("http"):
            filename = src.split("/")[-1] or f"image_{count}.jpg"
            try:
                urlretrieve(src, os.path.join(output_dir, filename))
                count += 1
            except Exception as e:
                print(f"Failed to download {filename}: {e}")
    print(f"Downloaded {count} images")

url = "https://www.duitang.com/search/?kw=ins%E9%A3%8E%E8%83%8C%E6%99%AF%E5%9B%BE&type=feed"
download_images(url, "tips_3")
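
Taking src.split("/")[-1] as the file name keeps any query string and can give two different images the same name. A sketch of a more defensive naming helper that uses only the standard library; filename_from_url and its index-prefix scheme are assumptions, not part of the original script, and it does not strip characters that are illegal on a particular file system.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(src, index, default_ext=".jpg"):
    """Build a local file name from an image URL.

    Uses only the URL path (no query string) and prefixes the running index
    so two images with the same basename do not overwrite each other.
    """
    path = unquote(urlparse(src).path)
    stem, ext = posixpath.splitext(posixpath.basename(path))
    if not stem:
        stem = "image"
    if not ext:
        ext = default_ext
    return f"{index:04d}_{stem}{ext}"
```

Inside download_images this could replace the filename line, with the position of the image in the list passed as index.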

3.4 Data visualization

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Apply the seaborn theme first: sns.set() resets matplotlib's font settings,
# so the font overrides below have to come after it.
sns.set()
plt.rcParams["font.sans-serif"] = ["SimHei"]  # a font with Chinese glyphs, so Chinese labels render
plt.rcParams["axes.unicode_minus"] = False    # render the minus sign correctly

# Load the data
df = pd.read_excel("tips2.xlsx")
# Filter out the header row
df = df[df.index > 0]

# Make sure the numeric column really is numeric
df[3] = pd.to_numeric(df[3], errors="coerce")

plt.figure(figsize=(10, 5))

# Plot
plt.bar(x=df[0], height=df[3].values, color="#1f77b4")
plt.xlabel("Category", fontsize=14)
plt.ylabel("Selling rate", fontsize=14)
plt.title("Foreign exchange rates", fontsize=14)
plt.xticks(rotation=90)
plt.show()

3.5 Text analysis and word clouds

import jieba
import wordcloud
import matplotlib.pyplot as plt

# Read the text
with open("news_title.txt", "r", encoding="utf-8") as fo:
    text = fo.read()

# Segment the Chinese text into words and join them with spaces for WordCloud
split_list = jieba.lcut(text)
final_text = " ".join(split_list)

# Stop words to exclude
stopwords = {"的", "是", "了", "在", "我"}

# Build the word cloud
wc = wordcloud.WordCloud(
    font_path="/System/Library/Fonts/PingFang.ttc",  # macOS path; on Windows point this at an installed font file
    width=1000,
    height=700,
    background_color="white",
    max_words=100,
    stopwords=stopwords,
)
wc.generate(final_text)

plt.imshow(wc)
plt.axis("off")
plt.show()
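
A word cloud sizes each word by how often it occurs. The counting step can be sketched with collections.Counter; the token list below stands in for the output of jieba.lcut and is made up for illustration, and word_frequencies is a hypothetical helper.

```python
from collections import Counter


def word_frequencies(tokens, stopwords, top_n=100):
    """Count tokens, skipping stop words and whitespace-only tokens."""
    counts = Counter(
        tok for tok in tokens
        if tok.strip() and tok not in stopwords
    )
    return counts.most_common(top_n)


# Stand-in for jieba.lcut(text); illustrative data only.
tokens = ["考研", "的", "英语", "考研", "是", "数学", "英语", "考研"]
stopwords = {"的", "是", "了", "在", "我"}
frequencies = word_frequencies(tokens, stopwords)
print(frequencies)
```

WordCloud also has a generate_from_frequencies method that accepts a word-to-count mapping, so dict(frequencies) could be passed to it when you want control over the counting.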

3.6 Batch file management

import os

def batch_rename_files(folder_path, prefix="image_"):
    """Rename the .jpg/.png files in a folder to <prefix><index><extension>."""
    files = sorted(os.listdir(folder_path))
    idx = 0
    for item in files:
        ext = os.path.splitext(item)[1].lower()
        if ext in (".jpg", ".png"):
            old_path = os.path.join(folder_path, item)
            # Keep the original extension so a PNG is not mislabeled as a JPEG
            new_name = f"{prefix}{idx}{ext}"
            new_path = os.path.join(folder_path, new_name)
            try:
                os.rename(old_path, new_path)
                idx += 1
            except Exception as e:
                print(f"Failed to rename {item}: {e}")

batch_rename_files("tips_3")
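
Because a batch rename is hard to undo, it can help to compute the old-to-new mapping first and inspect it before touching any file. A sketch of such a dry run with pathlib; plan_renames is a hypothetical helper and the demo runs in a temporary directory with invented file names.

```python
import tempfile
from pathlib import Path


def plan_renames(folder, prefix="image_", exts=(".jpg", ".png")):
    """Return (old_name, new_name) pairs without renaming anything."""
    plan = []
    idx = 0
    for path in sorted(Path(folder).iterdir()):
        ext = path.suffix.lower()
        if path.is_file() and ext in exts:
            plan.append((path.name, f"{prefix}{idx}{ext}"))
            idx += 1
    return plan


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.PNG", "a.jpg", "notes.txt"]:
        (Path(tmp) / name).touch()
    demo_plan = plan_renames(tmp)
    print(demo_plan)
```

Once the plan looks right, each pair can be applied with os.rename, ideally after checking that no new name already exists in the folder.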

3.7 Batch image processing

Adding a watermark

import os

import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def add_watermark(input_folder, output_folder):
    """Draw a text watermark on every .jpg/.png in input_folder."""
    os.makedirs(output_folder, exist_ok=True)

    for file in os.listdir(input_folder):
        if file.lower().endswith((".jpg", ".png")):
            img_path = os.path.join(input_folder, file)
            img = cv2.imread(img_path)
            if img is not None:
                # OpenCV loads images as BGR; Pillow draws on RGB
                img_pil = Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
                draw = ImageDraw.Draw(img_pil)

                # Font: macOS path; fall back to Pillow's built-in font if it is missing
                try:
                    font = ImageFont.truetype("/System/Library/Fonts/PingFang.ttc", 50)
                except OSError:
                    font = ImageFont.load_default()

                draw.text((100, 100), "Copyright", fill=(168, 121, 103), font=font)

                # Convert back to BGR before writing with OpenCV
                img_out = cv2.cvtColor(np.asarray(img_pil), cv2.COLOR_RGB2BGR)
                out_path = os.path.join(output_folder, file)
                cv2.imwrite(out_path, img_out)

Adjusting saturation

def adjust_saturation(input_folder, output_folder, factor=1.5):
    """Scale the saturation of every .jpg/.png in input_folder by factor."""
    os.makedirs(output_folder, exist_ok=True)

    for file in os.listdir(input_folder):
        if file.lower().endswith((".jpg", ".png")):
            img_path = os.path.join(input_folder, file)
            # Load as 3-channel BGR; BGR2HSV does not accept alpha or grayscale input
            pic = cv2.imread(img_path)
            if pic is not None:
                hsv = cv2.cvtColor(pic, cv2.COLOR_BGR2HSV)
                h, s, v = cv2.split(hsv)
                # Clip before converting back to uint8 so large values do not wrap around
                s_new = np.clip(s.astype(np.float32) * factor, 0, 255).astype(np.uint8)
                merged = cv2.merge([h, s_new, v])
                res = cv2.cvtColor(merged, cv2.COLOR_HSV2BGR)
                cv2.imwrite(os.path.join(output_folder, file), res)
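
The per-pixel arithmetic behind adjust_saturation can be shown with the standard library's colorsys module: convert RGB to HSV, scale S, clamp it to the valid range, and convert back. This is only a one-pixel illustration with a hypothetical helper name; OpenCV applies the same idea to whole arrays and, for 8-bit images, stores S as 0 to 255 rather than 0 to 1.

```python
import colorsys


def scale_pixel_saturation(rgb, factor):
    """Scale the saturation of one (R, G, B) pixel with 0-255 channels."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    # Clamp so an over-large factor saturates fully instead of overflowing
    s = min(1.0, max(0.0, s * factor))
    r2, g2, b2 = colorsys.hsv_to_rgb(h, s, v)
    return tuple(round(c * 255) for c in (r2, g2, b2))


print(scale_pixel_saturation((200, 100, 100), 1.5))
print(scale_pixel_saturation((200, 100, 100), 3.0))
```

The second call shows why the clamp matters: without it the scaled saturation would exceed the valid range, which is the same failure that an unclipped cast to uint8 produces in the array version.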

Resizing

def resize_images(input_folder, output_folder, scale=0.5):
    """Resize every .jpg/.png in input_folder by scale, keeping the aspect ratio."""
    os.makedirs(output_folder, exist_ok=True)

    for file in os.listdir(input_folder):
        if file.lower().endswith((".jpg", ".png")):
            img_path = os.path.join(input_folder, file)
            pic = cv2.imread(img_path)
            if pic is not None:
                h, w = pic.shape[:2]
                # Never let a dimension round down to zero
                new_w, new_h = max(1, int(w * scale)), max(1, int(h * scale))
                resized = cv2.resize(pic, (new_w, new_h))
                cv2.imwrite(os.path.join(output_folder, file), resized)
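
resize_images applies one scale factor to every image. When the goal is instead to cap the longer side at a fixed size, the target dimensions can be computed first and then passed to cv2.resize. A sketch of that calculation; fit_within is a hypothetical helper, not part of the original script.

```python
def fit_within(width, height, max_side):
    """Return (new_width, new_height) with the longer side capped at max_side.

    The aspect ratio is preserved, and images that are already small enough
    are returned unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000, 800))
print(fit_within(300, 200, 800))
```

Note that cv2.resize takes its size argument as (width, height), while pic.shape reports (height, width), so the order has to be swapped when wiring the two together.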

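4. Development environment and best practices

Virtual environments: use venv or conda to isolate each project's dependencies and avoid library version conflicts.

python -m venv myenv
source myenv/bin/activate  # Linux/macOS
myenv\Scripts\activate     # Windows (run this instead of the line above)
pip install -r requirements.txt

Exception handling: wrap network requests and file operations in try-except blocks so that a foreseeable error does not crash the program.
Encoding: use UTF-8 consistently for text files, especially when file names or contents include Chinese.
Resource cleanup: delete temporary downloads once you are done with them so they do not take up disk space.
Scraping etiquette: when building a crawler, follow the target site's robots.txt and throttle your request rate so you do not put pressure on the server.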
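
The last point can be sketched with the standard library: urllib.robotparser answers whether a URL may be fetched, and a small throttle enforces a minimum delay between requests. The robots.txt rules, the URLs, and the Throttle class are all illustrative assumptions; a real crawler would download the target site's own robots.txt and perform the request where the sketch only prints.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_lines):
    """Parse robots.txt content that has already been fetched."""
    rules = RobotFileParser()
    rules.parse(robots_lines)
    return rules


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Illustrative rules and URLs only.
rules = build_robot_rules(["User-agent: *", "Disallow: /private/"])
throttle = Throttle(min_interval=0.2)

for url in ["https://example.com/news/1", "https://example.com/private/x"]:
    if not rules.can_fetch("*", url):
        print(f"Skipping (disallowed by robots.txt): {url}")
        continue
    throttle.wait()
    print(f"Would fetch: {url}")
```

A fixed interval is the simplest policy; the right value depends on the site, and some sites publish a preferred delay in robots.txt.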
