Python Web Scraping in Practice: 7 Common Everyday Cases Explained | 极客日志



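Summary (AI-generated): Python web scraping uses libraries such as requests, BeautifulSoup, and lxml to fetch and parse web page data. The article covers seven typical cases (Douban movies, Maoyan, a university list, weather, books, jokes, and Weibo) and demonstrates several parsing approaches, including regular expressions, XPath, and Selenium. It includes complete code examples, environment setup notes, and strategies for dealing with anti-scraping measures, and is aimed at beginners. It highlights how static and dynamic pages need different handling, and why respecting robots.txt and protecting data matter.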

Author: 魔尊. Published 2025/2/6, updated 2026/6/3.


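Preface

A web crawler is a tool for automatically collecting data from the internet. It is widely used for data analysis, information monitoring, and competitor research. With its concise syntax and rich set of third-party libraries (such as requests, BeautifulSoup, lxml, and Selenium), Python is a common first choice for writing crawlers.

This article walks through 7 classic Python scraping cases, covering static page parsing (regular expressions, XPath, BeautifulSoup) and dynamic page rendering (Selenium). Each case comes with complete code, an explanation of the logic, and caveats, so that readers can build up scraping skills systematically.

Environment Setup

Before starting, make sure the following dependencies are installed: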

pip install requests beautifulsoup4 lxml selenium

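It is also advisable to configure a proxy IP and a User-Agent header, since some sites apply anti-scraping measures to requests that do not look like a browser.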
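As a rough illustration of the proxy and User-Agent advice, the sketch below builds the keyword arguments that `requests.get()` accepts (`headers`, `proxies`, `timeout`). The helper name `build_request_kwargs`, the `USER_AGENTS` pool, and the `SCRAPER_PROXY` environment variable are assumptions made for this example, not part of the original article.

```python
import os
import random

# Hypothetical pool of User-Agent strings; replace with values you maintain.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15',
]


def build_request_kwargs(proxy_url=None, timeout=10):
    """Return keyword arguments that can be passed to requests.get()."""
    kwargs = {
        # Pick a User-Agent at random so requests do not all look identical
        'headers': {'User-Agent': random.choice(USER_AGENTS)},
        'timeout': timeout,
    }
    if proxy_url:
        # requests expects a mapping from URL scheme to proxy URL
        kwargs['proxies'] = {'http': proxy_url, 'https': proxy_url}
    return kwargs


if __name__ == '__main__':
    # SCRAPER_PROXY is an assumed environment variable name
    print(build_request_kwargs(os.environ.get('SCRAPER_PROXY')))
```

With this helper, a request would look like `requests.get(url, **build_request_kwargs(proxy_url))`. Keeping the proxy address in the environment rather than in the script makes it easier to switch proxies without editing code.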

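1. Scraping the Douban Movie Top 250

Goal: fetch the title, rating, and number of ratings for each film in the Douban Top 250 and save them to a CSV file.

Techniques: requests + BeautifulSoup + CSV writing.

Code example: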

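import requests
from bs4 import BeautifulSoup
import csv
import time

# Request URL
url = 'https://movie.douban.com/top250'

# Request headers that mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Parse one page and yield one record per film
def parse_html(html):
    soup = BeautifulSoup(html, 'lxml')
    movie_list = soup.find('ol', class_='grid_view').find_all('li')
    for movie in movie_list:
        title = movie.find('div', class_='hd').find('span', class_='title').get_text().strip()
        rating_num = movie.find('div', class_='star').find('span', class_='rating_num').get_text().strip()
        # The string arguments on this line were lost in the source; the last
        # <span> inside div.star is assumed to hold the number of ratings.
        comment_num = movie.find('div', class_='star').find_all('span')[-1].get_text().strip()
        yield {'title': title, 'rating': rating_num, 'votes': comment_num}

# Save the data to CSV. The file name, field names, page count, page URL
# format, and sleep interval were lost in the source and are assumed here.
def save_data():
    with open('douban_top250.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'rating', 'votes'])
        writer.writeheader()
        for i in range(10):  # assumed: 10 pages of 25 films each
            page_url = f'{url}?start={i * 25}'
            response = requests.get(page_url, headers=headers)
            if response.status_code == 200:
                for item in parse_html(response.text):
                    writer.writerow({'title': item['title'], 'rating': item['rating'], 'votes': item['votes']})
                    print(item)
            time.sleep(1)  # pause between pages

if __name__ == '__main__':
    save_data()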


2. Scraping the Maoyan Top 100

import requests
import re
import time

url = 'https://maoyan.com/board/4'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Parse one page
def parse_html(html):
    # A raw string r'' keeps regex backslashes intact, so \d+ is written with a
    # single backslash; the literal braces are escaped to make the intent explicit
    pattern = re.compile(r'<p class="name"><a href=".*?" title="(.*?)" data-act="boarditem-click" data-val="\{movieId:\d+\}">(.*?)</a></p>.*?<p class="star">(.*?)</p>.*?<p class="releasetime">(.*?)</p>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'title': item[1],
            'cast': item[2].strip(),
            'release_date': item[3]
        }

# Save the data
def save_data():
    with open('maoyan_top100.txt', 'w', encoding='utf-8') as f:
        for i in range(10):
            page_url = f'{url}?offset={i*10}'
            response = requests.get(page_url, headers=headers)
            if response.status_code == 200:
                for item in parse_html(response.text):
                    f.write(str(item) + '\n')
            time.sleep(1)

if __name__ == '__main__':
    save_data()
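The regular expression above depends on two details: non-greedy groups `(.*?)` and the `re.S` flag, which lets `.` match newlines so that one pattern can span several lines of HTML. The sketch below shows both on an offline sample. `SAMPLE_HTML` is a made-up snippet shaped roughly like the markup the pattern expects, and the simplified `PATTERN` and the `extract` helper are assumptions for illustration only.

```python
import re

# Hypothetical snippet shaped like the markup the pattern above expects.
SAMPLE_HTML = '''
<dd>
  <p class="name"><a href="/films/1" title="Film A">Film A</a></p>
  <p class="star">
    Starring: Actor X, Actor Y
  </p>
  <p class="releasetime">Released: 1993-01-01</p>
</dd>
<dd>
  <p class="name"><a href="/films/2" title="Film B">Film B</a></p>
  <p class="star">
    Starring: Actor Z
  </p>
  <p class="releasetime">Released: 1994-09-10</p>
</dd>
'''

# Non-greedy groups stop at the first closing tag; re.S lets '.' cross newlines.
PATTERN = re.compile(
    r'<p class="name"><a href=".*?" title="(.*?)">.*?</a></p>'
    r'.*?<p class="star">(.*?)</p>'
    r'.*?<p class="releasetime">(.*?)</p>',
    re.S,
)


def extract(html):
    """Return one dict per film found in the HTML."""
    return [
        {'title': title, 'star': star.strip(), 'release': release}
        for title, star, release in PATTERN.findall(html)
    ]


if __name__ == '__main__':
    for row in extract(SAMPLE_HTML):
        print(row)
```

Compiling the same pattern without `re.S` finds nothing in this sample, because the fields sit on separate lines. Testing a pattern against a saved snippet like this is usually quicker than re-fetching the live page after every change.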

3. Scraping the National University List

import requests
import re
import time

url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def parse_html(html):
    # Match the contents of each table row
    pattern = re.compile(r'<tr class="alt">.*?<td>(.*?)</td>.*?<td><div align="left">.*?<a href="(.*?)" target="_blank">(.*?)</a></div></td>.*?<td>(.*?)</td>.*?<td>(.*?)</td>.*?</tr>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        # item[1] is the link to the school page and is not saved
        yield {
            'rank': item[0],
            'name': item[2],
            'province': item[3],
            'score': item[4]
        }

def save_data():
    with open('university_top100.txt', 'w', encoding='utf-8') as f:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            for item in parse_html(response.text):
                f.write(str(item) + '\n')

if __name__ == '__main__':
    save_data()

4. Scraping City Weather from China Weather (weather.com.cn)

import requests
from lxml import etree
import csv
import time

url = 'http://www.weather.com.cn/weather1d/101010100.shtml'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def parse_html(html):
    selector = etree.HTML(html)
    try:
        city = selector.xpath('//*[@id="around"]/div/div[1]/div[1]/h1/text()')[0]
        temperature = selector.xpath('//*[@id="around"]/div/div[1]/div[1]/p/i/text()')[0]
        weather = selector.xpath('//*[@id="around"]/div/div[1]/div[1]/p/@title')[0]
        wind = selector.xpath('//*[@id="around"]/div/div[1]/div[1]/p/span/text()')[0]
        return city, temperature, weather, wind
    except IndexError:
        # An XPath matched nothing, for example because the page layout changed
        return None, None, None, None

def save_data():
    with open('beijing_weather.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['City', 'Temperature', 'Weather', 'Wind'])
        for i in range(3):  # Demo loop; in practice this could run as a scheduled job
            response = requests.get(url, headers=headers)
            city, temperature, weather, wind = parse_html(response.text)
            if city:
                writer.writerow([city, temperature, weather, wind])
            time.sleep(1)

if __name__ == '__main__':
    save_data()

5. Scraping Book Information from Dangdang

import requests
from lxml import etree
import csv
import time

url = 'http://search.dangdang.com/?key=Python&act=input'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def parse_html(html):
    selector = etree.HTML(html)
    book_list = selector.xpath('//*[@id="search_nature_rg"]/ul/li')
    for book in book_list:
        try:
            title = book.xpath('a/@title')[0]
            link = book.xpath('a/@href')[0]
            price = book.xpath('p[@class="price"]/span[@class="search_now_price"]/text()')[0]
            author = book.xpath('p[@class="search_book_author"]/span[1]/a/@title')[0]
            publish_date = book.xpath('p[@class="search_book_author"]/span[2]/text()')[0]
            publisher = book.xpath('p[@class="search_book_author"]/span[3]/a/@title')[0]
            yield {
                'title': title,
                'link': link,
                'price': price,
                'author': author,
                'publish_date': publish_date,
                'publisher': publisher
            }
        except IndexError:
            # Skip entries that are missing any of the fields
            continue

def save_data():
    with open('dangdang_books.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'link', 'price', 'author', 'publish_date', 'publisher'])
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            for item in parse_html(response.text):
                # Dict insertion order matches the header row above
                writer.writerow(item.values())

if __name__ == '__main__':
    save_data()

6. Scraping Jokes from Qiushibaike

import requests
from lxml import etree
import time

url = 'https://www.qiushibaike.com/text/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def parse_html(html):
    selector = etree.HTML(html)
    content_list = selector.xpath('//div[@class="content"]/span/text()')
    for content in content_list:
        yield content.strip()

def save_data():
    with open('qiushibaike_jokes.txt', 'w', encoding='utf-8') as f:
        for i in range(3):
            page_url = f'https://www.qiushibaike.com/text/page/{i+1}/'
            response = requests.get(page_url, headers=headers)
            if response.status_code == 200:
                for content in parse_html(response.text):
                    f.write(content + '\n')
            time.sleep(1)

if __name__ == '__main__':
    save_data()

7. Scraping Sina Weibo (Selenium, Dynamic Loading)

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
import requests

url = 'https://weibo.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def parse_html(html):
    print(html[:500])  # Print only the first 500 characters for debugging

def save_data():
    with open('weibo.txt', 'w', encoding='utf-8') as f:
        # Start the browser driver
        browser = webdriver.Chrome()
        browser.get(url)

        # Wait for the page to load
        time.sleep(10)

        # Simulated login (never hard-code account credentials in real use)
        try:
            username_input = browser.find_element(By.NAME, 'username')
            password_input = browser.find_element(By.NAME, 'password')
            login_btn = browser.find_element(By.CLASS_NAME, 'W_btn_a')

            # Demo only: in real use, manage secrets through environment variables or a config file
            # username_input.send_keys('your_username')
            # password_input.send_keys('your_password')
            # login_btn.click()
            # time.sleep(10)

            # Collect the cookies after login. Selenium returns a list of dicts,
            # while requests expects a name -> value mapping, so convert them.
            cookies = {c['name']: c['value'] for c in browser.get_cookies()}

            # Reuse the cookies in a plain HTTP request
            response = requests.get(url, headers=headers, cookies=cookies)
            parse_html(response.text)
            f.write(response.text)
        except Exception as e:
            print(f'Login failed: {e}')
        finally:
            browser.quit()  # quit() ends the whole driver session, not just the window

if __name__ == '__main__':
    save_data()

Scraping Best Practices and Caveats

1. Respect robots.txt
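One way to check robots.txt rules before fetching a URL is the standard library's `urllib.robotparser`. The sketch below parses a made-up robots.txt held in memory so that it runs offline. The `ROBOTS_TXT` content, the `my-crawler` agent name, the `build_parser` helper, and the example.com URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed in memory so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Build a RobotFileParser from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether a given user agent may fetch each URL
print(parser.can_fetch('my-crawler', 'https://example.com/public/page'))
print(parser.can_fetch('my-crawler', 'https://example.com/private/page'))

# Crawl-delay, if the site declares one, suggests a minimum pause in seconds
print(parser.crawl_delay('my-crawler'))
```

Against a live site, the usual pattern is `parser.set_url('https://example.com/robots.txt')` followed by `parser.read()`, after which `can_fetch()` can gate every request the crawler makes.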

2. Set a Reasonable Request Interval
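The examples in this article pause with a fixed `time.sleep(1)` between pages. A small variation, sketched below, adds random jitter so that requests are not spaced at perfectly regular intervals. The `polite_sleep` name and its default values are assumptions chosen for this example.

```python
import random
import time


def polite_sleep(base=1.0, jitter=0.5, sleep=time.sleep):
    """Pause for `base` seconds plus a random extra of up to `jitter` seconds.

    The `sleep` parameter exists so tests can pass a stand-in and avoid waiting.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay


if __name__ == '__main__':
    for page in range(3):
        waited = polite_sleep(base=0.1, jitter=0.1)
        print(f'page {page}: waited {waited:.2f}s')
```

In the scraping loops above, `polite_sleep()` could stand in for `time.sleep(1)`. If the site publishes a Crawl-delay in robots.txt, that value is a sensible lower bound for `base`.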

3. Handle Anti-Scraping Mechanisms
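Sites that throttle crawlers often answer with status codes such as 403, 429, or 503, sometimes with a `Retry-After` header. The sketch below turns such a response into a suggested wait time. The `backoff_hint` function, the set of status codes it checks, and the 30-second default are assumptions for illustration, not a rule that every site follows.

```python
def backoff_hint(status_code, headers, default=30.0):
    """Return seconds to wait if the response looks throttled, else None.

    `headers` is any mapping with a .get() method, such as response.headers
    from requests. Only the numeric form of Retry-After is handled here;
    a date-valued header falls back to `default`.
    """
    if status_code not in (403, 429, 503):
        return None
    value = headers.get('Retry-After', '')
    return float(value) if value.isdigit() else default


if __name__ == '__main__':
    print(backoff_hint(200, {}))
    print(backoff_hint(429, {'Retry-After': '5'}))
    print(backoff_hint(403, {}))
```

A scraping loop could call `backoff_hint(response.status_code, response.headers)` after each request and sleep for the returned number of seconds before retrying, rather than continuing at full speed once the site has started refusing requests.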

4. Data Security and Privacy
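The Weibo example notes that account credentials should not be hard-coded and should come from environment variables or a config file instead. A minimal sketch of the environment-variable approach follows. The variable names `WEIBO_USERNAME` and `WEIBO_PASSWORD` and the `load_credentials` helper are assumptions made for this example.

```python
import os


def load_credentials(env=os.environ):
    """Read login details from the environment instead of the source file.

    `env` defaults to the real environment; tests can pass a plain dict.
    """
    username = env.get('WEIBO_USERNAME')
    password = env.get('WEIBO_PASSWORD')
    if not username or not password:
        raise RuntimeError('Set WEIBO_USERNAME and WEIBO_PASSWORD before running')
    return username, password


if __name__ == '__main__':
    try:
        user, _ = load_credentials()
        print(f'Loaded credentials for {user}')
    except RuntimeError as exc:
        print(exc)
```

The returned values could then be passed to `send_keys()` in the Selenium script. Failing early with a clear message is usually easier to debug than a login that silently submits empty fields.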

5. Exception Handling
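Network requests fail for transient reasons such as timeouts and dropped connections, so a retry with growing pauses is a common pattern. The sketch below is a generic helper under that assumption. The `with_retries` name, its parameters, and the doubling delay schedule are illustrative choices, not something the original examples include.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds and try again.

    Only the exception types listed in `exceptions` trigger a retry.
    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


if __name__ == '__main__':
    state = {'calls': 0}

    def flaky():
        # Fails twice, then succeeds, to imitate a transient network error
        state['calls'] += 1
        if state['calls'] < 3:
            raise ConnectionError('temporary failure')
        return 'ok'

    print(with_retries(flaky, base_delay=0.1))
```

With requests, a call might look like `with_retries(lambda: requests.get(url, headers=headers, timeout=10), exceptions=(requests.RequestException,))`, which retries network errors while letting programming errors propagate immediately.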

Conclusion
