Advanced Python Web Scraping and Hands-On Scrapy
Advanced applications of Python web scraping, covering Scrapy framework basics, dynamic-page scraping (Selenium), anti-scraping strategies (proxy IPs, User-Agent rotation, cookie pools), and distributed crawling (Scrapy-Redis). Includes hands-on code examples for Douban movies and Taobao products, with walkthroughs of sending requests, parsing responses, storing data, and configuring middlewares.

Learning objectives: master advanced Python web-scraping techniques, including the Scrapy framework, distributed crawling, dynamic-page scraping, and anti-scraping strategies; learn to use the Scrapy and Selenium libraries; apply them in hands-on crawler projects.
Key topics: the Scrapy framework, distributed crawling, dynamic-page scraping, anti-scraping strategies, the Selenium library, hands-on web-scraping projects.
A web crawler is a program that automatically visits web pages and extracts information. Typical applications include data analysis, search engines, and content aggregation.
Scrapy is an open-source Python framework for crawling website data. It is built on asynchronous networking, ships with CSS and XPath selectors for parsing responses, and is extensible through downloader middlewares and item pipelines. Install it and scaffold a project with a first spider:
pip install scrapy
scrapy startproject myspider
cd myspider
scrapy genspider example example.com
# myspider/spiders/example.py
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Extract the page title from the response
        title = response.css("title::text").get()
        yield {"title": title}
Run the spider and export the scraped items to JSON:

scrapy crawl example -o output.json
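The -o flag is the quickest way to persist items, but Scrapy's idiomatic storage mechanism is an item pipeline. A minimal sketch of one that writes each item as a line of JSON (the class name and the items.jl file name are illustrative, not part of the original project):

# myspider/pipelines.py — minimal storage-pipeline sketch
import json

class JsonLinesPipeline:
    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Serialize each item as one JSON line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

# myspider/settings.py
ITEM_PIPELINES = {
    "myspider.pipelines.JsonLinesPipeline": 300,
}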
Some pages render their content with JavaScript, so the raw HTML that Scrapy downloads is incomplete. Selenium drives a real browser and sees the fully rendered page:

import time
from selenium import webdriver

# Launch Chrome (requires Chrome and a matching chromedriver)
driver = webdriver.Chrome()
# Visit the page
driver.get("https://example.com")
# Wait for the page to load
time.sleep(5)
# Read the page title; driver.title works even though <title> is not a visible element
title = driver.title
print(f"Title: {title}")
# Close the browser
driver.quit()
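A fixed time.sleep(5) either wastes time or fails when the page is slow. Selenium's explicit waits poll until a condition holds, up to a timeout; a sketch of the same flow using the standard WebDriverWait API (the h1 selector is an assumption about the target page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")
# Block until the <h1> element is present, waiting at most 10 seconds
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "h1"))
)
print(f"Title: {driver.title}")
driver.quit()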
Selenium can also be embedded in a Scrapy spider, so dynamic pages fit into the normal crawl flow:

# myspider/spiders/dynamic_spider.py
import time

import scrapy
from selenium import webdriver

class DynamicSpider(scrapy.Spider):
    name = "dynamic"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def parse(self, response):
        # Re-fetch the URL with Selenium so its JavaScript gets executed
        self.driver.get(response.url)
        time.sleep(5)
        title = self.driver.title
        yield {"title": title}

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes; release the browser
        self.driver.quit()
Sites often block IP addresses that send too many requests. A downloader middleware can route every request through a proxy:

# myspider/middlewares.py
class ProxyMiddleware:
    def process_request(self, request, spider):
        # Route the request through a proxy (replace with a real proxy address)
        request.meta["proxy"] = "http://127.0.0.1:8080"

Enable the middleware in the project settings:

# myspider/settings.py
DOWNLOADER_MIDDLEWARES = {
    "myspider.middlewares.ProxyMiddleware": 543,
}
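A single hardcoded proxy is still a single IP. The same middleware pattern extends naturally to a pool of proxies chosen at random per request; a sketch with placeholder addresses:

# myspider/middlewares.py — proxy-pool sketch; the addresses are placeholders
import random

class RotatingProxyMiddleware:
    proxies = [
        "http://127.0.0.1:8080",
        "http://127.0.0.1:8081",
        "http://127.0.0.1:8082",
    ]

    def process_request(self, request, spider):
        # Pick a different proxy for each outgoing request
        request.meta["proxy"] = random.choice(self.proxies)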
Rotating the User-Agent header makes requests look like they come from different browsers:

# myspider/middlewares.py
import random

class UserAgentMiddleware:
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for each outgoing request
        request.headers["User-Agent"] = random.choice(self.user_agents)

# myspider/settings.py
DOWNLOADER_MIDDLEWARES = {
    "myspider.middlewares.UserAgentMiddleware": 543,
}
A pool of cookies lets requests rotate between different sessions:

# myspider/middlewares.py
class CookiesMiddleware:
    # Each entry maps cookie names to values; Scrapy treats dict keys as
    # cookie names (random is already imported at the top of this file)
    cookies = [
        {"session": "123456"},
        {"session": "789012"},
    ]

    def process_request(self, request, spider):
        # Attach a randomly chosen session's cookies to the request
        request.cookies = random.choice(self.cookies)

# myspider/settings.py
DOWNLOADER_MIDDLEWARES = {
    "myspider.middlewares.CookiesMiddleware": 543,
}
When one machine is not enough, Scrapy-Redis turns a Scrapy project into a distributed crawler: all workers share one request queue and one duplicate filter stored in Redis.

pip install scrapy-redis

# myspider/settings.py
# Use the Scrapy-Redis scheduler (shared request queue)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Use the Scrapy-Redis duplicate filter (shared across workers)
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Store scraped items in Redis
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}
# Redis connection
REDIS_URL = "redis://127.0.0.1:6379/0"
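Scrapy-Redis can also persist its state between runs. One more standard setting keeps the queue and duplicate filter in Redis when the spider closes, so a crawl can be paused and resumed:

# myspider/settings.py
# Keep the Redis queue and dupe filter after the spider closes
SCHEDULER_PERSIST = True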
# myspider/spiders/distributed_spider.py
from scrapy_redis.spiders import RedisSpider

class DistributedSpider(RedisSpider):
    name = "distributed"
    allowed_domains = ["example.com"]
    # The spider pulls start URLs from this Redis list instead of start_urls
    redis_key = "distributed:start_urls"

    def parse(self, response):
        title = response.css("title::text").get()
        yield {"title": title}
# Start the Redis server
redis-server
# Start one or more spider instances (each can run on a different machine)
scrapy runspider myspider/spiders/distributed_spider.py
# Push a start URL into Redis; idle spiders pick work from the shared queue
redis-cli lpush distributed:start_urls https://example.com
Hands-on project 1: build a spider that crawls the Douban Movie Top 250 list, extracting each film's title, rating, director, cast, and year.
# myspider/spiders/douban_spider.py
import scrapy

class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        # Each .item block is one movie
        for movie in response.css(".item"):
            title = movie.css(".title::text").get()
            rating = movie.css(".rating_num::text").get()
            # The first <p> in .bd holds two text lines:
            # "Director ... Cast ..." and "year / country / genres"
            info = movie.css(".bd p:first-child::text").getall()
            director = info[0].strip() if info else None
            year = info[1].strip().split("/")[0].strip() if len(info) > 1 else None
            yield {
                "title": title,
                "rating": rating,
                "director": director,
                "year": year,
            }
        # Follow the pagination link to the next page
        next_page = response.css(".next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
scrapy crawl douban -o douban_top250.json
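In practice, Douban tends to reject requests carrying Scrapy's default User-Agent, so the project settings usually need a browser-like UA and a modest download delay. A sketch with illustrative values:

# myspider/settings.py
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
DOWNLOAD_DELAY = 1  # throttle requests so the crawl stays polite
ROBOTSTXT_OBEY = True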
Hands-on project 2: build a spider that scrapes Taobao product listings, including product name, price, sales volume, and reviews. Taobao renders its search results with JavaScript, so this spider drives Selenium.
# myspider/spiders/taobao_spider.py
import time

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By

class TaobaoSpider(scrapy.Spider):
    name = "taobao"
    allowed_domains = ["taobao.com"]
    start_urls = ["https://www.taobao.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def parse(self, response):
        # Load the Taobao home page in the real browser
        self.driver.get(response.url)
        time.sleep(5)
        # Type a keyword into the search box and submit the search form
        search_box = self.driver.find_element(By.CSS_SELECTOR, "#q")
        search_box.send_keys("Python")
        search_box.submit()
        time.sleep(5)
        # Extract name, price, and sales from each result card
        products = self.driver.find_elements(By.CSS_SELECTOR, ".item.J_MouserOnverReq")
        for product in products:
            name = product.find_element(By.CSS_SELECTOR, ".title").text
            price = product.find_element(By.CSS_SELECTOR, ".price").text
            sales = product.find_element(By.CSS_SELECTOR, ".deal-cnt").text
            yield {
                "name": name,
                "price": price,
                "sales": sales,
            }

    def closed(self, reason):
        self.driver.quit()
scrapy crawl taobao -o taobao_products.json
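Both Selenium-backed spiders open a visible browser window, which is awkward on a server. Chrome can run headless via Selenium's standard Options API; a minimal sketch:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Configure Chrome to run without a visible window
options = Options()
options.add_argument("--headless=new")  # newer Chrome; older versions use "--headless"
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)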
This article covered advanced Python web-scraping techniques: the Scrapy framework, distributed crawling with Scrapy-Redis, dynamic-page scraping with Selenium, and anti-scraping strategies such as proxy IPs, User-Agent rotation, and cookie pools; it then put them into practice by crawling the Douban Movie Top 250 and Taobao product listings.
