Python Web Scraping from Beginner to Practice: Requests, Scrapy, Async, and Anti-Scraping | 极客日志


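This article ties together the main thread of Python web scraping: starting with requests + BeautifulSoup, moving through parsing tools such as lxml, parsel, and PyQuery, then on to Scrapy project structure, dynamic rendering with Selenium/Playwright, asynchronous fetching with aiohttp and HTTPX, and storage and deduplication with MySQL, MongoDB, Redis, and Bloom filters. It also covers common anti-scraping techniques, distributed scheduling with Scrapy-Redis, and a complete Scrapy + MySQL case study. The overall conclusion is clear: use Requests for small tasks, Scrapy for structured projects, check for an API first on dynamic pages, and only consider async and distributed crawling once the crawl volume grows.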

Published by ArchDesign on 2026/6/30

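Preface

If this is your first time with web scraping, titles promising 'full stack', 'the complete guide', or 'everything in one article' can easily lead you astray. In practice it is far less mysterious. The core is three things: send a request, parse the content, and store the data. Everything else (Requests, BeautifulSoup, Scrapy, Selenium, Playwright, aiohttp) just makes those three things more stable, faster, or more browser-like.

This article follows the path that real projects usually take: start with requests + BeautifulSoup to get the basics solid, then look at parsers such as lxml, parsel, and PyQuery; if a page is rendered dynamically by JavaScript, bring in Selenium or Playwright; once the crawl volume grows, async and distributed crawling become unavoidable; finally, tie together storage, deduplication, and anti-scraping, which are where people most often trip up. All examples target Python 3.8+, though Python 3.10 or later is recommended to avoid compatibility problems.

Before scraping a site, check its robots.txt and site rules. If a public API is available, use it instead of fighting the HTML pages: it is less work, and you are less likely to get stuck on anti-scraping measures.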

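1. What a scraper actually does

A scraper is not complicated. At its core it is just a program that does the following:

Requests a URL;
Receives HTML, JSON, images, or some other response;
Extracts the data it needs from that response;
Cleans, deduplicates, and stores the data;
Moves on to the next page or follows links to keep going.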
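
The five steps above can be sketched as a single loop. This is a minimal illustration rather than a production crawler: `fetch` is a hypothetical callable standing in for the HTTP request (so the sketch runs without network access), `FAKE_SITE` is invented sample data, and a plain list stands in for real storage.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every <a href> from one page."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=10):
    """Request, parse, deduplicate, store, then follow links."""
    seen, queue, stored = set(), [start_url], []
    while queue and len(stored) < max_pages:
        url = queue.pop(0)
        if url in seen:                       # deduplicate
            continue
        seen.add(url)
        html = fetch(url)                     # request the URL, get the response
        if html is None:
            continue
        parser = TitleAndLinks()
        parser.feed(html)                     # extract the data we need
        stored.append((url, parser.title.strip()))   # store
        for href in parser.links:             # follow links
            queue.append(urljoin(url, href))
    return stored


# Invented in-memory "site" so the sketch runs offline.
FAKE_SITE = {
    'https://example.com/': '<title>Home</title><a href="/p2">next</a>',
    'https://example.com/p2': '<title>Page 2</title><a href="/">home</a>',
}

pages = crawl('https://example.com/', FAKE_SITE.get)
print(pages)
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP library, so a real request function can replace the fake one later.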

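Typical use cases are straightforward: e-commerce price comparison, public-opinion monitoring, job-listing collection, academic data curation, and content aggregation. Search-engine indexing is the same kind of work, only at a much larger scale.

In scraping, many problems are not about whether you can write the code but about whether a given step should be done at all. For dynamic pages, for example, look for the underlying API first; if you can get JSON directly, do not launch a browser. That judgment usually matters more than piling on tools.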

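1.1 Set the boundaries first

Check robots.txt before sending requests; do not assume every page may be scraped.
Some sites restrict commercial use or bulk access, so do not ignore their terms.
Control your request rate; do not hit the site as if you were load testing it.
How the data is used afterwards must also be compliant; 'for learning purposes' is not an excuse.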
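
The robots.txt check can be done with the standard library. The sketch below parses an invented robots.txt from a string so that it runs offline; `my-spider` is a hypothetical crawler name, and a real crawler would load the file from the target site with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = 'my-spider'  # hypothetical crawler name

allowed_public = parser.can_fetch(USER_AGENT, 'https://example.com/articles/1')
allowed_admin = parser.can_fetch(USER_AGENT, 'https://example.com/admin/users')
delay = parser.crawl_delay(USER_AGENT)   # seconds requested by the site, or None

print(allowed_public, allowed_admin, delay)
```

If `crawl_delay()` returns a value, it is a reasonable lower bound for the pause between requests to that site.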

2. Development environment

2.1 Installing Python

Install Python 3.10 or later.

On Windows, download the installer from https://www.python.org/downloads and tick 'Add Python 3.x to PATH' during setup.

On macOS you can use Homebrew:

brew install python@3.10  # version suffix assumed; any python@3.x formula at 3.10 or later works

On Linux (Ubuntu/Debian), the usual commands are:

sudo apt update && sudo apt install python3 python3-pip python3-venv -y
python3 --version
pip3 --version

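If a machine has both Python 2 and Python 3 installed, write python3 and pip3 explicitly in most of the commands that follow.

2.2 Virtual environments

Dependencies in scraping projects get messy quickly. Creating a dedicated virtual environment saves a lot of trouble later.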

mkdir my_spider && cd my_spider
python3 -m venv venv

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

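After activation, packages are installed only into this environment. The habit is worth keeping, especially when you are juggling Selenium, Playwright, and Scrapy at the same time.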


3. Basic scraping: Requests + BeautifulSoup

3.1 Installation

pip install requests beautifulsoup4 lxml

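3.2 A minimal request example

import requests

url = 'https://httpbin.org/get'
params = {'q': 'python scraping', 'page': 1}  # sent as the query string
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'}
response = requests.get(url, params=params, headers=headers, timeout=10)
print(response.status_code)   # HTTP status code
print(response.encoding)      # encoding Requests will use to decode the body
print(response.text[:200])    # first 200 characters of the body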

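3.3 First scraper: fetching a page title

import requests
from bs4 import BeautifulSoup


def fetch_title(url):
    """Return the page's <title> text, or a message describing what went wrong."""
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise on 4xx/5xx responses
        # Guess the encoding from the body to avoid garbled text
        response.encoding = response.apparent_encoding

        soup = BeautifulSoup(response.text, 'lxml')
        title_tag = soup.find('title')
        if title_tag:
            return title_tag.get_text().strip()
        return 'No <title> tag found'
    except requests.RequestException as e:
        # Covers connection errors, timeouts, and HTTP error statuses
        return f'Fetch failed: {e}'


if __name__ == '__main__':
    url = 'https://www.example.com'
    title = fetch_title(url)
    print(f'Page title: {title}')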

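3.4 Common BeautifulSoup usage

soup.find('div', class_='content')      # first <div class="content">
soup.find_all('a', href=True)           # every <a> that has an href attribute
soup.select('div.content > ul li a')    # CSS selector, returns a list
node.get('href')                        # attribute value, or None if missing
node.get_text(strip=True)               # text content with whitespace stripped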

3.5 Storage

import sqlite3

conn = sqlite3.connect('spider.db')
cursor = conn.cursor()
# UNIQUE on url lets the database reject duplicate pages
cursor.execute('''
    CREATE TABLE IF NOT EXISTS articles (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        url TEXT UNIQUE
    );
''')
items = [('First article', 'https://...'), ('Second article', 'https://...')]
for title, url in items:
    try:
        cursor.execute('INSERT INTO articles (title, url) VALUES (?, ?)', (title, url))
    except sqlite3.IntegrityError:
        # URL already stored, skip it
        pass
conn.commit()
conn.close()
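
A UNIQUE constraint only catches URLs that are identical as strings. As a hedged sketch, URLs can be normalized before they are stored so that trivially different forms of the same page collapse into one key. The `TRACKING_PARAMS` set and the normalization rules here are assumptions for illustration; real sites may need different rules.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters treated as tracking noise; this set is an assumption
# and should be adjusted per site.
TRACKING_PARAMS = {'utm_source', 'utm_medium', 'utm_campaign'}


def normalize_url(url):
    """Return a canonical form of url suitable for use as a dedup key."""
    parts = urlsplit(url)
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS
    )
    # Lowercase scheme and host, sort the query, drop the fragment.
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or '/',
        urlencode(query),
        '',
    ))


a = normalize_url('HTTPS://Example.com/list?page=2&utm_source=x&q=py#top')
b = normalize_url('https://example.com/list?q=py&page=2')
print(a)
print(a == b)
```

Storing `normalize_url(url)` in the `url` column instead of the raw string lets the same UNIQUE constraint catch these near-duplicates.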

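3.6 Handling common anti-scraping measures

import requests

# A Session keeps cookies between requests, so the login state carries over
session = requests.Session()
login_data = {'username': 'xxx', 'password': 'xxx'}
session.post('https://example.com/login', data=login_data)
response = session.get('https://example.com/protected-page')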

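4. Parsing tools: lxml, parsel, PyQuery, regular expressions

4.1 lxml + XPath

from lxml import etree

html = '''<html><body>
<div class="post"><h2><a href="/p1">Article A</a></h2></div>
<div class="post"><h2><a href="/p2">Article B</a></h2></div>
</body></html>'''

tree = etree.HTML(html)
# The XPath selects only <div class="post">, so the sample markup carries that class
titles = tree.xpath('//div[@class="post"]/h2/a/text()')
links = tree.xpath('//div[@class="post"]/h2/a/@href')
for t, l in zip(titles, links):
    print(t, l)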

4.2 parsel

from parsel import Selector

html = '''<ul>
<li class="item"><a href="/a1">Item1</a></li>
<li class="item"><a href="/a2">Item2</a></li>
</ul>'''
sel = Selector(text=html)
for item in sel.css('li.item'):
    title = item.css('a::text').get()
    link = item.css('a::attr(href)').get()
    print(title, link)

4.3 PyQuery

from pyquery import PyQuery as pq

html = '''<div id="posts">
<h2><a href="/x1">News X1</a></h2>
<h2><a href="/x2">News X2</a></h2>
</div>'''
doc = pq(html)
# '#posts h2' matches <h2> elements inside the element with id="posts"
for item in doc('#posts h2'):
    a = pq(item).find('a')
    title = a.text()
    url = a.attr('href')
    print(title, url)

4.4 Regular expressions

import re
from bs4 import BeautifulSoup

# user@example.com is a placeholder address
html = '''<div> Contact email: user@example.com Contact phone: 123-4567-890 </div>'''
soup = BeautifulSoup(html, 'lxml')
info = soup.find('div').get_text()
email_pattern = r'[\w\.-]+@[\w\.-]+'
emails = re.findall(email_pattern, info)
phone_pattern = r'\d{3}-\d{4}-\d{3,4}'
phones = re.findall(phone_pattern, info)
print('Emails:', emails)
print('Phones:', phones)

5. Scrapy: a framework suited to projects

5.1 Installation and project structure

pip install scrapy
scrapy startproject myproject

myproject/
  scrapy.cfg
  myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
      __init__.py
      example_spider.py

5.2 A simple Spider

import scrapy
from myproject.items import MyprojectItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = MyprojectItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield item

        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

5.3 Item, Pipeline, Settings

# settings.py
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 1
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
}

5.4 Scrapy Shell

scrapy shell 'https://quotes.toscrape.com/'

response.css('div.quote span.text::text').getall()
response.xpath('//div[@class="quote"]/span[@class="text"]/text()').getall()

5.5 Middleware and concurrency

CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True

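6. Dynamic pages: Selenium and Playwright

6.1 Selenium

pip install selenium

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-gpu')

# Point this at your local chromedriver binary
service = ChromeService(executable_path='path/to/chromedriver')
driver = webdriver.Chrome(service=service, options=chrome_options)
try:
    driver.get('https://quotes.toscrape.com/js/')
    # Wait until the JavaScript-rendered quotes exist instead of sleeping a fixed time
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div.quote'))
    )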
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    for quote in soup.select('div.quote'):
        text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        print(text, author)
finally:
    driver.quit()

6.2 Playwright

pip install playwright
playwright install

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup


def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto('https://quotes.toscrape.com/js/')
        page.wait_for_selector('div.quote')
        html = page.content()
        browser.close()

        soup = BeautifulSoup(html, 'lxml')
        for quote in soup.select('div.quote'):
            text = quote.select_one('span.text').get_text()
            author = quote.select_one('small.author').get_text()
            print(text, author)


if __name__ == '__main__':
    main()

7. Async scraping: aiohttp and HTTPX

7.1 aiohttp

pip install aiohttp

import asyncio
import aiohttp
from bs4 import BeautifulSoup


async def fetch(session, url):
    try:
        # ClientTimeout(total=10) caps the whole request at 10 seconds
        timeout = aiohttp.ClientTimeout(total=10)
        async with session.get(url, timeout=timeout) as response:
            return await response.text()
    except Exception as e:
        print(f'Failed to fetch {url}: {e}')
        return None


async def parse(html, url):
    if not html:
        return
    soup = BeautifulSoup(html, 'lxml')
    title_tag = soup.find('title')
    title = title_tag.get_text(strip=True) if title_tag else 'N/A'
    print(f'URL: {url}, Title: {title}')


async def main(urls):
    conn = aiohttp.TCPConnector(limit=50)
    async with aiohttp.ClientSession(connector=conn) as session:
        tasks = [asyncio.create_task(fetch(session, url)) for url in urls]
        htmls = await asyncio.gather(*tasks)
        for html, url in zip(htmls, urls):
            await parse(html, url)


if __name__ == '__main__':
    urls = [f'https://example.com/page/{i}' for i in range(1, 101)]
    asyncio.run(main(urls))
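
The TCPConnector(limit=50) above caps the number of open connections. Another common way to bound concurrency, sketched here with only the standard library, is asyncio.Semaphore. `fake_fetch` is a hypothetical stand-in for a real HTTP call so the example runs offline, and the `stats` dictionary exists only to show that the limit holds.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a real HTTP call (hypothetical; no network access)."""
    await asyncio.sleep(0.01)
    return f'<title>{url}</title>'


async def bounded_fetch(sem, url, stats):
    async with sem:                        # at most `limit` coroutines get past here
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        try:
            return await fake_fetch(url)
        finally:
            stats['active'] -= 1


async def crawl(urls, limit=5):
    sem = asyncio.Semaphore(limit)
    stats = {'active': 0, 'peak': 0}
    results = await asyncio.gather(*(bounded_fetch(sem, u, stats) for u in urls))
    return results, stats['peak']


urls = [f'https://example.com/page/{i}' for i in range(1, 21)]
results, peak = asyncio.run(crawl(urls, limit=5))
print(len(results), peak)
```

A semaphore limits in-flight tasks regardless of the HTTP client in use, which makes it a convenient knob when the same politeness limit should apply across different libraries.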

7.2 HTTPX

pip install httpx

import asyncio
import httpx
from bs4 import BeautifulSoup


async def fetch(client, url):
    try:
        resp = await client.get(url, timeout=10.0)
        resp.raise_for_status()
        return resp.text
    except Exception as e:
        print(f'Error {url}: {e}')
        return None


async def main(urls):
    async with httpx.AsyncClient(limits=httpx.Limits(max_connections=50)) as client:
        tasks = [asyncio.create_task(fetch(client, url)) for url in urls]
        for coro in asyncio.as_completed(tasks):
            html = await coro
            if html:
                # Guard against pages that have no <title> tag
                title_tag = BeautifulSoup(html, 'lxml').find('title')
                title = title_tag.get_text(strip=True) if title_tag else 'N/A'
                print('Title:', title)


if __name__ == '__main__':
    urls = [f'https://example.com/page/{i}' for i in range(1, 101)]
    asyncio.run(main(urls))

8. Storage and deduplication

8.2 Deduplication with SQLite

import sqlite3

conn = sqlite3.connect('data.db')
cursor = conn.cursor()
# The UNIQUE constraint on url does the deduplication
cursor.execute('CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, title TEXT, url TEXT UNIQUE)')
data = [('Title 1', 'https://a.com/1'), ('Title 2', 'https://a.com/2')]
for title, url in data:
    try:
        cursor.execute('INSERT INTO items (title, url) VALUES (?, ?)', (title, url))
    except sqlite3.IntegrityError:
        # URL already stored, skip it
        pass
conn.commit()
conn.close()
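
As a variation on the try/except approach above, SQLite can skip duplicates itself with INSERT OR IGNORE. The sketch below uses an in-memory database and invented rows so that it leaves nothing on disk.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # in-memory database, nothing is written to disk
conn.execute(
    'CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, url TEXT UNIQUE)'
)

rows = [
    ('Title 1', 'https://a.com/1'),
    ('Title 2', 'https://a.com/2'),
    ('Title 1 again', 'https://a.com/1'),   # duplicate URL, silently skipped
]
# OR IGNORE skips rows that would violate the UNIQUE constraint on url.
conn.executemany('INSERT OR IGNORE INTO items (title, url) VALUES (?, ?)', rows)
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM items').fetchone()[0]
titles = [r[0] for r in conn.execute('SELECT title FROM items ORDER BY id')]
print(count, titles)
conn.close()
```

This keeps the insert loop free of exception handling and lets `executemany` batch the rows, at the cost of not knowing which rows were skipped.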

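8.3 Deduplication with Redis

import redis

r = redis.Redis(host='localhost', port=6379, db=0)
url = 'https://example.com/page/1'
# SADD returns 1 when the member is new and 0 when it already exists
if r.sadd('visited_urls', url):
    print('New URL, ready to crawl')
else:
    print('URL already seen, skipping')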
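
A Redis set stores every URL in full, so its memory use grows with the crawl. For very large URL sets, a Bloom filter (the topic of section 8.4 in the outline) trades a small false-positive rate for a fixed memory footprint: an item that was added is always reported as present, while an item that was never added is occasionally reported as present too. The class below is a minimal in-memory sketch, not the article's implementation; the bit-array size, the number of hashes, and the salted SHA-256 scheme are all illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """A small in-memory Bloom filter (sizes here are illustrative)."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f'{seed}:{item}'.encode('utf-8')).digest()
            yield int.from_bytes(digest[:8], 'big') % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
seen.add('https://example.com/page/1')
print('https://example.com/page/1' in seen)   # an added URL is always reported as present
print('https://example.com/page/2' in seen)   # usually False; false positives are possible
```

In a crawler, a false positive means a new URL is occasionally skipped, so the filter size should be chosen with the expected number of URLs in mind.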

9. Distributed scraping: Scrapy-Redis

9.1 Installation

pip install scrapy-redis

9.2 Basic idea

from scrapy_redis.spiders import RedisSpider
from myproject.items import MyprojectItem


class RedisQuotesSpider(RedisSpider):
    name = 'redis_quotes'
    # Start URLs are read from this Redis key instead of start_urls
    redis_key = 'redis_quotes:start_urls'

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = MyprojectItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield item

        # Follow pagination once per page, after all quotes have been yielded
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

# settings.py
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
SCHEDULER_PERSIST = True
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
REDIS_URL = 'redis://:password@localhost:6379/0'  # placeholder password and host

10. Anti-scraping: do not overthink it

10.1 Request headers and rate

headers = {
    'User-Agent': 'Mozilla/5.0 ...',
    'Referer': 'https://example.com/',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Cookie': 'sessionid=xxx; other=yyy',
}

import random
import time

# Pause a random 1 to 3 seconds between requests
time.sleep(random.uniform(1, 3))
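
The random sleep above can be wrapped in a small helper that only waits when the previous request to the same host was too recent. This is a hedged sketch: the `Throttle` class and its parameters are hypothetical names introduced for illustration, and the clock and sleep functions are injectable so the demo runs without actually waiting.

```python
import random
import time


class Throttle:
    """Enforce a minimum, slightly randomized gap between requests per host."""

    def __init__(self, min_delay=1.0, jitter=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.jitter = jitter
        self.clock = clock      # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last_seen = {}     # host -> time of the previous request

    def wait(self, host):
        """Sleep if the last request to host was too recent; return the pause length."""
        target_gap = self.min_delay + random.uniform(0, self.jitter)
        now = self.clock()
        last = self.last_seen.get(host)
        pause = 0.0
        if last is not None:
            pause = max(0.0, target_gap - (now - last))
        if pause:
            self.sleep(pause)
        self.last_seen[host] = self.clock()
        return pause


# sleep is replaced with a no-op so this demo finishes immediately.
throttle = Throttle(min_delay=1.0, jitter=2.0, sleep=lambda seconds: None)
first = throttle.wait('example.com')    # first request to a host never waits
second = throttle.wait('example.com')   # an immediate second request is asked to wait
print(first, round(second, 2))
```

Calling `throttle.wait(host)` before each request keeps the gap per host between 1 and 3 seconds with the defaults shown, matching the range used in the snippet above.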

10.2 Login and cookies

import requests
from bs4 import BeautifulSoup

session = requests.Session()
# Load the login page first to obtain the CSRF token
login_page = session.get('https://example.com/login')
soup = BeautifulSoup(login_page.text, 'lxml')
token = soup.find('input', {'name': 'csrf_token'})['value']
data = {'username': 'yourname', 'password': 'yourpwd', 'csrf_token': token}
session.post('https://example.com/login', data=data, headers={'User-Agent': '...'})
# The session now carries the login cookies
profile = session.get('https://example.com/profile')
print(profile.text)

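10.3 CAPTCHAs

pip install pytesseract pillow

from PIL import Image
import pytesseract

# pytesseract is a wrapper; the Tesseract OCR engine must be installed separately
img = Image.open('captcha.png')
text = pytesseract.image_to_string(img).strip()
print('Recognized text:', text)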

10.4 Proxy pools

import random

class RandomProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(proxies=crawler.settings.get('PROXY_LIST'))

    def process_request(self, request, spider):
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy

11. A complete example: scraping news with Scrapy and storing it

11.2 Spider

from datetime import datetime

import scrapy
from news_spider.items import NewsSpiderItem


class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = ['news.example.com']
    start_urls = ['https://news.example.com/']

    def parse(self, response):
        for news in response.css('div.headline-list div.item'):
            href = news.css('a::attr(href)').get()
            if not href:
                continue  # skip entries without a detail link
            item = NewsSpiderItem()
            # get(default='') avoids AttributeError when a node is missing
            item['title'] = news.css('h2.title::text').get(default='').strip()
            item['summary'] = news.css('p.summary::text').get(default='').strip()
            item['url'] = response.urljoin(href)
            item['pub_date'] = news.css('span.pub-date::text').get(default='').strip()
            yield scrapy.Request(url=item['url'], callback=self.parse_detail, meta={'item': item})

        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        item = response.meta['item']
        pub_date = response.css('div.meta span.date::text').get(default='').strip()
        item['pub_date'] = self.parse_date(pub_date)
        yield item

    def parse_date(self, date_str):
        # Returns None when the string does not match the expected format
        try:
            return datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S')
        except ValueError:
            return None

11.3 MySQL Pipeline

import pymysql
from pymysql.err import IntegrityError


class MySQLPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host=spider.settings.get('MYSQL_HOST'),
            port=spider.settings.get('MYSQL_PORT'),
            user=spider.settings.get('MYSQL_USER'),
            password=spider.settings.get('MYSQL_PASSWORD'),
            database=spider.settings.get('MYSQL_DB'),  # 'db' is a deprecated alias in PyMySQL
            charset=spider.settings.get('MYSQL_CHARSET'),
            cursorclass=pymysql.cursors.DictCursor
        )
        self.cursor = self.conn.cursor()
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS headline_news (
                id INT AUTO_INCREMENT PRIMARY KEY,
                title VARCHAR(255),
                summary TEXT,
                url VARCHAR(512) UNIQUE,
                pub_date DATETIME
            ) CHARACTER SET utf8mb4;
        """)
        self.conn.commit()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        insert_sql = """
            INSERT INTO headline_news (title, summary, url, pub_date)
            VALUES (%s, %s, %s, %s)
        """
        try:
            self.cursor.execute(insert_sql, (
                item.get('title'),
                item.get('summary'),
                item.get('url'),
                item.get('pub_date')
            ))
            self.conn.commit()
        except IntegrityError:
            pass
        return item

11.4 Configuration

MYSQL_HOST = 'localhost'
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_DB = 'news_db'
MYSQL_CHARSET = 'utf8mb4'

ITEM_PIPELINES = {
    'news_spider.pipelines.MySQLPipeline': 300,
}

ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 8

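12. Common libraries

Library	Purpose	Best suited for
requests	Synchronous HTTP requests	Most simple scrapers
httpx	Sync/async HTTP client	When you want a Requests-style API or async support
aiohttp	asyncio-based async requests	High-concurrency fetching
BeautifulSoup	HTML/XML parsing	Beginners and small to mid-size projects
lxml	High-performance parsing + XPath	Fast extraction, large numbers of pages
parsel	Scrapy-style parsing	Scrapy projects or code written in that style
PyQuery	jQuery-style parsing	Developers with front-end habits
Selenium	Browser automation	When real browser behavior is needed
Playwright	Modern browser automation	Dynamic pages, automated testing, scraping
scrapy	Scraping framework	Structured projects
scrapy-redis	Distributed scheduling	Multi-machine cooperation
pymysql	MySQL driver	Writing to MySQL
pymongo	MongoDB driver	Document storage
redis	Cache/dedup/queue	Fingerprint dedup, task scheduling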

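Python Web Scraping from Beginner to Practice: Requests, Scrapy, Async, and Anti-Scraping

Preface
1. What a scraper actually does
1.1 Set the boundaries first
2. Development environment
2.1 Installing Python
2.2 Virtual environments
2.3 Common tools
3. Basic scraping: Requests + BeautifulSoup
3.1 Installation
3.2 A minimal request example
3.3 First scraper: fetching a page title
3.4 Common BeautifulSoup usage
3.5 Storage
3.6 Handling common anti-scraping measures
4. Parsing tools: lxml, parsel, PyQuery, regular expressions
4.1 lxml + XPath
4.2 parsel
4.3 PyQuery
4.4 Regular expressions
5. Scrapy: a framework suited to projects
5.1 Installation and project structure
5.2 A simple Spider
5.3 Item, Pipeline, Settings
5.4 Scrapy Shell
5.5 Middleware and concurrency
6. Dynamic pages: Selenium and Playwright
6.1 Selenium
6.2 Playwright
6.3 When not to launch a browser
7. Async scraping: aiohttp and HTTPX
7.1 aiohttp
7.2 HTTPX
8. Storage and deduplication
8.1 Common storage options
8.2 Deduplication with SQLite
8.3 Deduplication with Redis
8.4 Bloom Filter
9. Distributed scraping: Scrapy-Redis
9.1 Installation
9.2 Basic idea
10. Anti-scraping: do not overthink it
10.1 Request headers and rate
10.2 Login and cookies
10.3 CAPTCHAs
10.4 Proxy pools
11. A complete example: scraping news with Scrapy and storing it
11.1 Requirements
11.2 Spider
11.3 MySQL Pipeline
11.4 Configuration
12. Common libraries
13. Common errors
13.1 ModuleNotFoundError: No module named 'xxx'
13.2 requests.exceptions.SSLError
13.3 chromedriver executable needs to be in PATH
13.4 MySQL Access denied
13.5 TimeoutError
14. Wrapping up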
