Python Web Scraping in Practice: From the Basics to Distributed Crawling

python -m venv crawler_env
source crawler_env/bin/activate  # Windows: crawler_env\Scripts\activate
pip install requests lxml beautifulsoup4 pymongo redis pymysql scrapy scrapy-redis

import requests
from requests.exceptions import RequestException

def fetch_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        return response.text
    except RequestException as e:
        print(f"Request failed: {e}")
        return None

url = 'https://example.com'
text = fetch_page(url)
if text:
    print(text[:500])
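
Because fetch_page() returns None on any request error, a caller may want to retry transient failures before giving up. The following is a minimal sketch of exponential backoff, not part of the original article; the name fetch_with_retry and its parameters are assumptions made for illustration, and the sleep function is injectable so the logic can be exercised without real delays.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None result or attempts run out.

    `fetch` is any callable that returns None on failure, such as the
    fetch_page() function above. The delay doubles after each failed attempt.
    """
    for attempt in range(attempts):
        result = fetch(url)
        if result is not None:
            return result
        # Do not sleep after the final attempt; there is nothing left to wait for.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

Usage would look like `text = fetch_with_retry(fetch_page, 'https://example.com')`. Doubling the delay is one common choice; a fixed delay or added jitter would work equally well for small crawls.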

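# Reuse one Session so cookies and connections persist across requests
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'})
resp = session.post('https://example.com/login', data={'user': 'admin'}, timeout=10)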

from lxml import etree

html_content = '''<div class="news">
    <h1>Title</h1>
    <p class="author">Author</p>
</div>'''
tree = etree.HTML(html_content)
# Extract the title text (xpath() returns a list of matches)
title = tree.xpath('//div[@class="news"]/h1/text()')
print(title)

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'lxml')
items = soup.find_all('div', class_='news')
for item in items:
    print(item.get_text())

import re

pattern = r'\d{3}-\d{4}'
# fetch_page() returns None on failure, so fall back to an empty string
match = re.search(pattern, text or '')
if match:
    print(match.group())
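
re.search() stops at the first hit, and the bare pattern will also match inside longer digit runs. As a hedged illustration (the function name and the sample strings are assumptions, not from the original article), the sketch below collects every match and adds word boundaries so that only standalone 3-4 digit groups are returned.

```python
import re

# Word boundaries keep the pattern from matching inside longer digit runs.
FRAGMENT_RE = re.compile(r'\b\d{3}-\d{4}\b')


def extract_fragments(text):
    """Return every standalone 'ddd-dddd' fragment found in text."""
    return FRAGMENT_RE.findall(text or '')
```

Compiling the pattern once is a small convenience when the same expression is applied to many pages.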

headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Referer': 'https://www.google.com/'
}

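import random

proxies_list = ['http://ip1:port', 'http://ip2:port']
proxy = random.choice(proxies_list)
# Map both schemes; with only 'http', HTTPS URLs would bypass the proxy
response = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=5)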
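
Picking a proxy at random works for a short list, but a pool usually also needs to retire proxies that keep failing. The sketch below is one possible approach under stated assumptions: the ProxyPool class, its method names, and the failure threshold are all hypothetical and do not come from the original article. It rotates round-robin rather than randomly so the behavior is easy to reason about.

```python
class ProxyPool:
    """Round-robin proxy rotation that drops proxies after repeated failures."""

    def __init__(self, proxies, max_failures=3):
        self._proxies = list(proxies)
        self._failures = {proxy: 0 for proxy in self._proxies}
        self._max_failures = max_failures
        self._index = 0

    def get(self):
        """Return the next healthy proxy, or raise if none are left."""
        if not self._proxies:
            raise RuntimeError("no healthy proxies left")
        proxy = self._proxies[self._index % len(self._proxies)]
        self._index += 1
        return proxy

    def report_failure(self, proxy):
        """Count a failure and retire the proxy once it hits the threshold."""
        if proxy not in self._failures:
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures:
            self._proxies.remove(proxy)
            del self._failures[proxy]
```

A caller would pass the result of `pool.get()` as both the 'http' and 'https' entries of the proxies mapping, and call `pool.report_failure(proxy)` when the request raises.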

import random
import time

time.sleep(random.uniform(1, 3))  # pause 1-3 s between requests
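
A fixed random sleep slows every request equally, even when consecutive requests go to different sites. One alternative, sketched here under assumptions (the DomainThrottle class and its parameters are illustrative names, not from the original article), is to enforce a minimum interval per domain. The clock and sleep functions are injectable so the logic can be checked without waiting.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_hit = {}  # domain -> timestamp of the previous request

    def wait(self, url):
        """Sleep if the domain was hit too recently; return seconds waited."""
        domain = urlparse(url).netloc
        waited = 0.0
        last = self._last_hit.get(domain)
        if last is not None:
            remaining = self.min_interval - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_hit[domain] = self._clock()
        return waited
```

Calling `throttle.wait(url)` immediately before each request keeps the per-site rate bounded while leaving requests to other domains undelayed.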

scrapy startproject myspider
cd myspider
scrapy genspider example example.com

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/page/1']

    def parse(self, response):
        for href in response.css('a.next::attr(href)').getall():
            yield response.follow(href, self.parse)
        
        yield {
            'title': response.css('h1::text').get(),
            'content': response.css('.content::text').getall()
        }
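
The spider above yields raw values: the title may carry stray whitespace and the content is a list of text fragments. A Scrapy item pipeline is a plain class with a process_item(self, item, spider) method, so cleanup can live there. The sketch below is an assumption-laden example rather than the article's own pipeline; the class name CleanTextPipeline is hypothetical, and it only relies on the 'title' and 'content' keys that the spider emits.

```python
class CleanTextPipeline:
    """Normalize the fields yielded by ExampleSpider before they are stored."""

    def process_item(self, item, spider):
        # .get() on the selector may have returned None when no <h1> matched.
        title = item.get('title')
        item['title'] = title.strip() if title else None

        # Join the non-empty text fragments into one newline-separated string.
        fragments = (frag.strip() for frag in item.get('content') or [])
        item['content'] = '\n'.join(frag for frag in fragments if frag)
        return item
```

To take effect, a pipeline has to be listed in the ITEM_PIPELINES setting, for example under the path 'myspider.pipelines.CleanTextPipeline' if it is placed in the generated pipelines.py (that path is an assumption based on the project name used above).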

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['crawler_db']
collection = db['articles']

data = {'title': 'Sample article', 'url': 'https://...'}
result = collection.insert_one(data)
print(result.inserted_id)

import pymysql
conn = pymysql.connect(host='localhost', user='root', password='pwd', database='db')
try:
    with conn.cursor() as cursor:
        # Parameterized query: the driver escapes the value
        cursor.execute("INSERT INTO articles (title) VALUES (%s)", ('Title',))
    conn.commit()
finally:
    conn.close()

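import redis

# decode_responses=True makes Redis return str instead of bytes
r = redis.Redis(host='localhost', port=6379, decode_responses=True)
r.lpush('crawl_queue', 'https://example.com')
while True:
    # brpop blocks for up to 5 s instead of busy-polling an empty queue
    item = r.brpop('crawl_queue', timeout=5)
    if item is None:
        continue
    _, url = item
    process(url)  # process() is the caller's own handler, not defined here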
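
When several workers share one queue, the same URL can be pushed more than once, so a deduplication step is usually placed in front of it. The sketch below illustrates the idea with a URL fingerprint and an in-memory set; it is a simplification under stated assumptions, not the fingerprint algorithm used by Scrapy or Scrapy-Redis, and the names url_fingerprint and SeenFilter are hypothetical. In a distributed setup the set would typically live in Redis so that all workers see it.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def url_fingerprint(url):
    """Hash a canonical form of the URL: lowercase host, sorted query, no fragment."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or '/', query, '')
    )
    return hashlib.sha1(canonical.encode('utf-8')).hexdigest()


class SeenFilter:
    """In-memory stand-in for a shared set of already-queued fingerprints."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        """Return True the first time a URL is seen, False afterwards."""
        fingerprint = url_fingerprint(url)
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True
```

A producer would call `is_new(url)` and only push the URL onto 'crawl_queue' when it returns True.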

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info("Crawl started")
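
Logging becomes more useful once requests run in parallel, because failures no longer appear in order. As a rough sketch of bounded concurrency with per-URL logging (the function name crawl_concurrently and its parameters are assumptions for illustration, and a thread pool is only one of several ways to do this), a fetch callable such as fetch_page() can be fanned out like so:

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger(__name__)


def crawl_concurrently(urls, fetch, max_workers=5):
    """Run fetch(url) for each URL on at most max_workers threads.

    Returns a dict mapping each URL to its result, or to None if fetch raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
                logger.info("Fetched %s", url)
            except Exception as exc:
                # One failing URL should not abort the rest of the batch.
                results[url] = None
                logger.warning("Failed %s: %s", url, exc)
    return results
```

The max_workers cap is what keeps the load on the target site bounded; it should be chosen together with any per-domain delay already in place.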

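Introduction

1. Environment Setup and HTTP Requests

1.1 Environment Setup

1.2 Sending HTTP Requests

2. Page Parsing and Data Extraction

2.1 XPath Parsing

2.2 CSS Selectors and BeautifulSoup

2.3 Regular Expressions

3. Handling Anti-Scraping Measures

3.1 Request Header Spoofing

3.2 Proxy IP Pools

3.3 CAPTCHAs and Encryption

3.4 Rate Limiting

4. Scrapy as an Engineering Framework

4.1 Project Initialization

4.2 Writing a Spider

4.3 Middleware and Pipelines

5. Data Storage and Management

5.1 MongoDB Storage

5.2 MySQL Storage

6. Distributed Crawler Architecture

6.1 Redis Task Queue

6.2 Scrapy-Redis Integration

7. Performance Optimization and Monitoring

7.1 Concurrency Control

7.2 Logging

Conclusion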
