Python Web Scraping: Principles and Practical Application Guide | 极客日志

import requests

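url = 'http://www.example.com'
# Send a browser-like User-Agent; some sites reject the default python-requests one
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
# Set a timeout so that a stalled server cannot block the script indefinitely
response = requests.get(url, headers=headers, timeout=5)
print(response.status_code)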

if response.status_code == 200:
    content = response.text
else:
    print(f'Error: {response.status_code}')

import re

html = '<a href="http://www.example.com">Example</a>'
pattern = r'<a\s+href=["\']([^"\']+)["\']>([^<]+)</a>'
matches = re.findall(pattern, html)
for href, text in matches:
    print(href, text)

from lxml import etree

html = '<div class="list"><a href="/p1">Link1</a></div>'
tree = etree.HTML(html)
links = tree.xpath('//a/@href')
print(links)

from bs4 import BeautifulSoup

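# Sample markup to parse (the same fragment as in the XPath example)
html = '<div class="list"><a href="/p1">Link1</a></div>'
soup = BeautifulSoup(html, 'html.parser')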
a_tags = soup.find_all('a')
for tag in a_tags:
    print(tag.get('href'), tag.get_text())

import csv

data = [['Title', 'Author'], ['Article A', 'User1']]
with open('output.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerows(data)
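
Scraped records often arrive as dictionaries rather than as pre-built rows. As a minimal sketch (the helper names `save_csv` and `save_jsonl`, the field names, and the file paths are illustrative assumptions, not part of the original example), the same data could be written with `csv.DictWriter`, or as JSON Lines when the records are nested:

```python
import csv
import json


def save_csv(records, path, fieldnames):
    """Write a list of dicts to a CSV file with a header row."""
    # utf-8-sig adds a BOM so that Excel detects the encoding correctly
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


def save_jsonl(records, path):
    """Write one JSON object per line; ensure_ascii=False keeps non-ASCII text readable."""
    with open(path, 'w', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')
```

For example, `save_csv([{'Title': 'Article A', 'Author': 'User1'}], 'output.csv', ['Title', 'Author'])` should produce the same file contents as the `csv.writer` version above.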

import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        yield {'url': response.url}

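from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get('http://www.example.com')
    # Wait up to 10 seconds for the <body> element to be present
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'body')))
    content = driver.page_source
finally:
    driver.quit()  # always close the browser, even if the wait times out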

proxies = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080'
}
response = requests.get(url, proxies=proxies)
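
A single fixed proxy is not yet a pool. One simple way to rotate through several proxies is round-robin selection; the sketch below assumes a hypothetical list of local proxy addresses and a hypothetical helper named `next_proxies`, neither of which comes from the original article:

```python
import itertools

# Hypothetical proxy addresses; replace with proxies you are authorized to use
PROXY_POOL = [
    'http://127.0.0.1:8080',
    'http://127.0.0.1:8081',
    'http://127.0.0.1:8082',
]

# cycle() repeats the pool endlessly, so the rotation never runs out
_proxy_cycle = itertools.cycle(PROXY_POOL)


def next_proxies():
    """Return a requests-style proxies dict for the next proxy in the pool."""
    proxy = next(_proxy_cycle)
    return {'http': proxy, 'https': proxy}
```

Each request could then pass `proxies=next_proxies()` to `requests.get`, so that consecutive requests leave through different proxies. A production pool would probably also need to drop proxies that stop responding, which this sketch does not attempt.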

import time

time.sleep(1)  # pause for 1 second between requests
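
A fixed one-second sleep is easy to fingerprint and ignores the time the request itself took. As a rough sketch (the `Throttle` class and its parameters are illustrative assumptions, not something the original article defines), the delay can be measured from the previous request and randomized slightly:

```python
import random
import time


class Throttle:
    """Enforce a minimum delay, plus random jitter, between consecutive calls."""

    def __init__(self, min_delay=1.0, jitter=0.5):
        self.min_delay = min_delay
        self.jitter = jitter
        self._next_allowed = 0.0  # monotonic time before which wait() must sleep

    def wait(self):
        now = time.monotonic()
        if now < self._next_allowed:
            time.sleep(self._next_allowed - now)
        # Schedule the next slot: fixed delay plus a random extra in [0, jitter]
        delay = self.min_delay + random.uniform(0, self.jitter)
        self._next_allowed = time.monotonic() + delay
```

Calling `throttle.wait()` immediately before each request keeps the spacing at or above `min_delay` without adding extra delay when the previous request was already slow.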

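# login_url, data_url, and data (the login form fields) are placeholders to be defined by the caller
session = requests.Session()
session.post(login_url, data=data)  # the session stores the cookies set at login
resp = session.get(data_url)        # later requests send those cookies automatically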
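
The outline lists multithreading and asynchronous I/O under concurrency, but no example for either accompanies them here. The following is only a sketch under stated assumptions: `fetch_all_threaded` and `fetch_all_async` are hypothetical helper names, and the caller supplies the `fetch` function, so no particular HTTP client is implied.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def fetch_all_threaded(urls, fetch, max_workers=5):
    """Run a blocking fetch(url) in a thread pool; results keep the order of urls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


async def fetch_all_async(urls, fetch, limit=5):
    """Run a coroutine fetch(url) for every URL, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(url) for url in urls))
```

With `requests`, the threaded version might be called as `fetch_all_threaded(urls, lambda u: requests.get(u, timeout=5).text)`. The asynchronous version needs a coroutine-based HTTP client for `fetch` and would be started with `asyncio.run(fetch_all_async(urls, fetch))`. Both cap the number of simultaneous requests, which matters as much for politeness toward the target site as for performance.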

import logging

logging.basicConfig(level=logging.INFO)
try:
    response = requests.get(url, timeout=5)
except requests.RequestException as e:
    # Catch only requests' own errors (timeouts, connection failures, invalid URLs)
    logging.error(f'Request failed: {e}')
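
Logging a failure is often paired with retrying it, since many network errors are transient. A minimal sketch of retrying with exponential backoff follows; the `retry` helper and its parameters are assumptions made for illustration and are not part of the `requests` library:

```python
import logging
import time


def retry(func, attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Call func(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions as e:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            delay = base_delay * (2 ** attempt)
            logging.warning('Attempt %d failed (%s); retrying in %.1fs',
                            attempt + 1, e, delay)
            time.sleep(delay)
```

It could wrap the request above as `retry(lambda: requests.get(url, timeout=5), exceptions=(requests.RequestException,))`, so that only network-level errors trigger a retry while programming errors still surface immediately.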

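Introduction

1. HTTP Request and Response Mechanism

1.1 HTTP Request Structure

1.2 HTTP Response Handling

2. Web Page Parsing Techniques

2.1 Regular Expressions

2.2 XPath Parsing

2.3 BeautifulSoup Parsing

2.4 Data Storage Strategies

3. Crawler Framework Architecture

3.1 The Scrapy Framework

3.2 Selenium Automation

4. Advanced Techniques and Anti-Scraping Countermeasures

4.1 Proxy IP Pools

4.2 Request Rate Control

4.3 CAPTCHA Recognition

4.4 Login and Session Management

5. Concurrency and Performance Optimization

5.1 Multithreading

5.2 Asynchronous I/O

6. Exception Handling and Monitoring

Conclusion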
