Core Techniques and Principles of Python Web Scraping, with Practical Examples | 极客日志


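This article covers the basic principles of Python web crawlers, the HTTP request/response mechanism, web page parsing techniques (regular expressions, XPath, BeautifulSoup), common frameworks (Scrapy, Selenium), and strategies for dealing with anti-crawler measures. It walks through data extraction and storage with examples and stresses the importance of complying with site policies and applicable laws and regulations.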

战神 · Published 2025/2/7 · Updated 2026/7/2237 views

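Introduction

As the internet has grown, vast amounts of data have come to be stored online, and we often need to extract useful information from it. Python, a powerful and easy-to-learn programming language, is widely used for building web crawlers. This article describes the techniques a Python crawler relies on and the principles behind them, with accompanying code examples.

1. HTTP Requests and Responses

Before scraping web pages, we need to understand HTTP, the basic protocol for exchanging data on the Web. HTTP requests and responses underpin everything a crawler does, so we need to know how they are structured and how they interact.

1.1 HTTP Requests

An HTTP request consists of a request line, request headers, and a request body. The request line contains the request method, the requested URL, and the protocol version; the request headers carry information describing the request; the request body is optional and carries the data sent with the request. Here is an example of an HTTP request: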

GET /path/to/resource HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
...

In Python, we can send HTTP requests with the requests library. The following example sends a GET request with requests:

import requests

url = 'http://www.example.com'
try:
    response = requests.get(url, timeout=5)
    print(response.text)
except Exception as e:
    print(f"Request failed: {e}")
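To make the request structure described in this section concrete, the following minimal sketch assembles the text of an HTTP/1.1 request by hand: request line, headers, a blank line, then the optional body. The helper name `build_http_request` is hypothetical and exists only for illustration; in practice a library such as requests builds this message for you.

```python
def build_http_request(method, path, host, headers=None, body=""):
    """Assemble an HTTP/1.1 request: request line, headers, blank line, body.

    Hypothetical helper for illustration only; not part of any library.
    """
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        # Tell the server how many bytes of body follow the headers.
        lines.append(f"Content-Length: {len(body.encode('utf-8'))}")
    # A blank line separates the headers from the (optional) body.
    return "\r\n".join(lines) + "\r\n\r\n" + body


request_text = build_http_request(
    "GET",
    "/path/to/resource",
    "www.example.com",
    headers={"Accept": "text/html"},
)
print(request_text)
```

The printed text has the same shape as the raw request shown earlier in this section, which is roughly what an HTTP client sends over the connection.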

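1.2 HTTP Responses

An HTTP response consists of a status line, response headers, and a response body. The status line contains the protocol version, the status code, and the status message; the response headers carry information describing the response; the response body is the data actually returned. Here is an example of an HTTP response: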

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
...
<html>
...
</html>

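In Python, the requests library exposes the status code, headers, and body of a response:

import requests

url = 'http://www.example.com'
try:
    response = requests.get(url, timeout=5)
    print(response.status_code)  # status code, e.g. 200
    print(response.headers)      # response headers
    print(response.text)         # response body decoded as text
except requests.RequestException as e:
    print(f"Error: {e}")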
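The three parts of a response can also be seen by splitting a raw response by hand. The sketch below is a simplified illustration, assuming a complete response held in a single string with CRLF line endings; the function name `parse_http_response` and the sample response are made up for the example, and real clients handle many cases it ignores (chunked bodies, repeated headers, and so on).

```python
def parse_http_response(raw):
    """Split a raw HTTP response into status code, reason, headers, and body.

    Illustrative only: assumes the whole response is one string with CRLF
    line endings and no repeated header names.
    """
    # The first blank line separates the head (status line + headers) from the body.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, status_code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status_code), reason, headers, body


raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)
status, reason, headers, body = parse_http_response(raw_response)
print(status, reason)
print(headers["content-type"])
print(body)
```

Header names are lowercased here because HTTP header names are case-insensitive; requests offers the same convenience through its case-insensitive `response.headers` mapping.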

2. Web Page Parsing Techniques

2.1 Regular Expressions

import re

html = '<a href="http://example.com">Example</a>'
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)
for link in links:
    print(link[0], link[1])
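The pattern above only matches anchors whose sole attribute is a double-quoted href. As a hedged sketch of how such a pattern might be loosened, the version below tolerates extra attributes and either quote style, and uses named groups for readability. The sample HTML and the name `LINK_RE` are made up for illustration; regular expressions remain fragile for real-world HTML, which is one reason to reach for a parser instead.

```python
import re

# Sample markup (invented): extra attributes, single and double quotes.
html = (
    '<a class="nav" href="http://example.com/a">First</a>'
    "<a href='http://example.com/b' target=\"_blank\">Second</a>"
)

LINK_RE = re.compile(
    # <a ...  href=<quote>URL<same quote> ... >link text</a>
    r"""<a\b[^>]*?\bhref=(["'])(?P<href>.*?)\1[^>]*>(?P<text>[^<]*)</a>""",
    re.IGNORECASE,
)

links = [(m.group("href"), m.group("text")) for m in LINK_RE.finditer(html)]
for href, text in links:
    print(href, text)
```

The backreference `\1` requires the closing quote to match the opening one, so a URL containing the other quote character is not cut short.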

2.2 XPath

from lxml import etree

html = '<a href="http://www.example.com">Example</a>'
tree = etree.HTML(html)
links = tree.xpath('//a')
for link in links:
    print(link.get('href'), link.text)
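The expression `//a` above selects every anchor in the document. XPath can also narrow the selection by position in the tree and by attribute value. The following sketch shows that idea with the standard library's `xml.etree.ElementTree` instead of lxml; note that ElementTree supports only a limited XPath subset and requires well-formed markup. The sample document and its class names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup.
doc = """
<div>
  <ul class="note-list">
    <li><a class="title" href="/p/1">First post</a></li>
    <li><a class="title" href="/p/2">Second post</a></li>
  </ul>
  <a class="footer" href="/about">About</a>
</div>
""".strip()

root = ET.fromstring(doc)

# Only <a class="title"> elements inside <ul class="note-list"> items;
# the footer link is excluded by the path and the attribute predicates.
titles = root.findall(".//ul[@class='note-list']/li/a[@class='title']")
links = [(a.get("href"), a.text) for a in titles]
for href, text in links:
    print(href, text)
```

With lxml, the same path should also work through `tree.xpath(...)`, which additionally supports full XPath 1.0 features such as `//a/@href`.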

2.3 BeautifulSoup

from bs4 import BeautifulSoup

html = '<a href="http://www.example.com">Example</a>'
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')
for link in links:
    print(link.get('href'), link.text)

2.4 Extracting Data

links = soup.select('a')
for link in links:
    href = link['href']
    text = link.get_text()
    print(href, text)

2.5 Data Storage and Post-processing

import csv

data = [['url', 'text'], [href, text]]
with open('output.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(data)
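When each record is naturally a dictionary, `csv.DictWriter` and `csv.DictReader` keep column names attached to the values. The sketch below writes to an in-memory buffer rather than a file so it can be run anywhere, and reads the rows back to show that commas and quotes inside values survive the round trip. The sample rows are invented.

```python
import csv
import io

# Invented sample records; the second one contains a comma and quotes.
rows = [
    {"url": "http://www.example.com", "text": "Example"},
    {"url": "http://www.example.com/a,b", "text": 'He said "hi"'},
]

# newline="" lets the csv module control line endings itself.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["url", "text"])
writer.writeheader()
writer.writerows(rows)

# Read the data back for further processing.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
for row in read_back:
    print(row["url"], row["text"])
```

To write to disk instead, replace the buffer with `open('output.csv', 'w', newline='', encoding='utf-8')` as in the example above.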

3. Crawler Frameworks

3.1 Scrapy

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Process the response
        # Extract data
        # Send further requests
        pass

3.2 BeautifulSoup + requests

import requests
from bs4 import BeautifulSoup

url = 'http://www.example.com'
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')
# Process the page and extract data

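3.3 Selenium

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('path/to/chromedriver'))
try:
    driver.get('http://www.example.com')
    # Process the page and extract data
finally:
    driver.quit()  # always close the browser, even if an error occurs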

4. Other Key Techniques

4.1 User-Agent Spoofing

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers)
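Sending a browser-like User-Agent is only one way to cope with anti-crawler measures; sites also commonly throttle or block clients that send requests too quickly. One common response is to retry failed requests with increasing, slightly randomized delays. The sketch below illustrates that idea under stated assumptions: `backoff_delays`, `fetch_with_retries`, and `flaky_fetch` are hypothetical names, and `fetch` stands for any callable that downloads a URL (for example a thin wrapper around `requests.get`).

```python
import random
import time


def backoff_delays(retries, base=1.0, factor=2.0, max_delay=30.0):
    """Yield the wait before each retry: base, base*factor, ..., capped at max_delay."""
    delay = base
    for _ in range(retries):
        yield min(delay, max_delay)
        delay *= factor


def fetch_with_retries(fetch, url, retries=3, sleep=time.sleep):
    """Call fetch(url) once, then retry up to `retries` times with jittered backoff."""
    last_error = None
    for delay in [0.0, *backoff_delays(retries)]:
        if delay:
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
        try:
            return fetch(url)
        except Exception as exc:  # a real crawler would catch narrower errors
            last_error = exc
    raise last_error


# Stand-in for a real download function: fails twice, then succeeds.
attempts = []

def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


# sleep is replaced with a no-op so the demonstration runs instantly.
result = fetch_with_retries(flaky_fetch, "http://www.example.com", sleep=lambda seconds: None)
print(result, len(attempts))
```

Passing `sleep` as a parameter is a design choice made here for testability; in a real crawler the default `time.sleep` would be used.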

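4.3 Login and Session Management

import requests

login_url = 'https://example.com/login'
data_url = 'https://example.com/data'
credentials = {
    'username': 'your_username',
    'password': 'your_password'
}

# A Session stores the cookies set at login and sends them on later requests
with requests.Session() as session:
    response = session.post(login_url, data=credentials, timeout=10)
    response.raise_for_status()

    response = session.get(data_url, timeout=10)
    data = response.text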
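On the compliance point raised in the summary, a crawler can consult a site's robots.txt before requesting a URL. The sketch below uses the standard library's `urllib.robotparser` on an inline, made-up robots.txt so that it runs without network access; the crawler name `MyCrawler` and the rules are assumptions for illustration. For a live site, `set_url()` followed by `read()` would fetch the real file.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed_data = parser.can_fetch("MyCrawler", "https://example.com/data")
allowed_private = parser.can_fetch("MyCrawler", "https://example.com/private/x")
delay = parser.crawl_delay("MyCrawler")

print(allowed_data)     # whether /data may be fetched
print(allowed_private)  # whether /private/x may be fetched
print(delay)            # seconds to wait between requests, if specified
```

robots.txt expresses the site operator's wishes and is not an access control mechanism; terms of service and applicable law still apply on top of it.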

5. Example: Scraping Article Information from Jianshu

import csv
import sys
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.jianshu.com'

# Send the HTTP request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
try:
    response = requests.get(BASE_URL, headers=headers, timeout=10)
    response.raise_for_status()  # treat 4xx/5xx responses as failures
    html = response.text
except requests.RequestException as e:
    print(f"Failed to fetch page: {e}")
    sys.exit(1)

# Parse the HTML content
soup = BeautifulSoup(html, 'html.parser')

# Extract the data
articles = soup.select('.note-list li')
data = []

for article in articles:
    title_elem = article.select_one('a.title')
    author_elem = article.select_one('.name')
    if title_elem is None or author_elem is None:
        continue  # skip items that lack a title or an author

    title = title_elem.get_text(strip=True)
    author = author_elem.get_text(strip=True)
    href = urljoin(BASE_URL, title_elem.get('href', ''))

    data.append([title, author, href])

# Only one page is requested here; when crawling several pages,
# call time.sleep() between requests to limit the request rate.

# Store the data
with open('jianshu_articles.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Author', 'Link'])
    writer.writerows(data)
print("Data saved successfully.")

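Core Techniques and Principles of Python Web Scraping, with Practical Examples

Introduction

1. HTTP Requests and Responses

1.1 HTTP Requests

1.2 HTTP Responses

2. Web Page Parsing Techniques

2.1 Regular Expressions

2.2 XPath

2.3 BeautifulSoup

2.4 Extracting Data

2.5 Data Storage and Post-processing

3. Crawler Frameworks

3.1 Scrapy

3.2 BeautifulSoup + requests

3.3 Selenium

4. Other Key Techniques

4.1 User-Agent Spoofing

4.2 Anti-Crawler Measures and How to Handle Them

4.3 Login and Session Management

4.4 The Robots Protocol and Compliance

5. Example: Scraping Article Information from Jianshu

Conclusion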
