Python Web Scraping Tutorial for Beginners: Learning Web Data Collection from Scratch | 极客日志

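This tutorial introduces the basic principles and implementation of web scrapers in Python. It covers environment setup, sending HTTP requests with the Requests library, parsing HTML with BeautifulSoup, handling JavaScript-rendered pages with Selenium, and saving data as JSON or CSV. It stresses that a scraper must respect the target site's robots.txt and applicable laws and regulations, and must avoid high-frequency access that can get it blocked. Code examples and error handling are included, making it suitable for beginners who want a systematic introduction to scraping.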

Stephaine Walsh, published 2025/2/7, updated 2026/5/30, 19 views

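With the rapid growth of the internet, data has become a core asset of the information age. A web scraper is a tool that collects data from the web automatically, and it plays an important role in data analysis, market research, and academic research. Thanks to its concise syntax and rich library ecosystem, Python is a popular choice for writing web scrapers.

This tutorial takes you from zero through the basics of Python scraping: fetching pages, parsing them, and storing the results, with an emphasis on using these techniques in a compliant way.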

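1. Preparation

Before learning Python scraping, it helps to have the following background:

Python programming basics: familiarity with basic syntax, variables, functions, and data structures such as lists and dictionaries.
Networking basics: an understanding of the HTTP protocol, request methods (GET/POST), status codes, and HTML/CSS structure.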
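As a small illustration of the status codes mentioned above, the standard library's `http.HTTPStatus` enum maps numeric codes to their standard phrases. The helper name `describe_status` is hypothetical and only meant as a sketch of how the code ranges are usually grouped.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational"
    return f"{code} {status.phrase} ({category})"


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

A scraper typically treats the 4xx group as "fix the request" and the 5xx group as "try again later".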

1.1 Environment setup

Make sure Python 3.x is installed. Managing dependencies in a virtual environment, for example with venv or conda, is recommended.

python -m venv crawler_env
source crawler_env/bin/activate  # Windows: crawler_env\Scripts\activate
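To confirm that the activation step worked, one option is to compare `sys.prefix` with `sys.base_prefix` from inside Python. This is a minimal sketch; the function name `in_virtualenv` is hypothetical, and the check applies to venv-style environments only (a conda environment is not detected this way).

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
print("virtual environment active:", in_virtualenv())
```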

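1.2 Installing the required libraries

Commonly used scraping libraries include requests (sending requests), beautifulsoup4 (parsing HTML), and selenium (handling dynamic pages). The command below also installs lxml, a parser that BeautifulSoup can use in place of the built-in html.parser.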

pip install requests beautifulsoup4 selenium lxml
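After installation, a quick way to see which of these packages are present is to query their metadata. The sketch below uses `importlib.metadata` from the standard library (Python 3.8+); the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the command above
PACKAGES = ["requests", "beautifulsoup4", "selenium", "lxml"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```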

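2. Writing your first scraper

Next, we write a simple scraper that fetches a web page and extracts information from it.

2.1 Sending an HTTP request

Use the requests library to send a GET request. To mimic a real browser, a User-Agent header is usually set so that the server is less likely to reject the request.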

import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Raise an HTTPError for 4xx/5xx responses
    response.encoding = response.apparent_encoding  # Guess the encoding from the response body

    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.title.string if soup.title else 'No Title'
    print(f"Page title: {title}")
except requests.RequestException as e:
    # Covers connection errors, timeouts, and the HTTPError raised above
    print(f"Request failed: {e}")
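Since the tutorial stresses respecting robots.txt, a scraper can check a URL against the site's rules before requesting it. The sketch below uses `urllib.robotparser` from the standard library. The robots.txt content and the crawler name `MyCrawler` are hypothetical and inlined so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so no network access is needed
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Create a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch("MyCrawler", "http://example.com/index.html"))
print(parser.can_fetch("MyCrawler", "http://example.com/private/data"))
print(parser.crawl_delay("MyCrawler"))
```

Against a live site, the usual pattern is `parser.set_url("http://example.com/robots.txt")` followed by `parser.read()`, after which `can_fetch` is used in the same way.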

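3. Parsing page content

3.1 Tag navigation and lookup

# Assume html_doc holds the fetched page source (for example, response.text from section 2.1)
soup = BeautifulSoup(html_doc, 'html.parser')

# Extract all links
links = soup.find_all('a')
for link in links[:5]:  # Print only the first 5
    href = link.get('href')  # None if the tag has no href attribute
    text = link.get_text(strip=True)
    print(f"Link: {href}, text: {text}")

# Extract paragraph content
paragraphs = soup.find_all('p')
for p in paragraphs:
    content = p.get_text(strip=True)
    if content:
        print(f"Paragraph: {content[:50]}...")  # Limit the output length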
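The `href` values collected above are often relative paths, and `link.get('href')` returns `None` when the attribute is missing. A minimal sketch of normalizing them with `urllib.parse` follows; the function name `absolute_links` and the sample values are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, keeping only http(s) links."""
    resolved = []
    for href in hrefs:
        if not href:  # skip None and empty strings
            continue
        full = urljoin(base_url, href)
        if urlparse(full).scheme in ("http", "https"):
            resolved.append(full)
    return resolved


sample = ["/about", "news/today.html", None, "mailto:admin@example.com"]
for link in absolute_links("http://example.com/docs/index.html", sample):
    print(link)
```

Filtering on the scheme drops entries such as `mailto:` links that a scraper cannot fetch over HTTP.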

3.2 CSS selectors

# Find all h2 tags under a div whose class is 'content'
items = soup.select('div.content h2')
for item in items:
    print(item.get_text())
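To see the selector in action without a network request, the following sketch runs `select` and `select_one` against a small inline document. The markup and variable names are hypothetical and chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, inlined so the example runs offline
SAMPLE_HTML = """
<div class="content">
  <h2 id="intro">Introduction</h2>
  <p>First paragraph.</p>
  <h2>Details</h2>
  <a href="/next" class="nav">Next page</a>
</div>
<div class="sidebar"><h2>Ignored heading</h2></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Descendant selector: every h2 inside div.content (the sidebar h2 is excluded)
headings = [h.get_text(strip=True) for h in soup.select("div.content h2")]

# select_one returns the first match, or None when nothing matches
nav_link = soup.select_one("div.content a.nav")
next_href = nav_link.get("href") if nav_link else None

# ID selector
intro = soup.select_one("#intro")

print(headings)
print(next_href)
print(intro.get_text())
```

Guarding against `None` from `select_one` matters in practice, because page layouts change and a selector that matched yesterday may match nothing today.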

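4. Handling dynamic content

4.1 Selenium basics

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initialize the Chrome driver
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # Headless mode: no browser window is shown
driver = webdriver.Chrome(options=options)

try:
    driver.get('http://example.com')

    # Wait up to 10 seconds for a specific element to appear.
    # "main-content" is a placeholder ID; use one that exists on the target page,
    # otherwise the wait raises TimeoutException.
    wait = WebDriverWait(driver, 10)
    element = wait.until(EC.presence_of_element_located((By.ID, "main-content")))

    html = driver.page_source
    print(html[:500])  # Print part of the page source
finally:
    driver.quit()  # Always close the browser, even if an error occurred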

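5. Data storage

5.1 Saving as JSON

import json

data = [
    {"id": 1, "name": "Item A", "price": 100},
    {"id": 2, "name": "Item B", "price": 200}
]

# ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)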
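A quick round trip helps confirm that the saved file can be loaded again unchanged. This sketch writes to a temporary directory rather than `data.json`, and the record with a non-ASCII name is a hypothetical addition to show the effect of `ensure_ascii=False`.

```python
import json
import os
import tempfile

records = [
    {"id": 1, "name": "Item A", "price": 100},
    {"id": 2, "name": "商品 B", "price": 200},  # non-ASCII value
]

# Write to a throwaway location so the example leaves no file behind in the project
path = os.path.join(tempfile.mkdtemp(), "data.json")

with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=4)

with open(path, "r", encoding="utf-8") as f:
    raw_text = f.read()

loaded = json.loads(raw_text)

# With ensure_ascii=False the characters are stored literally, not as \uXXXX escapes
print("商品 B" in raw_text)
print(loaded == records)
```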

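5.2 Saving as CSV

import csv

# Reuses the data list defined in the JSON example.
# utf-8-sig adds a BOM so that Excel detects the encoding; newline='' avoids blank rows on Windows.
with open('data.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'name', 'price'])
    writer.writeheader()
    writer.writerows(data)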
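When reading such a file back, two details are easy to miss: the BOM written by `utf-8-sig` and the fact that the csv module returns every field as a string. The sketch below demonstrates both with a temporary file; the variable names are hypothetical.

```python
import csv
import os
import tempfile

rows = [
    {"id": 1, "name": "Item A", "price": 100},
    {"id": 2, "name": "Item B", "price": 200},
]
path = os.path.join(tempfile.mkdtemp(), "data.csv")

with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "price"])
    writer.writeheader()
    writer.writerows(rows)

# Reading with utf-8-sig strips the BOM, so the first header is "id" rather than "\ufeffid"
with open(path, "r", newline="", encoding="utf-8-sig") as f:
    loaded = list(csv.DictReader(f))

# Every field comes back as a string; convert numeric columns explicitly
typed = [
    {"id": int(r["id"]), "name": r["name"], "price": int(r["price"])}
    for r in loaded
]

print(loaded[0])
print(typed == rows)
```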

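6. Advanced techniques and best practices

6.1 Session management

import requests
# A Session keeps cookies and reuses connections across requests; url and headers come from section 2.1
session = requests.Session()
response = session.get(url, headers=headers, timeout=10)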
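Headers can also be set once on the session instead of being passed with every call. The sketch below uses `Session.prepare_request` to show the merged headers without sending anything over the network; the User-Agent value `my-crawler/0.1` is a hypothetical placeholder.

```python
import requests

session = requests.Session()
# Headers set here are merged into every request made through this session
session.headers.update({"User-Agent": "my-crawler/0.1"})

# prepare_request applies the session defaults without sending the request
request = requests.Request("GET", "http://example.com", headers={"Accept": "text/html"})
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
print(prepared.headers["Accept"])
session.close()
```

Per-request headers are combined with the session defaults, so a single call can still override or extend them.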

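6.3 Error retry mechanism

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times on the listed server errors, backing off between attempts
retry = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)
# Apply the retry policy to both HTTP and HTTPS requests made through this session
session.mount('http://', adapter)
session.mount('https://', adapter)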
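Retries handle failed requests, but the high-frequency access that the tutorial warns about is a separate concern: a scraper should also space out its successful requests. The following is a minimal sketch of a delay helper built on the standard library; the class name `Throttle` and the delay values are hypothetical, and a suitable delay depends on the target site.

```python
import random
import time


class Throttle:
    """Enforce a minimum, slightly randomized gap between consecutive calls."""

    def __init__(self, min_delay: float, jitter: float = 0.0):
        self.min_delay = min_delay
        self.jitter = jitter
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        slept = 0.0
        now = time.monotonic()
        if self._last_call is not None:
            target_gap = self.min_delay + random.uniform(0, self.jitter)
            remaining = target_gap - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept


throttle = Throttle(min_delay=0.2, jitter=0.1)
for page in range(3):
    throttle.wait()  # the first call returns immediately
    print(f"would fetch page {page} here")
```

Calling `throttle.wait()` before each `session.get(...)` keeps the request rate bounded regardless of how quickly the pages are processed.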

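1. Preparation
1.1 Environment setup
1.2 Installing the required libraries
2. Writing your first scraper
2.1 Sending an HTTP request
3. Parsing page content
3.1 Tag navigation and lookup
3.2 CSS selectors
4. Handling dynamic content
4.1 Selenium basics
5. Data storage
5.1 Saving as JSON
5.2 Saving as CSV
6. Advanced techniques and best practices
6.1 Session management
6.2 Handling anti-scraping measures
6.3 Error retry mechanism
7. Legal and ethical guidelines
Conclusion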
