Python Web Scraping Explained: Principles, Hands-On Practice, and Best Practices

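import requests

# Target page: IMDb search results for action feature films
url = 'https://www.imdb.com/search/title?genres=action&title_type=feature'
# Send a browser-like User-Agent so the request resembles a normal browser visit
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
# Use the encoding detected from the body so that response.text decodes correctly
response.encoding = response.apparent_encoding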
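
A request like the one above can fail for transient reasons such as timeouts or dropped connections. The following is a minimal sketch of retrying with exponential backoff; it is not taken from the original article. `fetch_with_retry`, its parameters, and the `flaky` stand-in are hypothetical names chosen for illustration. The demonstration uses a fake fetch function so that it runs without network access.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    The wait doubles after each failure: base_delay, 2 * base_delay, ...
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            # Broad on purpose for the sketch; with requests, catching
            # requests.RequestException would be the narrower choice.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demonstration with a stand-in that fails twice, then succeeds.
calls = {'count': 0}


def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('temporary failure')
    return 'page content'


delays = []
# Record the delays instead of actually sleeping.
result = fetch_with_retry(flaky, sleep=delays.append)
```

In the scraper itself, the callable could be something like `lambda: requests.get(url, headers=headers, timeout=10)`, with the default `time.sleep` left in place.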

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
# Each search result is expected to sit in a div with this class
movies = soup.find_all('div', class_='lister-item-content')

for movie in movies:
    try:
        # The title link sits inside the item's <h3>; a missing <h3> raises AttributeError
        title_tag = movie.find('h3').find('a')
        title = title_tag.text.strip() if title_tag else 'Unknown'

        genre_tag = movie.find('span', class_='genre')
        genre = genre_tag.text.strip() if genre_tag else 'N/A'

        rating_tag = movie.find('span', class_='inline-block ratings-imdb-rating')
        rating = rating_tag.text.strip() if rating_tag else 'N/A'

        # Prefix the domain to the link; leave the URL empty when there is no link
        link = title_tag.get('href', '') if title_tag else ''
        full_url = 'https://www.imdb.com' + link if link else ''

        print(f"Title: {title}, Genre: {genre}, Rating: {rating}, URL: {full_url}")
    except AttributeError:
        # Skip items whose markup does not match the expected structure
        continue
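
The parsing loop above prints each item as it goes. As a rough sketch, the same extraction can be wrapped in a function that returns records, which makes it possible to check the logic against a small HTML snippet without any network access. The name `parse_movies` and the sample markup are assumptions made for illustration; the markup only imitates the class names used above and is not real IMDb output.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used in the article's code.
SAMPLE_HTML = """
<div class="lister-item-content">
  <h3><a href="/title/example/">Example Movie</a></h3>
  <span class="genre"> Action, Drama </span>
</div>
<div class="lister-item-content">
  <p>No heading in this item</p>
</div>
"""


def parse_movies(html):
    """Return one dict per result item that has a title link."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for item in soup.find_all('div', class_='lister-item-content'):
        heading = item.find('h3')
        title_tag = heading.find('a') if heading else None
        if title_tag is None:
            continue  # skip items without a title link
        genre_tag = item.find('span', class_='genre')
        href = title_tag.get('href', '')
        records.append({
            'title': title_tag.get_text(strip=True),
            'genre': genre_tag.get_text(strip=True) if genre_tag else 'N/A',
            'url': 'https://www.imdb.com' + href if href else '',
        })
    return records


records = parse_movies(SAMPLE_HTML)
```

Checking for `None` explicitly, rather than relying on `AttributeError`, makes it easier to see which items were skipped and why.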

import pandas as pd

data_list = []
# ... collect data inside the loop ...
# data_list.append({'title': title, 'genre': genre, ...})

df = pd.DataFrame(data_list)
# utf-8-sig writes a BOM so that Excel detects the file as UTF-8
df.to_csv('imdb_movies.csv', index=False, encoding='utf-8-sig')
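
The storage step above leaves the record collection as a placeholder. The sketch below shows one way the full round trip might look, using the standard-library `csv` module in place of pandas so that it has no third-party dependencies. The two records are invented placeholders, not scraped data, and `save_records` is a hypothetical helper name.

```python
import csv
import os
import tempfile

# Hypothetical records; in the scraper these would be appended inside the parsing loop.
data_list = [
    {'title': 'Example Movie A', 'genre': 'Action', 'rating': '7.5'},
    {'title': 'Example Movie B', 'genre': 'Action, Drama', 'rating': 'N/A'},
]


def save_records(records, path):
    """Write a list of dicts to CSV, one column per key of the first record."""
    if not records:
        return 0  # nothing to write
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
    return len(records)


# Write to a temporary directory so the example leaves the working directory untouched.
out_path = os.path.join(tempfile.mkdtemp(), 'imdb_movies.csv')
written = save_records(data_list, out_path)

# Read the file back to confirm that the rows survived the round trip.
with open(out_path, newline='', encoding='utf-8-sig') as f:
    rows = list(csv.DictReader(f))
```

Values that contain commas, such as `'Action, Drama'`, are quoted automatically by the writer, so they come back as a single field.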

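What Is a Python Web Scraper?

Common Tools and Libraries

Steps for Writing a Python Scraper

1. Choose the Target Site and Analyze Its Structure

2. Send a Request to Fetch the Page

3. Parse the Page and Extract Data

4. Store the Data

5. Loops and Exception Handling

Best Practices for Python Scrapers

Advanced: Handling Dynamic Content and Anti-Scraping Measures

Legal and Ethical Compliance

Summary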
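
The outline above includes best practices and legal and ethical compliance. One common, concrete step in that direction is to consult a site's robots.txt before crawling. The sketch below uses the standard-library `urllib.robotparser` for this; it is an illustration added here, not the article's own code. The robots.txt content, the `example-crawler` user agent, and the example.com URLs are all placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = 'example-crawler'  # placeholder crawler name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether this user agent may fetch each URL under the parsed rules.
allowed_public = parser.can_fetch(USER_AGENT, 'https://example.com/public/page')
allowed_private = parser.can_fetch(USER_AGENT, 'https://example.com/private/page')

# Seconds the site asks crawlers to wait between requests, or None if unspecified.
delay = parser.crawl_delay(USER_AGENT)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` fetches and parses the file instead of the inline text. Passing a robots.txt check does not by itself settle whether scraping is permitted; a site's terms of service and applicable law still apply.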
