Python Web Scraping in Practice: Collecting Nearby Shop Information and Saving It to CSV

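Preface

This article shows how to write a simple web scraper in Python that collects detailed information about nearby shops. It uses the requests library to send HTTP requests, regular expressions to parse the returned data and extract fields such as shop name, rating, address, and contact details, and the csv module to store the results in a local file.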

Environment

Python 3.8+ interpreter
PyCharm or VS Code editor

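Dependencies

This example mainly uses the following standard-library and third-party modules:

requests: sends HTTP requests (third-party).
csv: writes data to a CSV file (standard library).
re: performs regular-expression matching (standard library).
time: controls the request rate (standard library).

Only requests needs to be installed; the other modules ship with Python. Install command:

pip install requests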

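Basic Approach

Analyzing the Data Source

Before collecting anything, work out how the target site structures its data. The usual way is to inspect the API requests in the Network panel of the browser's developer tools (F12).

Identify the target URL: find the API endpoint that returns the shop list.
Analyze the parameters: check which parameters the request requires (page number, keyword, token, and so on).
Confirm the response format: usually JSON, containing each shop's basic information and a link to its detail page.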
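As a rough illustration of confirming the response format, the sketch below parses a made-up JSON payload and pulls out the list of shop records. The payload and every key name in it (`data`, `items`, `id`, `name`, `rating`) are invented for this example; a real API uses its own names, which you can read from the response preview in the Network panel.

```python
import json

# Hypothetical payload; a real API will use different key names.
SAMPLE_RESPONSE = '''
{
  "data": {
    "total": 2,
    "items": [
      {"id": 101, "name": "Shop A", "rating": 4.5},
      {"id": 102, "name": "Shop B", "rating": 4.1}
    ]
  }
}
'''


def extract_items(payload_text):
    """Return the list of shop records, or an empty list if the keys are absent."""
    payload = json.loads(payload_text)
    return payload.get("data", {}).get("items", [])


items = extract_items(SAMPLE_RESPONSE)
for item in items:
    print(item["id"], item["name"], item["rating"])
```

Chaining `.get()` with empty defaults means a response that lacks the expected keys yields an empty list instead of raising `KeyError`, which keeps a long crawl from stopping on one odd page.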

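Code Workflow

Build the request: set headers (User-Agent, Referer, and so on) to mimic a browser.
Send the request: use requests.get or requests.post to obtain the response.
Parse the data: extract the key fields (name, price, rating, and so on) from the response.
Add the details: follow each detail-page link to fetch the phone number, opening hours, and similar information.
Save the data: write the assembled list of dictionaries to a CSV file.
Handle pagination: loop over the offset or page number to collect multiple pages.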
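The last two steps, saving and pagination, can be tried out without touching the network. The following is a minimal sketch under stated assumptions: the helper names `page_offsets` and `write_rows`, the page size of 32, and the two sample rows are all invented for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io


def page_offsets(total, page_size):
    """Return the offset of each page: 0, page_size, 2 * page_size, ..."""
    return list(range(0, total, page_size))


def write_rows(file_obj, fieldnames, rows):
    """Write a header line followed by one CSV line per dictionary."""
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Offsets for 100 records fetched 32 at a time (both numbers are made up).
offsets = page_offsets(total=100, page_size=32)

rows = [
    {"name": "Shop A", "rating": 4.5},
    {"name": "Shop B", "rating": 4.1},
]
buffer = io.StringIO()  # in-memory stand-in for a real file
write_rows(buffer, ["name", "rating"], rows)

print(offsets)
print(buffer.getvalue())
```

When writing to a real file, open it with `newline=''` as the csv module documentation recommends, so that no extra blank lines appear on Windows.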

Code

The following complete example shows how to collect the shop list and save it as a CSV file.

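import requests
import re
import csv
import time
import os

# NOTE: the numeric and string literals of this listing were lost in extraction.
# Every value marked "assumed" or "placeholder" is a stand-in, not the original,
# and must be adapted to the target site.
REQUEST_TIMEOUT = 10  # seconds; assumed value

# Placeholder patterns; write your own after inspecting the detail page source.
PHONE_PATTERN = r'"phone":"(.*?)"'
OPEN_TIME_PATTERN = r'"openTime":"(.*?)"'
ADDRESS_PATTERN = r'"address":"(.*?)"'


def get_shop_info(html_url):
    """
    Fetch the details of a single shop (address, phone, opening hours).
    :param html_url: URL of the shop's detail page
    :return: list of the form [address, phone, open_time]
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
        'Referer': 'https://www.example.com/',
    }
    try:
        response = requests.get(url=html_url, headers=headers, timeout=REQUEST_TIMEOUT)
        response.encoding = response.apparent_encoding
        text = response.text

        phone = re.findall(PHONE_PATTERN, text)
        open_time = re.findall(OPEN_TIME_PATTERN, text)
        address = re.findall(ADDRESS_PATTERN, text)

        # Use the first match of each field only when all three were found.
        if phone and open_time and address:
            # The arguments of replace() are assumed: escaped line breaks become spaces.
            return [address[0], phone[0], open_time[0].replace('\\n', ' ')]
        else:
            return ['', '', '']  # placeholder values for missing details
    except Exception as e:
        print(f'Failed to fetch shop details: {e}')
        return ['', '', '']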

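def main():
    # The original URL and file name were lost in extraction; both are placeholders.
    base_url = 'https://www.example.com/'  # placeholder for the shop-list API endpoint
    output_file = 'shops.csv'  # placeholder file name

    # CSV columns. The original names were lost; these ten are illustrative.
    fieldnames = [
        'name', 'avg_price', 'rating', 'comment_count',
        'area', 'category', 'address', 'phone',
        'open_time', 'detail_url'
    ]

    # Append mode keeps rows from earlier runs; newline='' avoids blank lines on Windows.
    with open(output_file, mode='a', encoding='utf-8', newline='') as f:
        csv_writer = csv.DictWriter(f, fieldnames=fieldnames)

        # Write the header only when the file is still empty.
        if os.path.getsize(output_file) == 0:
            csv_writer.writeheader()

        # Step through the list by offset; the range bounds and step are assumed values.
        for page in range(0, 320, 32):
            time.sleep(1)  # assumed delay between requests, in seconds

            # The original listed seven query parameters, one of which carried the
            # offset. Their names and values were lost, so the two below are
            # hypothetical; copy the real ones from the Network panel.
            params = {
                'limit': 32,
                'offset': page,
            }

            # Assumed to mirror the headers used in get_shop_info.
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
                'Referer': 'https://www.example.com/',
            }

            try:
                # timeout is an added safeguard with an assumed value.
                response = requests.get(url=base_url, params=params, headers=headers, timeout=10)
                # 'data' and 'items' are hypothetical keys; match them to the real JSON.
                result = response.json().get('data', {}).get('items', [])

                for index in result:
                    shop_id = index.get('id')  # hypothetical key
                    detail_url = f'https://www.example.com/shop/{shop_id}'  # placeholder URL pattern

                    # Fetch address, phone, and opening hours from the detail page.
                    shop_detail = get_shop_info(detail_url)

                    # The keys passed to index.get() are hypothetical.
                    data_dict = {
                        'name': index.get('name'),
                        'avg_price': index.get('avg_price'),
                        'rating': index.get('rating'),
                        'comment_count': index.get('comment_count'),
                        'area': index.get('area'),
                        'category': index.get('category'),
                        'address': shop_detail[0],
                        'phone': shop_detail[1],
                        'open_time': shop_detail[2],
                        'detail_url': detail_url,
                    }

                    csv_writer.writerow(data_dict)
                    print(data_dict)

            except Exception as e:
                # Report the failure and move on to the next offset.
                print(f'Failed to fetch offset {page}: {e}')


if __name__ == '__main__':
    main()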
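The detail-page step in get_shop_info extracts each field with a regular expression and falls back to default values when a field is missing. One possible way to make that fallback explicit per field is a small helper, sketched below. The HTML fragment, the patterns, and the helper name `first_match` are all made up for illustration and do not come from any real site.

```python
import re

# Made-up fragment of a detail page; real pages will differ.
SAMPLE_HTML = (
    '<script>{"address":"1 Example Road",'
    '"phone":"010-12345678","openTime":"09:00-21:00"}</script>'
)


def first_match(pattern, text, default=""):
    """Return the first capture group matched by pattern, or default if none."""
    match = re.search(pattern, text)
    return match.group(1) if match else default


address = first_match(r'"address":"(.*?)"', SAMPLE_HTML)
phone = first_match(r'"phone":"(.*?)"', SAMPLE_HTML)
# The sample contains no fax field, so the default is returned.
fax = first_match(r'"fax":"(.*?)"', SAMPLE_HTML, default="N/A")

print(address, phone, fax)
```

With a helper like this, one missing field no longer blanks out the other two, which differs from the all-or-nothing check in the listing above; whether that is preferable depends on how the data will be used.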
