Python Web Scraping in Practice: Scraping Hotel Price Data from Ctrip

Ne0inhk

15 Mar 2026 — 14 min read

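Preface

Ctrip (携程旅行) is one of China's leading online travel platforms. Its hotel price data covers real-time room rates, room types, promotions, and user ratings, which makes it an important source for travel data analysis, price monitoring, and competitor analysis. Unlike the static pages of cnblogs (博客园), Ctrip hotel pages combine dynamic loading, anti-scraping verification, and data encryption, so they are harder to scrape. This article works through page analysis, counter-anti-scraping strategies, and dynamic data extraction to show how to scrape Ctrip hotel price data with Python, work around the platform's restrictions, and obtain structured hotel price information.

Abstract

This article covers the full workflow of a Ctrip hotel price crawler. It focuses on three techniques: scraping dynamically loaded data, handling request headers and encrypted parameters, and crawling across pages and cities. It uses requests to send HTTP requests, jsonpath to parse JSON, and fake-useragent to vary the request fingerprint, and it walks through a practical case that extracts the hotel name, price, room type, rating, and location. Target: the Ctrip hotel search page. The end result is a crawler script that supports multiple cities and multiple dates, together with a basic approach to data cleaning and visualization.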

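1. Technical Principles and Environment Setup

1.1 Core Technical Principles

The main difficulties in scraping Ctrip hotel prices, and how to approach each:

The hotel list is returned as JSON by an asynchronous interface instead of being rendered into the HTML, so the data interface must first be located by capturing traffic;
The request headers include validation fields such as Referer and Origin, and some interfaces return complete data only when a Cookie is sent;
High-frequency requests trigger IP bans or CAPTCHAs, so proxy IPs, request delays, and User-Agent randomization are needed as countermeasures;
Prices change in real time, and the interface parameters carry dynamic values such as a timestamp and a city code, so the request parameters must be built dynamically.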
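The delay and User-Agent countermeasures listed above can be sketched with the standard library alone. This is a minimal sketch, not the article's implementation: the `USER_AGENTS` pool and the `next_request_profile` helper are hypothetical names, and in the article the User-Agent comes from fake-useragent instead of a fixed list.

```python
import random

# Hypothetical pool of User-Agent strings (an assumption for this sketch);
# the article itself draws them from the fake-useragent library.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def next_request_profile(base_delay=3.0, jitter=2.0, rng=random):
    """Return (headers, delay) for the next request.

    The delay falls in [base_delay, base_delay + jitter] so that requests
    are not sent at a perfectly regular interval.
    """
    headers = {"User-Agent": rng.choice(USER_AGENTS)}
    delay = base_delay + rng.uniform(0.0, jitter)
    return headers, delay


headers, delay = next_request_profile()
print(headers["User-Agent"][:30], round(delay, 2))
```

A caller would sleep for `delay` seconds and merge `headers` into its own request headers before each request.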

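1.2 Environment

Tool / Library	Version	Purpose
Python	3.9+ (required by pandas 2.1)	Core language
requests	2.31.0	Sends HTTP requests
jsonpath	0.82	Parses JSON (extracts nested fields)
fake-useragent	1.4.0	Generates random User-Agent strings
python-dotenv	1.0.0	Manages environment variables (stores sensitive parameters)
pandas	2.1.4	Data cleaning and structured storage
openpyxl	any recent release	Excel writer used by pandas `to_excel`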

Installation command

bash

pip install requests jsonpath fake-useragent python-dotenv pandas openpyxl
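After installing, it can help to confirm which versions actually ended up in the environment. The following is a small sketch using only the standard library; the `installed_versions` helper is a hypothetical name introduced here, and the package list simply mirrors the libraries this article uses (openpyxl is included because the crawler saves Excel files through it).

```python
from importlib import metadata

# Distribution names as used on PyPI; openpyxl is needed by
# DataFrame.to_excel(engine="openpyxl") in the crawler.
PACKAGES = [
    "requests",
    "jsonpath",
    "fake-useragent",
    "python-dotenv",
    "pandas",
    "openpyxl",
]


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```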

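2. Hands-On Development: The Ctrip Hotel Price Crawler

2.1 Breaking Down the Approach

Capture traffic to locate the Ctrip hotel list data interface (Chrome F12, Network panel);
Analyze the interface's request parameters (city code, check-in / check-out dates, page number, and so on);
Build request headers and parameters that match what the interface expects, imitating a real request;
Send the request, receive the JSON, and extract the core fields with jsonpath;
Crawl multiple pages automatically, handling the pagination indicator the interface returns;
Clean the data and store it in structured form, exporting Excel or JSON.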
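The field-extraction step above can be illustrated without any third-party library. This is a sketch under stated assumptions: `find_first` is a hypothetical standard-library stand-in for a recursive jsonpath query such as `$..hotelList`, and the `sample` payload only imitates the shape the article's code expects; the real response must be checked in the Network panel.

```python
def find_first(data, key):
    """Depth-first search for the first non-None value stored under `key`."""
    if isinstance(data, dict):
        if data.get(key) is not None:
            return data[key]
        children = data.values()
    elif isinstance(data, list):
        children = data
    else:
        return None
    for child in children:
        found = find_first(child, key)
        if found is not None:
            return found
    return None


# Hypothetical response shape, for illustration only.
sample = {
    "code": 200,
    "data": {"hotelList": [{"hotelName": "Hotel A", "lowPrice": 199}]},
}

hotels = find_first(sample, "hotelList") or []
rows = [
    {"name": h.get("hotelName", "unknown"), "price": h.get("lowPrice", 0)}
    for h in hotels
]
print(rows)
```

Using `.get` with a default for each field gives the same protection against missing keys that the article gets from jsonpath.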

2.2 Full Code

python

import json
import os
import time

import pandas as pd
import requests
from dotenv import load_dotenv
from fake_useragent import UserAgent
from jsonpath import jsonpath

# Load environment variables so sensitive values are not hard-coded
load_dotenv()


def first_match(obj, expr, default=None):
    """Return the first jsonpath match for expr, or default when nothing matches."""
    result = jsonpath(obj, expr)  # jsonpath returns False when there is no match
    if not result or result[0] is None:
        return default
    return result[0]


class CtripHotelSpider:
    def __init__(self):
        """Initialize the crawler configuration."""
        self.ua = UserAgent()
        self.headers = {
            "User-Agent": self.ua.random,
            "Referer": "https://hotels.ctrip.com/",
            "Origin": "https://hotels.ctrip.com",
            "Accept": "application/json, text/plain, */*",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
            "Connection": "keep-alive",
            # Copy the Cookie from the browser's Network panel; keep it in the .env file
            "Cookie": os.getenv("CTRIP_COOKIE", ""),
            "Content-Type": "application/json;charset=UTF-8",
        }
        self.timeout = 15  # request timeout in seconds
        self.delay = 3  # delay between requests in seconds (Ctrip is strict; 3-5 recommended)
        self.all_hotel_data = []  # all hotels collected so far
        # Interface configuration (Ctrip hotel list interface, simplified)
        self.api_url = "https://hotels.ctrip.com/hotel/api/search/poiList"

    def build_request_params(self, city_code="北京", check_in="2026-01-20",
                             check_out="2026-01-22", page=1):
        """
        Build the interface request parameters.
        :param city_code: city name or numeric code (for example 北京 or 1)
        :param check_in: check-in date (YYYY-MM-DD)
        :param check_out: check-out date (YYYY-MM-DD)
        :param page: page number
        :return: dict of request parameters
        """
        now_ms = int(time.time() * 1000)
        params = {
            "cityId": self.get_city_code(city_code),  # city name -> code
            "checkIn": check_in,
            "checkOut": check_out,
            "pageNum": page,
            "pageSize": 20,  # 20 records per page
            "sortType": "9",  # 9 = price ascending, 1 = recommended
            "allianceid": "4902",
            "sid": "29155245",
            "callback": "jQuery1830" + str(now_ms),  # dynamic JSONP callback name
            "_": now_ms,  # timestamp parameter
        }
        return params

    def get_city_code(self, city_name):
        """
        Simple city name -> code lookup (extend with a full city table as needed).
        Verify each code against the cityId seen in a captured request.
        :param city_name: city name, or a numeric code that is passed through
        :return: city code
        """
        if isinstance(city_name, int) or str(city_name).isdigit():
            return int(city_name)
        city_code_map = {
            "北京": 1,  # Beijing
            "上海": 2,  # Shanghai
            "广州": 3,  # Guangzhou
            "深圳": 4,  # Shenzhen
            "杭州": 5,  # Hangzhou
            "成都": 6,  # Chengdu
        }
        if city_name not in city_code_map:
            print(f"未知城市：{city_name}，使用默认城市北京")
        return city_code_map.get(city_name, 1)  # falls back to Beijing

    def get_hotel_data(self, params):
        """
        Send the request and return the hotel data.
        :param params: request parameters
        :return: parsed JSON data, or None on failure
        """
        try:
            # Delay each request to stay under the anti-scraping threshold
            time.sleep(self.delay)
            # Pick a fresh User-Agent for every request, not just once at start-up
            self.headers["User-Agent"] = self.ua.random
            response = requests.get(
                url=self.api_url,
                headers=self.headers,
                params=params,
                timeout=self.timeout,
            )
            response.raise_for_status()
            # The interface may answer in JSONP form; strip the callback wrapper
            text = response.text.strip()
            if text.startswith("jQuery"):
                text = text[text.find("(") + 1:text.rfind(")")]
            return json.loads(text)
        except requests.exceptions.RequestException as e:
            print(f"请求接口失败：{e}")
            return None
        except json.JSONDecodeError as e:
            print(f"JSON解析失败：{e}")
            return None

    def parse_hotel_data(self, json_data):
        """
        Parse the hotel data and extract the core fields.
        :param json_data: JSON returned by the interface
        :return: list of structured hotel records
        """
        if not json_data or json_data.get("code") != 200:
            print("接口返回异常或无数据")
            return []
        hotel_list = []
        # jsonpath copes with the nested structure; missing keys fall back to defaults
        hotels = first_match(json_data, "$..hotelList", [])
        for hotel in hotels:
            try:
                hotel_info = {
                    "酒店名称": first_match(hotel, "$.hotelName", "未知酒店"),  # hotel name
                    "酒店ID": first_match(hotel, "$.hotelId", ""),  # hotel ID
                    "最低价格(元)": first_match(hotel, "$.lowPrice", 0),  # lowest price (CNY)
                    "酒店评分": first_match(hotel, "$.hotelScore", 0.0),  # rating
                    "酒店星级": first_match(hotel, "$.starRating", "无星级"),  # star rating
                    "房型": first_match(hotel, "$.roomTypeName", "未知房型"),  # room type
                    "地址": first_match(hotel, "$.address", "未知地址"),  # address
                    "距离商圈(km)": first_match(hotel, "$.distance", "未知距离"),  # distance to business district
                }
                hotel_list.append(hotel_info)
            except Exception as e:
                print(f"解析单条酒店数据失败：{e}")
                continue
        return hotel_list

    def crawl_hotels(self, city="北京", check_in="2026-01-20",
                     check_out="2026-01-22", max_page=3):
        """
        Crawl hotel data for one city over several pages.
        :param city: city name
        :param check_in: check-in date
        :param check_out: check-out date
        :param max_page: maximum number of pages to crawl
        :return: all hotel data collected
        """
        print(f"开始抓取{city}市{check_in}至{check_out}的酒店数据，共抓取{max_page}页...")
        for page in range(1, max_page + 1):
            print(f"正在抓取第{page}页...")
            params = self.build_request_params(city, check_in, check_out, page)
            json_data = self.get_hotel_data(params)
            page_hotels = self.parse_hotel_data(json_data)
            if not page_hotels:
                print(f"第{page}页无数据，停止抓取")
                break
            self.all_hotel_data.extend(page_hotels)
            print(f"第{page}页抓取完成，共{len(page_hotels)}家酒店")
        print(f"抓取完成！总计获取{len(self.all_hotel_data)}家酒店数据")
        return self.all_hotel_data

    def save_to_excel(self, file_path="ctrip_hotel_prices.xlsx"):
        """
        Save the hotel data to an Excel file for later analysis.
        :param file_path: output path
        """
        if not self.all_hotel_data:
            print("无数据可保存")
            return
        try:
            df = pd.DataFrame(self.all_hotel_data)
            df = df.sort_values(by="最低价格(元)", ascending=True)  # sort by price
            df.to_excel(file_path, index=False, engine="openpyxl")  # needs openpyxl installed
            print(f"数据已成功保存到：{file_path}")
        except Exception as e:
            print(f"保存Excel失败：{e}")

    def print_sample_results(self, sample_num=5):
        """
        Print the first N results as a sample.
        :param sample_num: number of records to print
        """
        if not self.all_hotel_data:
            print("无抓取结果")
            return
        print(f"\n===== 携程酒店价格抓取结果（前{sample_num}条） =====")
        df = pd.DataFrame(self.all_hotel_data[:sample_num])
        print(df.to_string(index=False))


if __name__ == "__main__":
    spider = CtripHotelSpider()
    # Crawl Beijing hotels for 2026-01-20 to 2026-01-22 (first 3 pages)
    spider.crawl_hotels(
        city="北京",
        check_in="2026-01-20",
        check_out="2026-01-22",
        max_page=3,
    )
    spider.print_sample_results(sample_num=5)
    spider.save_to_excel()
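The crawler passes `check_in` and `check_out` to the interface as plain strings, so a malformed or reversed date range would only show up as an empty response. A small validation step before crawling can catch that earlier. The sketch below is an optional addition, not part of the article's script; `validate_stay` and `stay_windows` are hypothetical helper names, and `stay_windows` is one possible way to generate the multi-date inputs mentioned in the abstract.

```python
from datetime import date, timedelta


def validate_stay(check_in, check_out, today=None):
    """Parse YYYY-MM-DD strings and return the number of nights.

    Raises ValueError for malformed dates, a check-in in the past,
    or a check-out that is not after the check-in.
    """
    start = date.fromisoformat(check_in)
    end = date.fromisoformat(check_out)
    today = today or date.today()
    if start < today:
        raise ValueError(f"check-in {check_in} is in the past")
    if end <= start:
        raise ValueError("check-out must be after check-in")
    return (end - start).days


def stay_windows(first_check_in, nights, count):
    """Return `count` (check_in, check_out) pairs, each shifted by one day."""
    start = date.fromisoformat(first_check_in)
    windows = []
    for offset in range(count):
        check_in = start + timedelta(days=offset)
        check_out = check_in + timedelta(days=nights)
        windows.append((check_in.isoformat(), check_out.isoformat()))
    return windows


print(stay_windows("2026-01-20", nights=2, count=3))
```

Each pair could then be passed to `crawl_hotels(city, check_in, check_out)` in a loop, one city at a time.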

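2.3 Environment Variables (.env File)

Create a .env file in the project root and put your Ctrip Cookie in it (copied from the browser's captured request):

env

CTRIP_COOKIE="your Ctrip cookie"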
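Because the crawler falls back to an empty string when `CTRIP_COOKIE` is not set, a missing .env entry fails silently. A fail-fast check is one way to surface that. This is a minimal sketch; `require_cookie` is a hypothetical helper name and is not part of the article's script.

```python
import os


def require_cookie(env=os.environ, name="CTRIP_COOKIE"):
    """Return the cookie value from the environment, or raise if it is empty."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to the .env file before crawling"
        )
    return value


# Usage with an explicit mapping, so the example does not depend on a real .env
print(require_cookie({"CTRIP_COOKIE": "a=1; b=2"}))
```

In the crawler this would be called after `load_dotenv()`, so that the value read from .env is already in the environment.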

2.4 Sample Output

plaintext

开始抓取北京市2026-01-20至2026-01-22的酒店数据，共抓取3页...
正在抓取第1页...
第1页抓取完成，共20家酒店
正在抓取第2页...
第2页抓取完成，共20家酒店
正在抓取第3页...
第3页抓取完成，共20家酒店
抓取完成！总计获取60家酒店数据

===== 携程酒店价格抓取结果（前5条） =====
酒店名称  酒店ID  最低价格(元)  酒店评分  酒店星级  房型  地址  距离商圈(km)
北京如家快捷酒店（天安门店）  12345678  199  4.5  二星级  标准双人间  北京市东城区东单大街1号  0.5
北京7天连锁酒店（王府井店）  12345679  219  4.3  二星级  大床房A  北京市东城区王府井大街88号  0.8
北京汉庭酒店（西单店）  12345680  249  4.6  三星级  高级大床房  北京市西城区西单北大街12号  1.2
北京锦江之星（北京站店）  12345681  269  4.4  三星级  商务标准间  北京市东城区北京站东街5号  1.5
北京全季酒店（崇文门店）  12345682  299  4.8  四星级  豪华大床房  北京市东城区崇文门西大街9号  1.0
数据已成功保存到：ctrip_hotel_prices.xlsx

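2.5 How the Core Code Works

Code module	Principle	Key role
`__init__` method	Initializes the request headers (random UA, Cookie), the interface URL, and other settings	Imitates a real user's request; the Cookie gets past basic authentication
`build_request_params` method	Builds the interface parameters dynamically (timestamp, city code)	Satisfies the interface's dynamic parameter checks; supports multiple cities and dates
`get_hotel_data` method	Sends the GET request and handles a JSONP-formatted response	Strips the JSONP wrapper so the data parses correctly
`parse_hotel_data` method	Extracts nested JSON fields with jsonpath	Avoids KeyError and parses multi-level hotel data
`crawl_hotels` method	Loops over page numbers to crawl several pages	Automates pagination and stops at the first empty page
`save_to_excel` method	Converts to a DataFrame, sorts, and saves as Excel	Stores the data in structured form for price analysis and visualization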
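The JSONP handling in `get_hotel_data` relies on finding the first `(` and the last `)` in the response. A slightly stricter variant can be sketched with a regular expression that only unwraps text that really looks like `callbackName(...)`. This is an illustrative alternative, not the article's code; `unwrap_jsonp` is a hypothetical name and the sample payload is made up.

```python
import json
import re

# Matches  callbackName( ... )  with an optional trailing semicolon
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(text):
    """Parse a JSONP or plain JSON response body into Python objects."""
    match = _JSONP.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Made-up response bodies for illustration
print(unwrap_jsonp('jQuery18301700000000000({"code": 200, "data": []});'))
print(unwrap_jsonp('{"code": 200}'))
```

Plain JSON never starts with an identifier followed by a parenthesis, so it falls through to `json.loads` unchanged.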

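3. Counter-Anti-Scraping Optimizations

3.1 Proxy IP Pool Integration

Ctrip is sensitive to high-frequency requests from a single IP. Integrating a proxy pool lets the crawler rotate IPs:

python

def get_proxy(self):
    """Fetch a proxy from the proxy pool (example; replace with your own pool's interface)."""
    try:
        proxy_response = requests.get("http://your-proxy-pool-host/get_proxy", timeout=10)
        proxy = proxy_response.json().get("proxy")
        if not proxy:
            return None
        # The scheme describes how to reach the proxy itself, so an HTTP proxy
        # uses http:// for both keys, including for HTTPS targets
        return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    except Exception as e:
        print(f"获取代理失败：{e}")
        return None

# Add the proxies argument inside get_hotel_data:
# proxies = self.get_proxy()
# response = requests.get(url=self.api_url, headers=self.headers, params=params,
#                         proxies=proxies, timeout=self.timeout)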
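If a remote proxy pool service is not available, rotation over a fixed list of proxies can be sketched locally. This is a minimal sketch under stated assumptions: `ProxyRotator` is a hypothetical class, the addresses come from the reserved documentation range 192.0.2.0/24 and are not real proxies, and the proxies are assumed to be plain HTTP proxies.

```python
from itertools import cycle


class ProxyRotator:
    """Hand out proxies from a fixed list in round-robin order."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one proxy address is required")
        self._pool = cycle(addresses)

    def next_proxies(self):
        """Return a dict in the shape expected by requests' `proxies` argument."""
        proxy_url = f"http://{next(self._pool)}"
        return {"http": proxy_url, "https": proxy_url}


# Placeholder addresses (documentation range), for illustration only
rotator = ProxyRotator(["192.0.2.10:8080", "192.0.2.11:8080"])
for _ in range(3):
    print(rotator.next_proxies())
```

The returned dict can be passed as `proxies=` to `requests.get`, one rotation step per request.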

3.2 Automatic Cookie Refresh

When the Cookie expires, a new one can be obtained by logging in through a Selenium-driven browser (requires `pip install selenium`):

python

import time

from dotenv import set_key
from selenium import webdriver

def refresh_cookie(self):
    """Open Ctrip in a Selenium-driven browser, wait for a manual login, and capture the new Cookie."""
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.ctrip.com/")
        print("请手动完成登录，30秒后继续...")
        time.sleep(30)
        # Read the cookies after login
        cookies = driver.get_cookies()
    finally:
        driver.quit()  # close the browser even if something above fails
    cookie_str = "; ".join(f"{c['name']}={c['value']}" for c in cookies)
    self.headers["Cookie"] = cookie_str
    # set_key updates only CTRIP_COOKIE and keeps any other entries in .env
    set_key(".env", "CTRIP_COOKIE", cookie_str)
    print("Cookie已更新并保存")
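To decide when a refresh is needed, the expiry timestamps in the cookie list can be inspected. The sketch below assumes cookie dicts in the shape Selenium's `get_cookies()` returns, where `expiry` is an optional Unix timestamp that is absent for session cookies. `cookie_header` and `expires_soon` are hypothetical helper names, and the sample cookies are made up.

```python
import time


def cookie_header(cookies):
    """Join cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def expires_soon(cookies, now=None, margin=300):
    """Return True if any cookie with an expiry runs out within `margin` seconds.

    Session cookies (no 'expiry' key) are ignored.
    """
    now = time.time() if now is None else now
    return any(c["expiry"] - now <= margin for c in cookies if "expiry" in c)


# Made-up cookies for illustration
sample = [
    {"name": "a", "value": "1", "expiry": 1000},
    {"name": "b", "value": "2"},  # session cookie
]
print(cookie_header(sample))
print(expires_soon(sample, now=800))
```

A crawler could call `expires_soon` before each crawl and trigger a login refresh only when it returns True.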

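3.3 Dynamic Request-Rate Adjustment

Adjust the request delay according to the status the interface returns:

python

def adjust_delay(self, response_status):
    """
    Adjust the delay according to the response status.
    Call this before response.raise_for_status(), which raises on a 429.
    :param response_status: HTTP status code
    """
    if response_status == 429:  # too many requests
        self.delay += 2
        print(f"请求被限制，延迟调整为{self.delay}秒")
    elif response_status == 200 and self.delay > 3:
        self.delay = max(3, self.delay - 0.5)  # never drop below the 3-second floor
        print(f"请求正常，延迟调整为{self.delay}秒")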
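The additive adjustment above raises the delay by a fixed 2 seconds per rejection. A common alternative is exponential backoff with a ceiling, which reacts faster to repeated rejections. The following is a sketch of that variant as a pure function, not the article's method; `next_delay` and its default values are assumptions chosen for illustration.

```python
def next_delay(current, status, base=3.0, ceiling=60.0):
    """Return the delay to use after a response with the given status.

    429 doubles the delay up to `ceiling`; 200 eases it back by 0.5 s
    down to `base`; any other status leaves it unchanged.
    """
    if status == 429:
        return min(current * 2, ceiling)
    if status == 200:
        return max(current - 0.5, base)
    return current


delay = 3.0
for status in (429, 429, 200, 200):
    delay = next_delay(delay, status)
    print(status, delay)
```

Keeping the rule free of side effects makes it easy to test separately from the network code.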

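4. Extending and Analyzing the Data

4.1 Price Trend Monitoring

Extend the crawler to run on a schedule and monitor price changes (requires `pip install schedule`):

python

import time

import schedule

# CtripHotelSpider is the class defined in section 2.2

def scheduled_crawl():
    """Scheduled crawl task."""
    spider = CtripHotelSpider()
    spider.crawl_hotels(city="北京", check_in="2026-01-20", check_out="2026-01-22", max_page=1)
    # Timestamped file name so each run keeps its own snapshot
    spider.save_to_excel(f"ctrip_hotel_prices_{time.strftime('%Y%m%d_%H%M%S')}.xlsx")

# Crawl once at 09:00 and once at 18:00 every day
schedule.every().day.at("09:00").do(scheduled_crawl)
schedule.every().day.at("18:00").do(scheduled_crawl)

# Start the scheduler loop
while True:
    schedule.run_pending()
    time.sleep(60)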
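Once several snapshots exist, monitoring comes down to comparing prices for the same hotel across runs. This is a minimal sketch of that comparison on plain dictionaries; `price_changes` is a hypothetical helper, and the hotel IDs and prices are made-up sample values.

```python
def price_changes(previous, current):
    """Compare two {hotel_id: price} snapshots.

    Returns {hotel_id: (old_price, new_price)} for hotels present in both
    snapshots whose price differs.
    """
    return {
        hotel_id: (previous[hotel_id], price)
        for hotel_id, price in current.items()
        if hotel_id in previous and previous[hotel_id] != price
    }


# Made-up snapshots for illustration
morning = {"12345678": 199, "12345679": 219, "12345680": 249}
evening = {"12345678": 189, "12345679": 219, "12345681": 269}
print(price_changes(morning, evening))
```

With the article's records, the snapshots could be built from the `酒店ID` and `最低价格(元)` columns of each saved file.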

4.2 Price Visualization

Plot a histogram of the hotel price distribution with matplotlib:

bash

pip install matplotlib

python

import matplotlib.pyplot as plt

def visualize_price_distribution(self):
    """Plot the distribution of hotel prices (add as a method of CtripHotelSpider)."""
    if not self.all_hotel_data:
        print("无数据可可视化")
        return
    # Use a font with Chinese glyphs; SimHei must be installed on the system
    plt.rcParams["font.sans-serif"] = ["SimHei"]
    plt.rcParams["axes.unicode_minus"] = False
    # Collect the price values
    prices = [hotel["最低价格(元)"] for hotel in self.all_hotel_data]
    # Draw the histogram
    plt.figure(figsize=(10, 6))
    plt.hist(prices, bins=10, color="skyblue", edgecolor="black")
    plt.title("携程北京酒店价格分布")  # Ctrip Beijing hotel price distribution
    plt.xlabel("价格（元）")  # price (CNY)
    plt.ylabel("酒店数量")  # number of hotels
    plt.grid(axis="y", alpha=0.75)
    plt.savefig("hotel_price_distribution.png")
    plt.show()
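Alongside the histogram, a few summary statistics give a quick numeric view of the same distribution. The sketch below uses only the standard library; `price_summary` is a hypothetical helper, and it drops prices of 0 on the assumption that 0 is the parser's placeholder for a missing price. The sample prices are the five values from the sample output plus one placeholder.

```python
import statistics


def price_summary(prices):
    """Summarize a list of prices, ignoring non-numeric and non-positive values."""
    clean = sorted(p for p in prices if isinstance(p, (int, float)) and p > 0)
    if not clean:
        return None
    return {
        "count": len(clean),
        "min": clean[0],
        "median": statistics.median(clean),
        "max": clean[-1],
        "mean": round(statistics.fmean(clean), 2),
    }


print(price_summary([199, 219, 249, 269, 299, 0]))
```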

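5. Notes and Compliance Statement

Compliance: scrape Ctrip hotel data only for personal study and research. Do not use it for commercial competition, price manipulation, or other prohibited purposes, and comply with China's Cybersecurity Law (《网络安全法》) and Ctrip's user agreement;
Respect for anti-scraping measures: keep the request rate strictly limited (3-5 seconds per request is recommended) so as not to burden Ctrip's servers, and never use a crawler for malicious attacks;
Data freshness: hotel prices, availability, and similar data change in real time, so the results reflect only the moment they were scraped;
Cookie security: the Cookie contains personal login information. Never share it with others, and replace it regularly to reduce the risk to your account.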
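The recommended pacing of 3-5 seconds per request can be enforced in code so that no call path bypasses it by accident. This is a minimal sketch; `MinIntervalLimiter` is a hypothetical class, and the injectable `clock` and `sleep` parameters exist only so the behavior can be checked without real waiting.

```python
import time


class MinIntervalLimiter:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demonstration with a fake clock so the example finishes instantly
fake_time = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_time[0] += seconds


limiter = MinIntervalLimiter(3.0, clock=lambda: fake_time[0], sleep=fake_sleep)
limiter.wait()          # first call: no wait
fake_time[0] += 1.0     # one second of other work
limiter.wait()          # waits the remaining two seconds
print(slept)
```

In real use the defaults apply, and `limiter.wait()` would be called immediately before each request.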

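Summary

The core of the Ctrip hotel price crawler is locating the asynchronous data interface and building request parameters it accepts, then parsing the nested JSON with jsonpath to extract exactly the fields needed;
The keys to getting past anti-scraping measures are randomized request fingerprints (UA and IP rotation), dynamic rate adjustment, and managing the Cookie's validity period;
The scraped data can be extended into price monitoring, visual analysis, and similar functions that serve the main needs of travel data analysis, always within the platform's rules and the law.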
