Automated Collection of Douyin Comment Data with DrissionPage

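This article uses the DrissionPage framework to automate the collection of Douyin comment data. It captures the raw JSON by listening to network requests, reads video and author metadata from the rendered page, and writes the structured result to a CSV file. The write-up covers environment setup, the core code, a walkthrough of the key modules, and usage notes. The script handles pagination and exceptions and is aimed at short-video data analysis.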

Published by 疯疯癫癫 on 2026/3/16, updated 2026/7

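Collecting comment data is an important step in short-video data analysis. Using the DrissionPage automation framework, this article scrapes a Douyin video's comments, author information, and video metadata, and stores the data in a structured CSV file.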

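I. Technology Selection and Core Advantages

1. Core dependencies

DrissionPage: a newer Python automation tool that combines the strengths of Selenium and Requests. It can drive a browser to render pages and can also listen to network requests to capture the raw data, so it depends less on page layout than a scraper that only parses rendered HTML.
datetime / re: used for timestamp conversion and regex-based text extraction, respectively.
csv: Python built-in module used to store the scraped data in structured form.
time: used to add waits that match the page's loading rhythm.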
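To make the roles of `re` and `datetime` concrete, here is a minimal sketch. The `#\S+` pattern is the one used in the article's `extract_video_info`. The helper names `split_title` and `format_timestamp`, the `tz` parameter, and the sample title are illustrative assumptions rather than part of the original script.

```python
import re
from datetime import datetime, timezone

HASHTAG = re.compile(r'#\S+')  # same pattern the article's code uses for tags


def split_title(title):
    """Return (title without hashtags, list of hashtags)."""
    tags = HASHTAG.findall(title)
    # Fall back to the raw title when it consists of hashtags only
    pure_title = HASHTAG.sub('', title).strip() or title
    return pure_title, tags


def format_timestamp(seconds, tz=None):
    """Convert a Unix timestamp in seconds to 'YYYY-MM-DD HH:MM:SS'.

    tz=None uses the machine's local time zone, which is what
    datetime.fromtimestamp does by default; pass a tzinfo for a
    reproducible result.
    """
    return datetime.fromtimestamp(seconds, tz=tz).strftime('%Y-%m-%d %H:%M:%S')


if __name__ == '__main__':
    print(split_title('Weekend cafe tour #food #vlog'))
    print(format_timestamp(1700000000, tz=timezone.utc))
```

Passing an explicit time zone is a design choice for reproducibility. The article's script converts in local time, which may be what you want when the analysis is local to one region.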

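2. Advantages of the approach

Comments are obtained by listening to network requests rather than by parsing page elements, so the data is more complete and collection is faster;
Element locators are chosen specifically for the Douyin page structure;
Exception handling at each step lowers the chance that a run is interrupted;
Output is structured, which simplifies later analysis.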
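The first and last points can be illustrated without a browser. Once a response body is available as parsed JSON, turning it into CSV rows is plain dictionary work. The sketch below is a hedged illustration: the key names (`comments`, `user`, `nickname`, `ip_label`, `text`), the three column names, and the sample payload are all assumptions, so compare them with a response you have actually captured.

```python
import csv
import io

FIELDNAMES = ['nickname', 'region', 'comment']  # illustrative column names


def payload_to_rows(payload):
    """Flatten one comment-list payload into flat dicts, one per comment."""
    rows = []
    for item in payload.get('comments') or []:
        rows.append({
            # 'user' may be missing or null, so guard before the nested lookup
            'nickname': (item.get('user') or {}).get('nickname', 'unknown'),
            'region': item.get('ip_label') or 'unknown',
            'comment': item.get('text', ''),
        })
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO(newline='')  # keep the csv module's own line endings
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == '__main__':
    sample = {'comments': [
        {'user': {'nickname': 'alice'}, 'ip_label': 'Beijing', 'text': 'nice, thanks'},
        {'user': None, 'text': 'first'},
    ]}
    print(rows_to_csv(payload_to_rows(sample)))
```

Writing to an in-memory buffer keeps the example side-effect free. For a real file, open it with `newline=''` so the `csv` module controls the line endings.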

II. Core Implementation

1. Environment setup

First install the core dependency:

pip install DrissionPage

2. Complete code

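# Automation framework: drives the browser and listens to its network traffic
from DrissionPage import ChromiumPage
# Timestamp conversion
from datetime import datetime
# CSV output
import csv
# Short pauses that improve stability while the page loads
import time
# Hashtag extraction
import re

# NOTE: the published listing lost its string literals and numeric arguments.
# Every value in this block is an ASSUMPTION, not the author's original.
# Check each one against the live page and a captured response before use.
LISTEN_TARGET = 'aweme/v1/web/comment/list/'  # assumed fragment of the comment request URL
OUTPUT_FILE = 'data.csv'  # assumed output file name
DEFAULT_VIDEO_URL = ''  # the default link was lost; set one if wanted
AUTHOR_NAME_LOCATOR = None  # lost; DrissionPage locator of the author-name element
AUTHOR_STAT_LOCATOR = None  # lost; locator matching the fans/likes elements
NEXT_PAGE_LOCATOR = None  # lost; locator of the element clicked to load the next page
FANS_WORD = '粉丝'  # assumed on-page label for followers
LIKES_WORDS = ('获赞', '点赞')  # assumed on-page labels for likes
UNIT_WORDS = ('万', '亿')  # assumed unit suffixes (ten thousand, hundred million)
REGION_FALLBACK_KEYS = ('client_info', 'province')  # assumed nested keys for the region
PAGE_LOAD_WAIT = 3  # assumed, seconds
LISTEN_TIMEOUT = 10  # assumed, seconds
ELEMENT_TIMEOUT = 5  # assumed, seconds
CLICK_PAUSE = 1  # assumed, seconds
UNKNOWN = 'unknown'
FIELDNAMES = ['video_title', 'video_tags', 'author_name', 'author_fans', 'author_likes',
              'nickname', 'region', 'date', 'comment']  # assumed column names


# Extract the video title and tags
def extract_video_info(page):
    """Extract the video title and the tags that start with #."""
    try:
        # Get the video title
        title_ele = page.ele('tag:h1', timeout=5)
        title = title_ele.text.strip() if title_ele else 'Unknown title'
        # Extract the tags that start with #
        tag_pattern = re.compile(r'#\S+')
        tags = tag_pattern.findall(title)
        # Plain title with the tags removed; keep the raw title if nothing is left
        pure_title = tag_pattern.sub('', title).strip() or title
        return pure_title, tags
    except Exception as e:
        print(f'Failed to extract video info: {e}')
        return 'Unknown title', []


def _stat_value(ele, text, labels):
    """Return the count shown with a label, from the element itself or its next sibling."""
    value = text
    for label in labels:
        value = value.replace(label, '')
    value = value.strip()
    if value.isdigit() or any(unit in text for unit in UNIT_WORDS):
        return value
    # The number may sit in the sibling element that follows the label
    next_ele = ele.next()
    if next_ele and next_ele.text.strip():
        return next_ele.text.strip()
    return None


def extract_author_info(page):
    """Extract the author's name, follower count and like count."""
    author_info = {'name': UNKNOWN, 'fans': UNKNOWN, 'likes': UNKNOWN}
    try:
        # Author name (skipped while the locator is not configured)
        author_ele = None
        if AUTHOR_NAME_LOCATOR:
            author_ele = page.ele(AUTHOR_NAME_LOCATOR, timeout=ELEMENT_TIMEOUT)
        if author_ele:
            author_name = author_ele.text.strip()
            if author_name:
                author_info['name'] = author_name

        # Follower and like counts
        stat_eles = page.eles(AUTHOR_STAT_LOCATOR) if AUTHOR_STAT_LOCATOR else []
        for ele in stat_eles:
            stat_text = ele.text.strip()
            if FANS_WORD in stat_text:
                key, labels = 'fans', (FANS_WORD,)
            elif any(word in stat_text for word in LIKES_WORDS):
                key, labels = 'likes', LIKES_WORDS
            else:
                continue
            value = _stat_value(ele, stat_text, labels)
            if value:
                author_info[key] = value
    except Exception as e:
        print(f'Failed to extract author info: {e}')
    return author_info


def main():
    # Ask for the video link; fall back to the default when the input is empty
    video_url = input('Douyin video URL: ').strip()
    if not video_url:
        video_url = DEFAULT_VIDEO_URL
    if not video_url:
        print('No video URL given and no default configured.')
        return

    # Create the CSV file (utf-8-sig is an assumption; it lets Excel detect the encoding)
    with open(OUTPUT_FILE, mode='w', encoding='utf-8-sig', newline='') as f:
        csv_writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        # Write the header row
        csv_writer.writeheader()

        # Open the browser
        dp = ChromiumPage()
        # Number of comment pages written successfully
        success_page_count = 0
        # Defaults used if extraction fails
        video_title = 'Unknown title'
        video_tags = []
        author_info = {'name': UNKNOWN, 'fans': UNKNOWN, 'likes': UNKNOWN}

        try:
            # Start listening before loading the page so the first request is captured
            dp.listen.start(LISTEN_TARGET)
            # Open the video page
            dp.get(video_url)
            # Give the page time to load
            time.sleep(PAGE_LOAD_WAIT)

            # Extract the video information
            video_title, video_tags = extract_video_info(dp)
            # Extract the author information
            author_info = extract_author_info(dp)

            print('=' * 50)
            print(f'Video title: {video_title}')
            print(f'Video tags: {video_tags}')
            print(f"Author: {author_info['name']}")
            print(f"Followers: {author_info['fans']}")
            print(f"Likes: {author_info['likes']}")
            print('=' * 50)

            # Pagination state
            page_num = 1
            has_next_page = True

            # Collect page after page until a stop condition breaks the loop
            while has_next_page:
                print(f'Collecting page {page_num} of comments')
                # Wait for the next comment packet
                resp = dp.listen.wait(timeout=LISTEN_TIMEOUT)
                if not resp:
                    print('No packet captured; scrolling to the bottom and retrying')
                    # Scrolling may trigger the comment request
                    dp.scroll.to_bottom()
                    time.sleep(CLICK_PAUSE)
                    resp = dp.listen.wait(timeout=LISTEN_TIMEOUT)
                    if not resp:
                        print('Still no packet; stopping')
                        break

                # Parse the packet (the JSON key names used below are assumptions)
                try:
                    json_data = resp.response.body
                    # List of comments in this packet
                    comments = json_data.get('comments', [])
                    # An empty list means there is nothing more to collect
                    if not comments:
                        print('No comments in this packet; stopping')
                        break

                    # Process each comment
                    for index in comments:
                        try:
                            create_time = index.get('create_time', 0)
                            # A missing timestamp is recorded as unknown
                            if create_time == 0:
                                date = UNKNOWN
                            else:
                                date = str(datetime.fromtimestamp(create_time))

                            # Region: try the direct field first, then the nested fallback
                            region = index.get('ip_label', '')
                            if not region:
                                ip_client_info = index.get(REGION_FALLBACK_KEYS[0], {})
                                region = ip_client_info.get(REGION_FALLBACK_KEYS[1], UNKNOWN)
                        except (KeyError, TypeError, ValueError, OSError) as e:
                            # Widened from KeyError: .get() never raises it, fromtimestamp can fail
                            print(f'Skipping a comment that failed to parse: {e}')
                            continue

                        dit = {
                            'video_title': video_title,
                            'video_tags': ' '.join(video_tags),
                            'author_name': author_info['name'],
                            'author_fans': author_info['fans'],
                            'author_likes': author_info['likes'],
                            'nickname': index.get('user', {}).get('nickname', UNKNOWN),
                            'region': region,
                            'date': date,
                            'comment': index.get('text', ''),
                        }
                        try:
                            csv_writer.writerow(dit)
                            print(dit)
                        except Exception as e:
                            print(f'Failed to write a row: {e}')

                    # One more page written
                    success_page_count += 1

                    # Look for the element that loads the next page
                    next_page = None
                    if NEXT_PAGE_LOCATOR:
                        next_page = dp.ele(NEXT_PAGE_LOCATOR, timeout=ELEMENT_TIMEOUT)
                    if not next_page:
                        print('No next page; stopping')
                        break

                    # Scroll it into view and click it
                    try:
                        dp.scroll.to_see(next_page)
                        time.sleep(CLICK_PAUSE)
                        next_page.click()
                        page_num += 1
                        time.sleep(CLICK_PAUSE)
                    except Exception as e:
                        print(f'Failed to turn the page: {e}')
                        break
                except Exception as e:
                    print(f'Failed to process page {page_num}: {e}')
                    break

            print('=' * 50)
            print(f'Done: {success_page_count} page(s) of comments saved to {OUTPUT_FILE}')
            print('=' * 50)
        except Exception as e:
            print(f'Run failed: {e}')
        finally:
            # Always close the browser
            dp.quit()
            print('Browser closed')


if __name__ == '__main__':
    main()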
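The loop in `main` waits for a comment packet, and when none arrives it scrolls to the bottom once and waits again before giving up. That wait, nudge, wait pattern can be isolated and tested without a browser. The sketch below is one possible way to factor it out, not the article's own code: `wait_with_retry` and its `wait` / `nudge` parameters are hypothetical names, and the callables are injected so that fakes can stand in for the browser.

```python
def wait_with_retry(wait, nudge, timeout=10.0):
    """Wait for a packet; on timeout, nudge the page once and wait again.

    wait(timeout) -> a packet, or a falsy value when nothing arrived in time
    nudge()       -> any action that may trigger the request, such as scrolling
    Returns the packet, or None when both attempts time out.
    """
    packet = wait(timeout)
    if packet:
        return packet
    nudge()
    return wait(timeout) or None


if __name__ == '__main__':
    # Fake listener: the first wait times out, the second returns a packet
    answers = iter([False, {'comments': []}])
    scrolled = []
    result = wait_with_retry(lambda t: next(answers), lambda: scrolled.append(True), timeout=1)
    print(result, scrolled)
```

With the objects from the script, `wait` could be `lambda t: dp.listen.wait(timeout=t)` and `nudge` could be `dp.scroll.to_bottom`, which are the two calls the loop already makes. Returning `None` on a double timeout lets the caller end the loop with a single check.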
