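Project background
After working through an introductory web-scraping tutorial, I decided to consolidate what I had learned with a hands-on exercise. The target site is '百思不得姐' (budejie.com), and the goal is to scrape the text of its jokes.
Page analysis
The site's home page lists a large number of jokes. For now I scrape only the text and leave the images aside. A look at the page source:
shows that each joke link has a structure like <a href="(/detail-3242432.html)">joke text</a>, where the parentheses mark the part to capture. A regular expression extracts both the link and the text: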
reg = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')
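As a quick check of this pattern, the sketch below runs it against a hand-written HTML fragment. The fragment, the /about.html link, and the names sample_html and matches are made up for illustration; the real page markup may differ.

```python
import re

# Hypothetical fragment imitating the link structure described above.
sample_html = (
    '<a href="/detail-3242432.html">First joke</a>\n'
    '<a href="/about.html">About</a>\n'
    '<a href="/detail-3242433.html">Second joke</a>\n'
)

reg = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')

# With two capture groups, findall returns a list of (link, text) tuples.
# The /about.html link is skipped because its href does not start with /detail-.
matches = reg.findall(sample_html)
for link, text in matches:
    print(link + ' -> ' + text)
```

Because both quantifiers are non-greedy (.*?), each match stops at the first closing quote and the first </a>, so neighbouring links are not merged into one match.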
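The upvote count sits in HTML like <i class="icon-up ui-icon-up"></i> <span>(.*?)</span>, where the parentheses again mark the part to capture, so a regular expression extracts it the same way: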
reg = re.compile(r'<i class="icon-up ui-icon-up"></i> <span>(.*?)</span>')
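The same kind of check works for the upvote pattern. The sample fragment below is invented for illustration, including the icon-down element, which is only assumed to exist on the real page, and the name ups.

```python
import re

# Hypothetical fragment: two upvote counters and one downvote counter.
sample_html = (
    '<i class="icon-up ui-icon-up"></i> <span>128</span>\n'
    '<i class="icon-down ui-icon-down"></i> <span>7</span>\n'
    '<i class="icon-up ui-icon-up"></i> <span>56</span>\n'
)

reg = re.compile(r'<i class="icon-up ui-icon-up"></i> <span>(.*?)</span>')

# With a single capture group, findall returns a plain list of strings,
# which are converted to integers here.
ups = [int(n) for n in reg.findall(sample_html)]
print(ups)
```

Note that the pattern depends on the exact class attribute and on the single space before <span>, so any change in the page markup would make it return an empty list.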
Code implementation
# -*- coding: utf-8 -*-
# Python 2 script (urllib2 is not available in Python 3).
import urllib2
import re


def getduan():
    """Return a list of (link, text) tuples for the jokes on the page."""
    url = 'http://www.budejie.com/text/'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    res = response.read()
    reg = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')
    return reg.findall(res)


def up():
    """Return a list of upvote counts (as strings) from the same page."""
    url = 'http://www.budejie.com/text/'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    res = response.read()
    reg = re.compile(r'<i class="icon-up ui-icon-up"></i> <span>(.*?)</span>')
    return reg.findall(res)


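if __name__ == '__main__':
    # Pair each (link, text) tuple with its upvote count, in page order.
    # Iterating over the zipped list directly avoids dict(), which would
    # lose the page order and collapse duplicate entries.
    d = zip(getduan(), up())
    count = 0
    for j, i in d:
        # j is a (link, text) tuple, so j[1] is the joke text.
        print 'Joke', (count + 1), ':', j[1]
        count = count + 1
        print 'Upvotes:', i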
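The script above requests the same URL twice, once per function. One possible variation, sketched below, is to download the page once and apply both patterns to the same string. The function parse_page, the constants LINK_RE and UP_RE, and the sample HTML are hypothetical names and data introduced only for this sketch; the download step is left out so the example runs without network access.

```python
import re

LINK_RE = re.compile(r'<a href="(/detail-.*?)">(.*?)</a>')
UP_RE = re.compile(r'<i class="icon-up ui-icon-up"></i> <span>(.*?)</span>')


def parse_page(html):
    """Return a list of (text, upvotes) pairs in page order."""
    jokes = LINK_RE.findall(html)  # list of (link, text) tuples
    ups = UP_RE.findall(html)      # list of strings
    # zip stops at the shorter list, so if the two patterns match a
    # different number of items, the extra ones are silently dropped.
    return [(text, up) for (link, text), up in zip(jokes, ups)]


# Hypothetical page fragment standing in for the downloaded HTML.
sample_html = (
    '<a href="/detail-1.html">First joke</a>\n'
    '<i class="icon-up ui-icon-up"></i> <span>128</span>\n'
    '<a href="/detail-2.html">Second joke</a>\n'
    '<i class="icon-up ui-icon-up"></i> <span>56</span>\n'
)

result = parse_page(sample_html)
for count, (text, up) in enumerate(result, 1):
    print('Joke %d: %s' % (count, text))
    print('Upvotes: %s' % up)
```

Pairing the two lists by position assumes every joke on the page has exactly one upvote counter and that both appear in the same order, which would need to be verified against the real markup.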


