This tutorial uses the BeautifulSoup library to scrape post information from a given Baidu Tieba forum (a "bar").
The code for this tutorial is hosted on GitHub: https://github.com/w392807287/spider_baidu_bar
For the data analysis part, please see:
Python version: 3.5.2
Getting page information with the BeautifulSoup library
Import the relevant libraries (the last four are only needed in later sections):
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
import json       # used later when parsing data-field attributes
import datetime   # used later for timestamps
import csv        # used later for writing the results
import threading  # used later for the per-page worker threads
We will use the python bar as the example. Its main page is http://tieba.baidu.com/f?ie=utf-8&kw=python&fr=search, or, in simplified form, http://tieba.baidu.com/f?kw=python.
Get a BeautifulSoup object:
url = "http://tieba.baidu.com/f?kw=python"
html = urlopen(url).read()
bsObj = BeautifulSoup(html, "lxml")
Let's wrap this up into a function that takes a url and returns a BeautifulSoup object:
def get_bsObj(url):
    '''
    Return the BeautifulSoup object for the given url
    :param url: target URL
    :return: BeautifulSoup object, or None on error
    '''
    try:
        html = urlopen(url).read()
        bsObj = BeautifulSoup(html, "lxml")
        return bsObj
    except HTTPError as e:
        print(e)
        return None
This function takes a url and returns a BeautifulSoup object; if an error occurs, it prints the error and returns None.
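A quick sanity check (a minimal sketch using the python bar URL from above; the printed title is simply whatever the page returns):

bsObj = get_bsObj("http://tieba.baidu.com/f?kw=python")
if bsObj is not None:
    print(bsObj.title.get_text())  # print the page <title> to confirm the fetch worked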
Processing the bar's main page
The bar's main page contains overview information about the bar, such as the number of followers, the number of threads, and the number of posts. We will collect this information into a file.
Get the BeautifulSoup object for the main page:
bsObj_mainpage = get_bsObj(url)
Get the total number of pages:
last_page = int(bsObj_mainpage.find("a",{"class":"last pagination-item "})['href'].split("=")[-1])
We get the last page number so that we don't go out of bounds later when crawling the posts. This uses BeautifulSoup's find method.
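For illustration, the href of that "last page" link looks roughly like the value below (the real pn number will differ), so split("=")[-1] simply takes everything after the last "=":

href = "//tieba.baidu.com/f?kw=python&ie=utf-8&pn=25050"  # hypothetical href of the last-page link
last_page = int(href.split("=")[-1])                      # -> 25050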
Get the information we need and write it to a file:
red_text = bsObj_mainpage.findAll("span", {"class": "red_text"})
subject_sum = red_text[0].get_text()  # number of threads
post_sum = red_text[1].get_text()     # number of posts
follow_sum = red_text[2].get_text()   # number of followers

with open('main_info.txt', 'w+') as f:
    f.writelines("统计时间: " + str(datetime.datetime.now()) + "\n")
    f.writelines("主题数: " + subject_sum + "\n")
    f.writelines("帖子数: " + post_sum + "\n")
    f.writelines("关注量: " + follow_sum + "\n")
Finally, wrap these steps into a function that takes the main page's url, writes the information, and returns the number of the last page:
def del_mainPage(url):
    '''
    Process the main page and return the last page number
    :param url: URL of the bar's main page
    :return: the last page number, int
    '''
    bsObj_mainpage = get_bsObj(url)
    last_page = int(bsObj_mainpage.find("a", {"class": "last pagination-item "})['href'].split("=")[-1])
    try:
        red_text = bsObj_mainpage.findAll("span", {"class": "red_text"})
        subject_sum = red_text[0].get_text()  # number of threads
        post_sum = red_text[1].get_text()     # number of posts
        follow_sum = red_text[2].get_text()   # number of followers
    except AttributeError as e:
        print("发生错误:" + str(e) + " 时间:" + str(datetime.datetime.now()))
        return None
    with open('main_info.txt', 'w+') as f:
        f.writelines("统计时间: " + str(datetime.datetime.now()) + "\n")
        f.writelines("主题数: " + subject_sum + "\n")
        f.writelines("帖子数: " + post_sum + "\n")
        f.writelines("关注量: " + follow_sum + "\n")
    return last_page
The result:
统计时间: 2016-10-07 15:14:19.642933
主题数: 25083
帖子数: 414831
关注量: 76511
Getting post URLs from a list page
To get the detailed information of a post we have to open the post itself, so we need its address. For example: http://tieba.baidu.com/p/4700788764
In this url, http://tieba.baidu.com is the server address, /p presumably corresponds to the post route, and /4700788764 is the post's id.
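Going the other way, the post id can be pulled back out of a post URL with a simple split (just a tiny sketch):

post_url = "http://tieba.baidu.com/p/4700788764"
post_id = post_url.rstrip("/").split("/")[-1]  # -> "4700788764"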
Looking at the bar's front page, each post sits in its own block. Pressing F12 in the browser and inspecting the code shows that each post corresponds to one <li> tag, and each page holds 50 posts (the front page may differ because of promoted/ad threads).
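In other words, the list pages are addressed by the pn query parameter in steps of 50, which is why the crawl loop later advances in steps of 50 (illustrative URLs for the python bar):

page 1: http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=0
page 2: http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=50
page 3: http://tieba.baidu.com/f?kw=python&ie=utf-8&pn=100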
Each <li> tag looks roughly like this: it carries class=" j_thread_list clearfix" and a data-field attribute whose value (HTML-escaped in the page source) is the JSON string {"id":4700788764,"author_name":"\u6768\u5175507","first_post_id":95008842757,"reply_num":2671,"is_bakan":null,"vid":"","is_good":null,"is_top":null,"is_protal":null,"is_membertop":null,"frs_tpoint":null}.
Using the class attribute we can find all the posts on a single page:
posts = bsObj_page.findAll("li", {"class": "j_thread_list"})
The data-field attribute gives us the post ID, the author's name, the reply count, whether the post is featured, and so on. The post ID would let us build the post's url, but the <a> tag below provides the link directly.
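As a side note, the data-field value is a JSON string (BeautifulSoup already decodes the &quot; entities in the attribute), so it can be parsed directly. A small sketch using the first post from the findAll result above, with the field names taken from the attribute shown earlier:

import json

data_field = posts[0].attrs.get("data-field")
if data_field:
    info = json.loads(data_field)
    thread_id = info.get("id")         # e.g. 4700788764
    reply_num = info.get("reply_num")  # e.g. 2671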
We take the link from each post and append it to a list:
post_info = post.find("a", {"class": "j_th_tit "})
urls.append("http://tieba.baidu.com" + post_info.attrs["href"])
Packaging the code above: given the URL of one list page, return the URLs of all posts on it:
def get_url_from_page(page_url):
    '''
    Process the given list page and return the URLs of the posts on it
    :param page_url: URL of the list page
    :return: URLs of all posts on the page
    '''
    bsObj_page = get_bsObj(page_url)
    urls = []
    try:
        posts = bsObj_page.findAll("li", {"class": "j_thread_list"})
    except AttributeError as e:
        print("发生错误:" + str(e) + " 时间:" + str(datetime.datetime.now()))
        return urls  # nothing usable on this page
    for post in posts:
        post_info = post.find("a", {"class": "j_th_tit "})
        urls.append("http://tieba.baidu.com" + post_info.attrs["href"])
    return urls
Processing the information of each post
Above we obtained the address of every post; next we process the information inside each one. We need to find the pieces of information that are useful to us on the post's page and store them in a CSV file.
We'll again use this address as the example: http://tieba.baidu.com/p/4700788764
First, when we open this link, the information we immediately see is the post's title, the original poster's name, the posting time, the number of replies, and so on.
Looking at the code of this page, we again find the two attributes class and data-field. data-field contains most of the information about this post: the poster's id, the poster's nickname, gender, level id, level name, open_id, open_type, the posting date, and so on.
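To make the parsing below easier to follow, here is a rough hand-written picture of the structure the code relies on; the keys match what is read out later, the sample values are copied from the CSV line shown further down, and the real attribute contains more fields:

data_field_example = {
    "author":  {"user_name": "openlabczx", "user_sex": 0, "level_id": 7, "level_name": "贡士"},
    "content": {"post_id": 98773024983, "date": "2016-10-06 20:32", "open_id": "tieba", "open_type": ""},
}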
First we define a post class whose attributes hold the post's information and whose method writes that information into the corresponding CSV file:
class PostInfo:
    def __init__(self, post_id, post_title, post_url, reply_num, post_date, open_id, open_type,
                 user_name, user_sex, level_id, level_name):
        self.post_id = post_id
        self.post_title = post_title
        self.post_url = post_url
        self.reply_num = reply_num
        self.post_date = post_date
        self.open_id = open_id
        self.open_type = open_type
        self.user_name = user_name
        self.user_sex = user_sex
        self.level_id = level_id
        self.level_name = level_name

    def dump_to_csv(self, filename):
        csvFile = open(filename, "a+")
        try:
            writer = csv.writer(csvFile)
            writer.writerow((self.post_id, self.post_title, self.post_url, self.reply_num, self.post_date,
                             self.open_id, self.open_type, self.user_name, self.user_sex, self.level_id,
                             self.level_name))
        finally:
            csvFile.close()
Then we locate the corresponding information with the find method:
obj1 = json.loads(bsObj.find("div", attrs={"class": "l_post j_l_post l_post_bright noborder "}).attrs['data-field'])
reply_num = bsObj.find("li", attrs={"class": "l_reply_num"}).span.get_text()
post_title = bsObj.find("h1", attrs={"class": "core_title_txt"}).get_text()
post_id = obj1.get('content').get('post_id')
post_url = url
post_date = obj1.get('content').get('date')
open_id = obj1.get('content').get('open_id')
open_type = obj1.get('content').get('open_type')
user_name = obj1.get('author').get('user_name')
user_sex = obj1.get('author').get('user_sex')
level_id = obj1.get('author').get('level_id')
level_name = obj1.get('author').get('level_name')
Create the object and save it:
postinfo = PostInfo(post_id, post_title, post_url, reply_num, post_date, open_id, open_type,
                    user_name, user_sex, level_id, level_name)
postinfo.dump_to_csv('post_info2.csv')
You don't actually have to go through an object to save the data; that's just my personal preference.
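If you prefer to skip the class entirely, a minimal alternative sketch (my suggestion, not part of the original code) is to append a tuple straight to the CSV file:

import csv

def dump_row(filename, row):
    # append one record directly, without building a PostInfo object
    with open(filename, "a+", newline='') as csvFile:
        csv.writer(csvFile).writerow(row)

dump_row('post_info2.csv',
         (post_id, post_title, post_url, reply_num, post_date,
          open_id, open_type, user_name, user_sex, level_id, level_name))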
Wrap the code above into a function that processes a list of post URLs:
def del_post(urls):
    '''
    Process the posts at the given URLs
    :param urls: list of post URLs
    :return:
    '''
    for url in urls:
        bsObj = get_bsObj(url)
        try:
            obj1 = json.loads(
                bsObj.find("div", attrs={"class": "l_post j_l_post l_post_bright noborder "}).attrs['data-field'])
            reply_num = bsObj.find("li", attrs={"class": "l_reply_num"}).span.get_text()
            post_title = bsObj.find("h1", attrs={"class": "core_title_txt"}).get_text()
        except:
            print("发生错误:" + "---" + "时间:" + str(datetime.datetime.now()) + url)
            with open('error.txt', 'a+') as f:
                f.writelines("发生错误:" + "---" + "时间:" + str(datetime.datetime.now()) + url + "\n")
            continue  # skip this post instead of aborting the whole batch
        post_id = obj1.get('content').get('post_id')
        post_url = url
        post_date = obj1.get('content').get('date')
        open_id = obj1.get('content').get('open_id')
        open_type = obj1.get('content').get('open_type')
        user_name = obj1.get('author').get('user_name')
        user_sex = obj1.get('author').get('user_sex')
        level_id = obj1.get('author').get('level_id')
        level_name = obj1.get('author').get('level_name')
        postinfo = PostInfo(post_id, post_title, post_url, reply_num, post_date, open_id, open_type,
                            user_name, user_sex, level_id, level_name)
        postinfo.dump_to_csv('post_info2.csv')
        del postinfo
The resulting CSV rows look like:
98773024983,【轰动Python界】的学习速成高效大法,http://tieba.baidu.com/p/4811129571,2,2016-10-06 20:32,tieba,,openlabczx,0,7,贡士
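The columns simply follow the writerow order in dump_to_csv. If you want to read the file back with named fields, a small sketch (the column names are just the PostInfo attribute names):

import csv

columns = ["post_id", "post_title", "post_url", "reply_num", "post_date",
           "open_id", "open_type", "user_name", "user_sex", "level_id", "level_name"]

with open('post_info2.csv', newline='') as f:
    for row in csv.reader(f):
        print(dict(zip(columns, row)))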
Combining the functions above
First, ask the user for the main page of the bar to crawl:
home_page_url = input("请输入要处理贴吧的主页链接")
Process the url:
bar_name = home_page_url.split("=")[1].split("&")[0]
pre_page_url = "http://tieba.baidu.com/f?kw=" + bar_name + "&ie=utf-8&pn="  # page URL prefix without the pn value
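The split-based line assumes kw is the first query parameter; a slightly more defensive sketch using the standard library would be:

from urllib.parse import urlparse, parse_qs

query = parse_qs(urlparse(home_page_url).query)
bar_name = query["kw"][0]  # works no matter where kw appears in the query string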
Process the main page:
all_post_num = del_mainPage(home_page_url)  # total number of posts in the bar (the last page's pn value)
Ask the user how many posts to crawl:
del_post_num = int(input("请输入需要处理前多少条帖子:"))  # number of posts to process
Finally:
if del_post_num > all_post_num:
    print("需要处理的帖子数大于贴吧帖子总数!")
else:
    for page in range(0, del_post_num, 50):
        print("It's processing page : " + str(page))
        page_url = pre_page_url + str(page)
        urls = get_url_from_page(page_url)
        t = threading.Thread(target=del_post, args=(urls,))
        t.start()
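One thing worth noting: the loop above starts one thread per page and never waits for them, while the complete listing at the end of this post calls t.join() right after t.start(), which in effect processes the pages one at a time. A variant that keeps the pages running in parallel but still waits for all of them before exiting could look like this (a sketch, not the original code):

threads = []
for page in range(0, del_post_num, 50):
    page_url = pre_page_url + str(page)
    t = threading.Thread(target=del_post, args=(get_url_from_page(page_url),))
    t.start()
    threads.append(t)

for t in threads:  # wait for every page's worker to finish
    t.join()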
The main block:
if __name__ == '__main__':
    # home_page_url = input("请输入要处理贴吧的主页链接")
    home_page_url = test_url  # test_url is defined in the complete listing below
    bar_name = home_page_url.split("=")[1].split("&")[0]
    pre_page_url = "http://tieba.baidu.com/f?kw=" + bar_name + "&ie=utf-8&pn="  # page URL prefix without the pn value
    all_post_num = del_mainPage(home_page_url)  # total number of posts in the bar
    del_post_num = int(input("请输入需要处理前多少条帖子:"))  # number of posts to process
    if del_post_num > all_post_num:
        print("需要处理的帖子数大于贴吧帖子总数!")
    else:
        for page in range(0, del_post_num, 50):
            print("It's processing page : " + str(page))
            page_url = pre_page_url + str(page)
            urls = get_url_from_page(page_url)
            t = threading.Thread(target=del_post, args=(urls,))
            t.start()
The complete code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
import json
import datetime
import csv
import threading


class PostInfo:
    def __init__(self, post_id, post_title, post_url, reply_num, post_date, open_id, open_type,
                 user_name, user_sex, level_id, level_name):
        self.post_id = post_id
        self.post_title = post_title
        self.post_url = post_url
        self.reply_num = reply_num
        self.post_date = post_date
        self.open_id = open_id
        self.open_type = open_type
        self.user_name = user_name
        self.user_sex = user_sex
        self.level_id = level_id
        self.level_name = level_name

    def dump_to_csv(self, filename):
        csvFile = open(filename, "a+")
        try:
            writer = csv.writer(csvFile)
            writer.writerow((self.post_id, self.post_title, self.post_url, self.reply_num, self.post_date,
                             self.open_id, self.open_type, self.user_name, self.user_sex, self.level_id,
                             self.level_name))
        finally:
            csvFile.close()


def get_bsObj(url):
    '''
    Return the BeautifulSoup object for the given url
    :param url: target URL
    :return: BeautifulSoup object, or None on error
    '''
    try:
        html = urlopen(url).read()
        bsObj = BeautifulSoup(html, "lxml")
        return bsObj
    except HTTPError as e:
        print(e)
        return None


def del_mainPage(url):
    '''
    Process the main page and return the last page number
    :param url: URL of the bar's main page
    :return: the last page number, int
    '''
    bsObj_mainpage = get_bsObj(url)
    last_page = int(bsObj_mainpage.find("a", {"class": "last pagination-item "})['href'].split("=")[-1])
    try:
        red_text = bsObj_mainpage.findAll("span", {"class": "red_text"})
        subject_sum = red_text[0].get_text()  # number of threads
        post_sum = red_text[1].get_text()     # number of posts
        follow_sum = red_text[2].get_text()   # number of followers
    except AttributeError as e:
        print("发生错误:" + str(e) + " 时间:" + str(datetime.datetime.now()))
        return None
    with open('main_info.txt', 'w+') as f:
        f.writelines("统计时间: " + str(datetime.datetime.now()) + "\n")
        f.writelines("主题数: " + subject_sum + "\n")
        f.writelines("帖子数: " + post_sum + "\n")
        f.writelines("关注量: " + follow_sum + "\n")
    return last_page


def get_url_from_page(page_url):
    '''
    Process the given list page and return the URLs of the posts on it
    :param page_url: URL of the list page
    :return: URLs of all posts on the page
    '''
    bsObj_page = get_bsObj(page_url)
    urls = []
    try:
        posts = bsObj_page.findAll("li", {"class": "j_thread_list"})
    except AttributeError as e:
        print("发生错误:" + str(e) + " 时间:" + str(datetime.datetime.now()))
        return urls  # nothing usable on this page
    for post in posts:
        post_info = post.find("a", {"class": "j_th_tit "})
        urls.append("http://tieba.baidu.com" + post_info.attrs["href"])
    return urls


def del_post(urls):
    '''
    Process the posts at the given URLs
    :param urls: list of post URLs
    :return:
    '''
    for url in urls:
        bsObj = get_bsObj(url)
        try:
            obj1 = json.loads(
                bsObj.find("div", attrs={"class": "l_post j_l_post l_post_bright noborder "}).attrs['data-field'])
            reply_num = bsObj.find("li", attrs={"class": "l_reply_num"}).span.get_text()
            post_title = bsObj.find("h1", attrs={"class": "core_title_txt"}).get_text()
        except:
            print("发生错误:" + "---" + "时间:" + str(datetime.datetime.now()) + url)
            with open('error.txt', 'a+') as f:
                f.writelines("发生错误:" + "---" + "时间:" + str(datetime.datetime.now()) + url + "\n")
            continue  # skip this post instead of aborting the whole batch
        post_id = obj1.get('content').get('post_id')
        post_url = url
        post_date = obj1.get('content').get('date')
        open_id = obj1.get('content').get('open_id')
        open_type = obj1.get('content').get('open_type')
        user_name = obj1.get('author').get('user_name')
        user_sex = obj1.get('author').get('user_sex')
        level_id = obj1.get('author').get('level_id')
        level_name = obj1.get('author').get('level_name')
        postinfo = PostInfo(post_id, post_title, post_url, reply_num, post_date, open_id, open_type,
                            user_name, user_sex, level_id, level_name)
        postinfo.dump_to_csv('post_info2.csv')
        # t = threading.Thread(target=postinfo.dump_to_csv, args=('post_info2.csv',))
        # t.start()
        del postinfo


test_url = "http://tieba.baidu.com/f?kw=python&ie=utf-8"

if __name__ == '__main__':
    # home_page_url = input("请输入要处理贴吧的主页链接")
    home_page_url = test_url
    bar_name = home_page_url.split("=")[1].split("&")[0]
    pre_page_url = "http://tieba.baidu.com/f?kw=" + bar_name + "&ie=utf-8&pn="  # page URL prefix without the pn value
    all_post_num = del_mainPage(home_page_url)  # total number of posts in the bar
    del_post_num = int(input("请输入需要处理前多少条帖子:"))  # number of posts to process
    if del_post_num > all_post_num:
        print("需要处理的帖子数大于贴吧帖子总数!")
    else:
        for page in range(0, del_post_num, 50):
            print("It's processing page : " + str(page))
            page_url = pre_page_url + str(page)
            urls = get_url_from_page(page_url)
            t = threading.Thread(target=del_post, args=(urls,))
            t.start()
            t.join()
            # del_post(urls)
That's all.
You're welcome to visit my blog: http://liqiongyu.com/blog
WeChat official account: