Scraping and Analyzing Baidu Tieba Data (Part 1): Scraping Post Information from a Specified Bar
Published: 2019-06-13


This tutorial uses the BeautifulSoup library to scrape post information from a specified Tieba (Baidu forum) bar.

The code for this tutorial is hosted on GitHub: https://github.com/w392807287/spider_baidu_bar

For the data analysis part, see:

Python version: 3.5.2

Fetching page content with BeautifulSoup

Import the required libraries:

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError

 

We'll use the Python bar as our example. Its homepage is http://tieba.baidu.com/f?ie=utf-8&kw=python&fr=search, or more simply http://tieba.baidu.com/f?kw=python.
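Note that for bar names containing Chinese characters, the kw value has to be percent-encoded to form a valid URL. A minimal sketch (the helper name build_bar_url is my own, not part of the tutorial code):

from urllib.parse import quote

def build_bar_url(bar_name):
    # Percent-encode the bar name so non-ASCII names (e.g. Chinese) form a valid URL
    return "http://tieba.baidu.com/f?kw=" + quote(bar_name)

print(build_bar_url("python"))   # http://tieba.baidu.com/f?kw=python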

Get the BeautifulSoup object:

url = "http://tieba.baidu.com/f?kw=python"
html = urlopen(url).read()
bsObj = BeautifulSoup(html, "lxml")

Let's wrap this snippet into a function that takes a url and returns a bs object:

def get_bsObj(url):
    '''
    Return the BeautifulSoup object for the given url
    :param url: target URL
    :return: BeautifulSoup object
    '''
    try:
        html = urlopen(url).read()
        bsObj = BeautifulSoup(html, "lxml")
        return bsObj
    except HTTPError as e:
        print(e)
        return None

This function takes a url and returns a BeautifulSoup object; if an error occurs, it prints the error and returns None.
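One practical caveat: pages are sometimes served differently (or not at all) to urllib's default user agent. If get_bsObj comes back with missing elements, a variant like the following (my own addition, using only the standard urllib.request.Request API) sends a browser-style User-Agent header:

from urllib.request import Request, urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def get_bsObj_with_ua(url):
    # Same as get_bsObj, but with a browser-like User-Agent header
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        return BeautifulSoup(urlopen(req).read(), "lxml")
    except HTTPError as e:
        print(e)
        return None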

Processing the bar's homepage

The bar's homepage carries general information about the bar, such as its follower count, topic count, and post count. We'll collect these figures into a file.

Get the homepage's bs object:

bsObj_mainpage = get_bsObj(url)

Get the total number of pages:

last_page = int(bsObj_mainpage.find("a",{"class":"last pagination-item "})['href'].split("=")[-1])

We grab the last page's number so that we don't run past the end of the bar when crawling posts later. This uses BeautifulSoup's find method.
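To make the split("=")[-1] concrete: the "last page" link's href ends in a pn offset, and the code simply peels off everything after the final "=". The href below is illustrative only; its exact format is an assumption:

# Illustrative href for the "last page" link (assumed format)
href = "//tieba.baidu.com/f?kw=python&ie=utf-8&pn=25050"
last_page = int(href.split("=")[-1])   # -> 25050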

Get the information we need and write it to a file:

import datetime  # for the timestamp written below

red_text = bsObj_mainpage.findAll("span", {"class": "red_text"})
subject_sum = red_text[0].get_text()  # topic count
post_sum = red_text[1].get_text()  # post count
follow_sum = red_text[2].get_text()  # follower count

with open('main_info.txt', 'w+') as f:
    f.writelines("统计时间: " + str(datetime.datetime.now()) + "\n")
    f.writelines("主题数:  " + subject_sum + "\n")
    f.writelines("帖子数:  " + post_sum + "\n")
    f.writelines("关注量:  " + follow_sum + "\n")

Finally we wrap these steps into a function that takes the homepage url, writes the info, and returns the last page number:

def del_mainPage(url):
    '''
    Process the main page and return the last page number
    :param url: the bar's homepage URL
    :return: last page number, int
    '''
    bsObj_mainpage = get_bsObj(url)
    last_page = int(bsObj_mainpage.find("a", {"class": "last pagination-item "})['href'].split("=")[-1])
    try:
        red_text = bsObj_mainpage.findAll("span", {"class": "red_text"})
        subject_sum = red_text[0].get_text()  # topic count
        post_sum = red_text[1].get_text()  # post count
        follow_sum = red_text[2].get_text()  # follower count
    except AttributeError as e:
        print("发生错误:" + str(e) + "时间:" + str(datetime.datetime.now()))
        return None
    with open('main_info.txt', 'w+') as f:
        f.writelines("统计时间: " + str(datetime.datetime.now()) + "\n")
        f.writelines("主题数:  " + subject_sum + "\n")
        f.writelines("帖子数:  " + post_sum + "\n")
        f.writelines("关注量:  " + follow_sum + "\n")
    return last_page

The result:

统计时间: 2016-10-07 15:14:19.642933
主题数:  25083
帖子数:  414831
关注量:  76511

 

Extracting post URLs from a page

To get a post's details we have to open the post itself, so we need its address. For example: http://tieba.baidu.com/p/4700788764

In this url, http://tieba.baidu.com is the server address, /p is presumably the route for posts, and /4700788764 is the post's id.
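So given a post url, the id is just the last path segment. A one-line sketch:

# Minimal sketch: extract the numeric post id from a post url
post_url = "http://tieba.baidu.com/p/4700788764"
post_id = post_url.rstrip("/").split("/")[-1]   # -> '4700788764'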

Looking at the bar's front page, each post sits in its own block. Press F12 in the browser and you'll see that each post corresponds to one <li> tag, and that each page holds 50 posts (the front page may differ because of pinned or ad posts).

The <li> tag looks roughly like this (simplified to the attributes we care about):

<li class=" j_thread_list clearfix" data-field='{"id":4700788764,"author_name":"\u6768\u5175507","first_post_id":95008842757,"reply_num":2671,"is_bakan":null,"vid":"","is_good":null,"is_top":null,"is_protal":null,"is_membertop":null,"frs_tpoint":null}'>…</li>

So each post's <li> carries class=" j_thread_list clearfix" plus a data-field attribute whose JSON holds the post's metadata (here the reply count is 2671).

Using the class attribute we can find all the posts on a single page:

posts = bsObj_page.findAll("li", {"class": "j_thread_list"})

From the data-field attribute we can read the post ID, the author's name, the reply count, whether the post is featured, and so on. The post ID alone would let us build the post's url, though the <a> tag below supplies it directly.
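A minimal sketch of reading that metadata, assuming post is one element of the posts list from the findAll above (BeautifulSoup has already unescaped the attribute, so it parses as plain JSON):

import json

# Pull the metadata JSON out of one thread's <li> tag
meta = json.loads(post.attrs["data-field"])
print(meta["id"], meta["reply_num"])   # e.g. 4700788764 2671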

We grab the link and append it to a list:

post_info = post.find("a", {"class": "j_th_tit "})
urls.append("http://tieba.baidu.com" + post_info.attrs["href"])

Packaging the code above: given one page's link, return the urls of all posts on that page:

def get_url_from_page(page_url):
    '''
    Process the given page and return the urls of the posts on it
    :param page_url: page link
    :return: urls of all posts on the page
    '''
    bsObj_page = get_bsObj(page_url)
    urls = []
    try:
        posts = bsObj_page.findAll("li", {"class": "j_thread_list"})
    except AttributeError as e:
        print("发生错误:" + str(e) + "时间:" + str(datetime.datetime.now()))
        return urls  # the page could not be fetched; return the empty list
    for post in posts:
        post_info = post.find("a", {"class": "j_th_tit "})
        urls.append("http://tieba.baidu.com" + post_info.attrs["href"])
    return urls

     

Processing each post's information

Above we obtained each post's address; next we process the information inside each post. We need to find the useful pieces on the post's page and store them in a csv file.

Again using this address as the example: http://tieba.baidu.com/p/4700788764

When we open this link, the first things we see are the post title, the original poster's name, the posting time, the reply count, and so on.

Let's look at this page's source:

We again find the same two attributes, class and data-field; data-field contains most of the post's information: the poster's id, the poster's nickname, gender, level id, level name, open_id, open_type, posting date, and so on.

First we define a post class whose attributes hold the post's fields and whose method writes them to a csv file:

class PostInfo:
    def __init__(self, post_id, post_title, post_url, reply_num, post_date, open_id, open_type,
                 user_name, user_sex, level_id, level_name):
        self.post_id = post_id
        self.post_title = post_title
        self.post_url = post_url
        self.reply_num = reply_num
        self.post_date = post_date
        self.open_id = open_id
        self.open_type = open_type
        self.user_name = user_name
        self.user_sex = user_sex
        self.level_id = level_id
        self.level_name = level_name

    def dump_to_csv(self, filename):
        csvFile = open(filename, "a+")
        try:
            writer = csv.writer(csvFile)
            writer.writerow((self.post_id, self.post_title, self.post_url, self.reply_num, self.post_date,
                             self.open_id, self.open_type, self.user_name, self.user_sex, self.level_id,
                             self.level_name))
        finally:
            csvFile.close()

Then we locate the corresponding fields with the find method:

obj1 = json.loads(bsObj.find("div", attrs={"class": "l_post j_l_post l_post_bright noborder "}).attrs['data-field'])
reply_num = bsObj.find("li", attrs={"class": "l_reply_num"}).span.get_text()
post_title = bsObj.find("h1", attrs={"class": "core_title_txt"}).get_text()
post_id = obj1.get('content').get('post_id')
post_url = url
post_date = obj1.get('content').get('date')
open_id = obj1.get('content').get('open_id')
open_type = obj1.get('content').get('open_type')
user_name = obj1.get('author').get('user_name')
user_sex = obj1.get('author').get('user_sex')
level_id = obj1.get('author').get('level_id')
level_name = obj1.get('author').get('level_name')

Create the object and save it:

postinfo = PostInfo(post_id, post_title, post_url, reply_num, post_date, open_id, open_type, user_name,
                    user_sex, level_id, level_name)
postinfo.dump_to_csv('post_info2.csv')

You don't actually need a class to save this; it's just a personal preference.
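For comparison, a minimal sketch that skips the class entirely and writes the same fields straight to the csv file:

import csv

# Alternative: write the row directly, no PostInfo object
with open('post_info2.csv', 'a+', newline='') as csvFile:
    csv.writer(csvFile).writerow((post_id, post_title, post_url, reply_num, post_date,
                                  open_id, open_type, user_name, user_sex,
                                  level_id, level_name))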

Wrap the code above into a function that processes each post:

def del_post(urls):
    '''
    Process the posts at the given urls
    :param urls: post urls
    :return:
    '''
    for url in urls:
        bsObj = get_bsObj(url)
        try:
            obj1 = json.loads(
                bsObj.find("div", attrs={"class": "l_post j_l_post l_post_bright noborder "}).attrs['data-field'])
            reply_num = bsObj.find("li", attrs={"class": "l_reply_num"}).span.get_text()
            post_title = bsObj.find("h1", attrs={"class": "core_title_txt"}).get_text()
        except:
            print("发生错误:" + "---" + "时间:" + str(datetime.datetime.now()) + url)
            with open('error.txt', 'a+') as f:
                f.writelines("发生错误:" + "---" + "时间:" + str(datetime.datetime.now()) + url)
            return None
        post_id = obj1.get('content').get('post_id')
        post_url = url
        post_date = obj1.get('content').get('date')
        open_id = obj1.get('content').get('open_id')
        open_type = obj1.get('content').get('open_type')
        user_name = obj1.get('author').get('user_name')
        user_sex = obj1.get('author').get('user_sex')
        level_id = obj1.get('author').get('level_id')
        level_name = obj1.get('author').get('level_name')
        postinfo = PostInfo(post_id, post_title, post_url, reply_num, post_date, open_id, open_type, user_name,
                            user_sex, level_id, level_name)
        postinfo.dump_to_csv('post_info2.csv')
        del postinfo

The result looks something like:

98773024983,【轰动Python界】的学习速成高效大法,http://tieba.baidu.com/p/4811129571,2,2016-10-06 20:32,tieba,,openlabczx,0,7,贡士
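The file has no header row, so any reader has to supply the column names itself. A small sketch (the column names are mine, matching the writerow order above):

import csv

columns = ["post_id", "post_title", "post_url", "reply_num", "post_date",
           "open_id", "open_type", "user_name", "user_sex", "level_id", "level_name"]
with open('post_info2.csv') as f:
    for row in csv.reader(f):
        if row:  # "a+" appends can leave blank lines on some platforms
            print(dict(zip(columns, row)))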

     

Putting the functions together

First, we ask the user to enter the homepage link of the bar to crawl:

home_page_url = input("请输入要处理贴吧的主页链接")

Process the url:

bar_name = home_page_url.split("=")[1].split("&")[0]
pre_page_url = "http://tieba.baidu.com/f?kw=" + bar_name + "&ie=utf-8&pn="  # page-url prefix, without the page offset

Process the homepage:

all_post_num = del_mainPage(home_page_url)  # total number of posts in the bar

Ask the user how many posts to crawl:

del_post_num = int(input("请输入需要处理前多少条帖子:"))  # how many posts to process

Finally:

if del_post_num > all_post_num:
    print("需要处理的帖子数大于贴吧帖子总数!")
else:
    for page in range(0, del_post_num, 50):
        print("It's processing page : " + str(page))
        page_url = pre_page_url + str(page)
        urls = get_url_from_page(page_url)
        t = threading.Thread(target=del_post, args=(urls,))
        t.start()
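One caveat: the loop above starts a thread per page but never waits for them, so the program can move on (or exit) while workers are still writing. A minimal sketch (my own addition) that keeps every thread and joins them all:

import threading

threads = []
for page in range(0, del_post_num, 50):
    page_url = pre_page_url + str(page)
    t = threading.Thread(target=del_post, args=(get_url_from_page(page_url),))
    t.start()
    threads.append(t)
for t in threads:
    t.join()  # wait for every worker, not just the last one started

Note also that several threads appending to the same csv file can interleave rows; a lock around dump_to_csv, or one file per thread, would be safer.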

     

The main block:

if __name__ == '__main__':
    # home_page_url = input("请输入要处理贴吧的主页链接")
    home_page_url = test_url
    bar_name = home_page_url.split("=")[1].split("&")[0]
    pre_page_url = "http://tieba.baidu.com/f?kw=" + bar_name + "&ie=utf-8&pn="  # page-url prefix, without the page offset
    all_post_num = del_mainPage(home_page_url)  # total number of posts in the bar
    del_post_num = int(input("请输入需要处理前多少条帖子:"))  # how many posts to process
    if del_post_num > all_post_num:
        print("需要处理的帖子数大于贴吧帖子总数!")
    else:
        for page in range(0, del_post_num, 50):
            print("It's processing page : " + str(page))
            page_url = pre_page_url + str(page)
            urls = get_url_from_page(page_url)
            t = threading.Thread(target=del_post, args=(urls,))
            t.start()

The complete code:

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
import json
import datetime
import csv
import threading


class PostInfo:
    def __init__(self, post_id, post_title, post_url, reply_num, post_date, open_id, open_type,
                 user_name, user_sex, level_id, level_name):
        self.post_id = post_id
        self.post_title = post_title
        self.post_url = post_url
        self.reply_num = reply_num
        self.post_date = post_date
        self.open_id = open_id
        self.open_type = open_type
        self.user_name = user_name
        self.user_sex = user_sex
        self.level_id = level_id
        self.level_name = level_name

    def dump_to_csv(self, filename):
        csvFile = open(filename, "a+")
        try:
            writer = csv.writer(csvFile)
            writer.writerow((self.post_id, self.post_title, self.post_url, self.reply_num, self.post_date,
                             self.open_id, self.open_type, self.user_name, self.user_sex, self.level_id,
                             self.level_name))
        finally:
            csvFile.close()


def get_bsObj(url):
    '''
    Return the BeautifulSoup object for the given url
    :param url: target URL
    :return: BeautifulSoup object
    '''
    try:
        html = urlopen(url).read()
        bsObj = BeautifulSoup(html, "lxml")
        return bsObj
    except HTTPError as e:
        print(e)
        return None


def del_mainPage(url):
    '''
    Process the main page and return the last page number
    :param url: the bar's homepage URL
    :return: last page number, int
    '''
    bsObj_mainpage = get_bsObj(url)
    last_page = int(bsObj_mainpage.find("a", {"class": "last pagination-item "})['href'].split("=")[-1])
    try:
        red_text = bsObj_mainpage.findAll("span", {"class": "red_text"})
        subject_sum = red_text[0].get_text()  # topic count
        post_sum = red_text[1].get_text()  # post count
        follow_sum = red_text[2].get_text()  # follower count
    except AttributeError as e:
        print("发生错误:" + str(e) + "时间:" + str(datetime.datetime.now()))
        return None
    with open('main_info.txt', 'w+') as f:
        f.writelines("统计时间: " + str(datetime.datetime.now()) + "\n")
        f.writelines("主题数:  " + subject_sum + "\n")
        f.writelines("帖子数:  " + post_sum + "\n")
        f.writelines("关注量:  " + follow_sum + "\n")
    return last_page


def get_url_from_page(page_url):
    '''
    Process the given page and return the urls of the posts on it
    :param page_url: page link
    :return: urls of all posts on the page
    '''
    bsObj_page = get_bsObj(page_url)
    urls = []
    try:
        posts = bsObj_page.findAll("li", {"class": "j_thread_list"})
    except AttributeError as e:
        print("发生错误:" + str(e) + "时间:" + str(datetime.datetime.now()))
        return urls  # the page could not be fetched; return the empty list
    for post in posts:
        post_info = post.find("a", {"class": "j_th_tit "})
        urls.append("http://tieba.baidu.com" + post_info.attrs["href"])
    return urls


def del_post(urls):
    '''
    Process the posts at the given urls
    :param urls: post urls
    :return:
    '''
    for url in urls:
        bsObj = get_bsObj(url)
        try:
            obj1 = json.loads(
                bsObj.find("div", attrs={"class": "l_post j_l_post l_post_bright noborder "}).attrs['data-field'])
            reply_num = bsObj.find("li", attrs={"class": "l_reply_num"}).span.get_text()
            post_title = bsObj.find("h1", attrs={"class": "core_title_txt"}).get_text()
        except:
            print("发生错误:" + "---" + "时间:" + str(datetime.datetime.now()) + url)
            with open('error.txt', 'a+') as f:
                f.writelines("发生错误:" + "---" + "时间:" + str(datetime.datetime.now()) + url)
            return None
        post_id = obj1.get('content').get('post_id')
        post_url = url
        post_date = obj1.get('content').get('date')
        open_id = obj1.get('content').get('open_id')
        open_type = obj1.get('content').get('open_type')
        user_name = obj1.get('author').get('user_name')
        user_sex = obj1.get('author').get('user_sex')
        level_id = obj1.get('author').get('level_id')
        level_name = obj1.get('author').get('level_name')
        postinfo = PostInfo(post_id, post_title, post_url, reply_num, post_date, open_id, open_type, user_name,
                            user_sex, level_id, level_name)
        postinfo.dump_to_csv('post_info2.csv')
        # t = threading.Thread(target=postinfo.dump_to_csv, args=('post_info2.csv',))
        # t.start()
        del postinfo


test_url = "http://tieba.baidu.com/f?kw=python&ie=utf-8"

if __name__ == '__main__':
    # home_page_url = input("请输入要处理贴吧的主页链接")
    home_page_url = test_url
    bar_name = home_page_url.split("=")[1].split("&")[0]
    pre_page_url = "http://tieba.baidu.com/f?kw=" + bar_name + "&ie=utf-8&pn="  # page-url prefix, without the page offset
    all_post_num = del_mainPage(home_page_url)  # total number of posts in the bar
    del_post_num = int(input("请输入需要处理前多少条帖子:"))  # how many posts to process
    if del_post_num > all_post_num:
        print("需要处理的帖子数大于贴吧帖子总数!")
    else:
        for page in range(0, del_post_num, 50):
            print("It's processing page : " + str(page))
            page_url = pre_page_url + str(page)
            urls = get_url_from_page(page_url)
            t = threading.Thread(target=del_post, args=(urls,))
            t.start()
    t.join()  # note: this only waits for the last thread started
    # del_post(urls)

     

That's all.

Feel free to visit my blog: http://liqiongyu.com/blog


Reprinted from: https://www.cnblogs.com/Liqiongyu/p/5936019.html
