Python Web Scraping in Practice (6): Scraping Baidu Index Data for a Keyword


Published June 29, 2023

Chase the wind and catch the moon without pause; spring mountains lie where the flat wilderness ends.

I finally have time to update my blog!

This time we're going to tackle scraping the Baidu Index.

1. Page Analysis

Let's run a Baidu Index query with "爬虫" (web crawler) as the keyword. Press F12 to open developer mode, refresh the page, then click Network -> XHR -> index?area=0&word=... -> Preview, and you'll see... what even is this? The data field is obviously encrypted. Hair-pulling. Let's set that aside for now and keep going.

2. API Analysis

URL analysis

The request clearly has three parameters:

1. word — the keyword to search for
2. startDate — the start date of the data
3. endDate — the end date of the data

Get these three parameters under control and the data is as good as in hand!

Response analysis

It's a GET request; the response is JSON, encoded as UTF-8.

3. Writing the Code

Now that we know the URL pattern and the response format, the task is to build the URL and request the data (the host is the Baidu Index site, index.baidu.com):

```python
url = "https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22{}%22,%22wordType%22:1%7D]]&startDate=2011-01-02&endDate=2022-01-02".format(keyword)
```

So let's just go ahead and request it. For convenience we wrap the page request in a function, get_html(url), which takes the URL and returns the response body:

```python
def get_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
        "Host": "index.baidu.com",
        "Referer": "https://index.baidu.com/v2/main/",
    }
    cookies = {
        "Cookie": "your cookie"  # replace with your own cookie
    }
    response = requests.get(url, headers=headers, cookies=cookies)
    return response.text
```

Note: you absolutely must replace the cookie with your own here, or the request will return nothing.

Fetching the data

Parse the fetched content as JSON and take out the data field:

```python
def get_data(keyword):
    url = "https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22{}%22,%22wordType%22:1%7D]]&startDate=2011-01-02&endDate=2022-01-02".format(keyword)
    data = get_html(url)
    data = json.loads(data)
    data = data['data']['userIndexes'][0]['all']['data']
```

OK, that's the data collected — see you next time, bye~

...so how do we actually handle that encrypted data? As you can see, data looks encrypted; all is the combined data, pc is the desktop data, and wise is mobile — you can find these in the JS files. First, let's figure out how this encrypted-looking data gets decrypted. We know the response is JSON, so the page's code must be pulling the data out of it somewhere. Refresh the page so that all the JS files load, then use the search function to dig through them. I'll skip the screenshots of the search; I found it by searching for decrypt: a JS file containing a method named decrypt.

Decryption

Translated into Python, together with a helper that fetches the decryption key (ptbk) for the response's uniqid:

```python
def decrypt(t, e):
    n = list(t)
    i = list(e)
    a = {}
    result = []
    ln = int(len(n) / 2)
    start = n[ln:]
    end = n[:ln]
    for j, k in zip(start, end):
        a.update({k: j})
    for j in e:
        result.append(a.get(j))
    return ''.join(result)


def get_ptbk(uniqid):
    url = 'https://index.baidu.com/Interface/ptbk?uniqid={}'
    resp = get_html(url.format(uniqid))
    return json.loads(resp)['data']
```

Full code

```python
# -*- coding:utf-8 -*-
# @Time: 2022/1/4 8:35
# @Author: 韩国麦当劳
# @Environment: Python 3.7
import datetime
import requests
import sys
import time
import json

word_url = 'https://index.baidu.com/api/SearchApi/thumbnail?area=0&word={}'


def get_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
        "Host": "index.baidu.com",
        "Referer": "https://index.baidu.com/v2/main/",
    }
    cookies = {
        'Cookie': 'your cookie'  # replace with your own cookie
    }
    response = requests.get(url, headers=headers, cookies=cookies)
    return response.text


def decrypt(t, e):
    n = list(t)
    i = list(e)
    a = {}
    result = []
    ln = int(len(n) / 2)
    start = n[ln:]
    end = n[:ln]
    for j, k in zip(start, end):
        a.update({k: j})
    for j in e:
        result.append(a.get(j))
    return ''.join(result)


def get_ptbk(uniqid):
    url = 'https://index.baidu.com/Interface/ptbk?uniqid={}'
    resp = get_html(url.format(uniqid))
    return json.loads(resp)['data']


def get_data(keyword, start='2011-01-02', end='2022-01-02'):
    url = "https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22{}%22,%22wordType%22:1%7D]]&startDate={}&endDate={}".format(keyword, start, end)
    data = get_html(url)
    data = json.loads(data)
    uniqid = data['data']['uniqid']
    data = data['data']['userIndexes'][0]['all']['data']
    ptbk = get_ptbk(uniqid)
    result = decrypt(ptbk, data)
    result = result.split(',')
    start = start.split("-")
    end = end.split("-")
    a = datetime.date(int(start[0]), int(start[1]), int(start[2]))
    b = datetime.date(int(end[0]), int(end[1]), int(end[2]))
    node = 0
    for i in range(a.toordinal(), b.toordinal()):
        date = datetime.date.fromordinal(i)
        print(date, result[node])
        node += 1


if __name__ == '__main__':
    keyword = "爬虫"
    start_date = "2011-01-02"
    end_date = "2022-01-02"
    get_data(keyword, start_date, end_date)
```

A like, comment, and follow would be much appreciated! Which site's scraper would you like to see next? Leave a comment — the next write-up might be exactly the one you're waiting for!
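To see why decrypt works, note that the ptbk key is just a substitution table: its first half lists the cipher characters and its second half the matching plain characters, in the same order. A toy example (with a made-up key, not real Baidu output) makes this concrete:

```python
def decrypt(t, e):
    # t: the ptbk key; first half = cipher chars, second half = plain chars
    # e: the encrypted data string
    n = list(t)
    a = {}
    result = []
    ln = int(len(n) / 2)
    start = n[ln:]   # plain characters
    end = n[:ln]     # cipher characters
    for j, k in zip(start, end):
        a.update({k: j})          # map cipher char -> plain char
    for j in e:
        result.append(a.get(j))   # translate each character of the data
    return ''.join(result)

# Made-up key: 'a'->'1', 'b'->'2', 'c'->'0', ',' -> ','
key = "abc,120,"
print(decrypt(key, "ab,cb"))  # prints 12,02
```

Decrypting thus turns the gibberish into a comma-separated list of daily index values, which is exactly what the full script splits on `,`.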
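One note on the cookie: the script above passes the raw string under a key named "Cookie", which requests will send as a single cookie of that name. If you would rather hand requests a proper name-to-value mapping, a small helper (hypothetical, not from the original post) can split the string you copy out of DevTools:

```python
def cookie_str_to_dict(raw: str) -> dict:
    """Split a raw Cookie header string (as copied from DevTools)
    into a name -> value dict suitable for requests' cookies= argument."""
    cookies = {}
    for pair in raw.split(";"):
        if "=" in pair:
            name, _, value = pair.strip().partition("=")
            cookies[name] = value
    return cookies

# Example with a made-up cookie string:
print(cookie_str_to_dict("BDUSS=abc123; BAIDUID=xyz; PSTM=1600000000"))
```

`partition` splits only on the first `=`, so cookie values that themselves contain `=` survive intact.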
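The final loop pairs each decrypted value with a calendar date by walking day ordinals from the start date up to (but not including) the end date. If you'd rather keep the results than print them, the same idea can write a CSV instead — a sketch, with a hypothetical save_csv helper name:

```python
import csv
import datetime


def save_csv(values, start="2011-01-02", end="2011-01-07", path="index.csv"):
    # Walk one day at a time from start (inclusive) to end (exclusive),
    # pairing each date with the corresponding decrypted value,
    # mirroring the toordinal()/fromordinal() loop in the full script.
    a = datetime.date.fromisoformat(start)
    b = datetime.date.fromisoformat(end)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "index"])
        for ordinal, value in zip(range(a.toordinal(), b.toordinal()), values):
            writer.writerow([datetime.date.fromordinal(ordinal).isoformat(), value])


save_csv(["10", "12", "9", "15", "11"])
```

Like the original loop, this assumes the API returned one value per day in the requested range.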

