How to crawl the whole web with Python: scraping trending-list data from across the web in Python | 江阴雨辰互联

Published on June 29, 2023 (author: unspecified)

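I. Design of the topic-focused web crawler

1. Crawler name: scraping trending-list data from across the web
2. Content crawled and data characteristics:
1) trending lists;
2) each record has a date, a title, a link URL, and so on
3. Design overview:
1) analyze the HTML page to obtain its code structure;
2) program implementation:
a. define the code dictionary;
b. fetch the page with requests;
c. parse the page with the BeautifulSoup library;
d. save the data as xls with the pandas library;
e. define the main function main();
f. define separate helper functions so the parts are decoupled;

II. Structural analysis of the target page

Page parsing

3. Node (tag) lookup and traversal: use the find_all() and find() methods to locate the key classes and extract the data

III. Crawler program design

1. Data crawling and collection

Fetch the page with requests, setting a UA (User-Agent) header, and read the returned page data.

Partial code:

    import requests

    def getHtml(url):
        # Send a browser-like User-Agent so the request looks like a normal visitor
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/538.55 (KHTML, like Gecko) Chrome/81.0.3345.132 Safari/538.55'}
        resp = requests.get(url, headers=headers)
        return resp.text

[Screenshot of a partial run]

2. Cleaning and processing the data

Parse the page with the BeautifulSoup library: find_all() collects the blocks that hold the data, and find() then locates the key fields by their class attribute.

Partial code:

    import re
    import time

    import pandas
    from bs4 import BeautifulSoup

    def get_data(html):
        # Parser name assumed to be Python's built-in html.parser
        soup = BeautifulSoup(html, 'html.parser')
        # Each 'cc-cd' div is one source's trending list
        nodes = soup.find_all('div', class_='cc-cd')
        return nodes

    def get_node_data(df, nodes):
        now = int(time.time())
        for node in nodes:
            source = node.find('div', class_='cc-cd-lb').text.strip()
            messages = node.find('div', class_='cc-cd-cb-l nano-content').find_all('a')
            for message in messages:
                content = message.find('span', class_='t').text.strip()
                if source == '微信':
                    # Drop the leading 「...」 prefix from WeChat titles
                    reg = '「.+?」(.+)'
                    content = re.findall(reg, content)[0]
                if df.empty or df[df.content == content].empty:
                    # New item: record it with the current timestamp
                    data = {
                        'content': [content],
                        'url': [message['href']],
                        'source': [source],
                        'start_time': [now],
                        'end_time': [now]
                    }
                    item = pandas.DataFrame(data)
                    df = pandas.concat([df, item], ignore_index=True)
                else:
                    # Item seen before: only refresh its end_time
                    index = df[df.content == content].index[0]
                    df.loc[index, 'end_time'] = now
        return df

[Screenshot of a partial run]

3. Data persistence

Save the data as xls with the pandas library.

Partial code:

    import pandas

    res = pandas.read_excel('')  # file name not given in the source
    res = get_node_data(res, data)
    res.to_excel('')  # file name not given in the source

[Screenshot of a partial run]

4. Putting the parts above together, the complete code:

    import requests
    from bs4 import BeautifulSoup
    import time
    import pandas
    import re

    def getHtml(url):
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/538.55 (KHTML, like Gecko) Chrome/81.0.3345.132 Safari/538.55'}
        resp = requests.get(url, headers=headers)
        return resp.text

    def get_data(html):
        # Parser name assumed to be Python's built-in html.parser
        soup = BeautifulSoup(html, 'html.parser')
        nodes = soup.find_all('div', class_='cc-cd')
        return nodes

    def get_node_data(df, nodes):
        now = int(time.time())
        for node in nodes:
            source = node.find('div', class_='cc-cd-lb').text.strip()
            messages = node.find('div', class_='cc-cd-cb-l nano-content').find_all('a')
            for message in messages:
                content = message.find('span', class_='t').text.strip()
                if source == '微信':
                    reg = '「.+?」(.+)'
                    content = re.findall(reg, content)[0]
                if df.empty or df[df.content == content].empty:
                    data = {
                        'content': [content],
                        'url': [message['href']],
                        'source': [source],
                        'start_time': [now],
                        'end_time': [now]
                    }
                    item = pandas.DataFrame(data)
                    df = pandas.concat([df, item], ignore_index=True)
                else:
                    index = df[df.content == content].index[0]
                    df.loc[index, 'end_time'] = now
        return df

    url = ''  # target URL not given in the source
    html = getHtml(url)
    data = get_data(html)
    res = pandas.read_excel('')  # file name not given in the source
    res = get_node_data(res, data)
    res.to_excel('')  # file name not given in the source

IV. Conclusion

For this make-up programming assignment I chose to scrape aggregated trending-list data from a site that collects hot lists from across the web, rather than the trending list of each individual site. I also use this site regularly to follow trending news nationwide. Scraping it is relatively simple, and I know the site well: first, it is a static page; second, the nodes are easy to locate through their class attributes; and the crawler needs no special disguise beyond setting the UA so that it looks like a normal visitor.

Summary:
1. Encoding matters. At first the Chinese text in the parsed data came out garbled, mainly because of conversion between GBK and UTF-8;
2. Build the habit of splitting code into decoupled parts and checking each one. At first I wrote the code straight through as a single lump, and when something broke it was very hard to tell where. After splitting it into functions, each part and each feature stands on its own, so the code reads more clearly and fewer problems occur;
3. My fundamentals are not solid enough, and I need to keep working on them;

Finally, this make-up exam took my ability to apply Python a step further, and I benefited a great deal from it.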
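The GBK/UTF-8 problem mentioned in summary point 1 can be reproduced without any network access. The sketch below is only an illustration and not part of the original program: `decode_html` is a hypothetical helper, and the candidate encodings (UTF-8 first, then GBK) are an assumption about the pages being scraped.

```python
def decode_html(raw: bytes, candidates=("utf-8", "gbk")) -> str:
    """Decode raw page bytes, trying each candidate encoding in order.

    Hypothetical helper for illustration; not part of the original crawler.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to the first candidate and
    # substitute a replacement character for undecodable bytes.
    return raw.decode(candidates[0], errors="replace")


title = "全网热点榜单"
utf8_bytes = title.encode("utf-8")
gbk_bytes = title.encode("gbk")

# Decoding UTF-8 bytes as GBK is the kind of mismatch that garbles Chinese text.
garbled = utf8_bytes.decode("gbk", errors="replace")

print(decode_html(utf8_bytes) == title)
print(decode_html(gbk_bytes) == title)
print(garbled == title)
```

With requests, one way to apply this idea would be to pass the raw bytes in `resp.content` to such a helper instead of relying on `resp.text`, or to set `resp.encoding` before reading `resp.text`. Which approach fits depends on how the target site declares its encoding.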

Posted by: admin. Please credit the source when reposting: http://www.yc00.com/web/1687985742a63943.html