Python Web Scraping Tutorial (Part 3): XPath and BeautifulSoup4



Processing HTML documents with regular expressions is tedious. Instead, we can first treat the HTML file as an XML document and then use XPath to find HTML nodes and elements. XML stands for eXtensible Markup Language. A few key points:

• XML is a markup language, very similar to HTML
• XML was designed to transport data, not to display data
• XML tags are not predefined; we define our own
• XML is designed to be self-descriptive
• XML is a W3C recommendation

XML example (the classic bookstore document):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>

  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>

  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>

  <book category="web">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <author>Kurt Cagle</author>
    <author>James Linn</author>
    <author>Vaidyanathan Nagarajan</author>
    <year>2003</year>
    <price>49.99</price>
  </book>

  <book category="web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>

</bookstore>
```

Differences between XML and HTML: XML was designed to transport and store data, while HTML was designed to display it.

HTML DOM model: the HTML DOM defines a standard way to access and manipulate HTML documents, representing the HTML document as a tree structure.
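To make the tree structure concrete, here is a minimal sketch that walks the bookstore document above with lxml; the file name bookstore.xml is just an assumption for this example.

```python
from lxml import etree

# Parse the bookstore XML (file name assumed for this sketch)
root = etree.parse("bookstore.xml").getroot()

print(root.tag)          # 'bookstore' -- the root node of the tree
for book in root:        # the <book> children of the root
    # each <book> node in turn has <title>, <author>, <year>, <price> children
    print("%s: %s" % (book.find("title").text, book.find("price").text))
```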

XPath

XPath (XML Path Language) is a language for finding information in an XML document; it can be used to traverse the elements and attributes of an XML document. Browser plugins for trying out expressions:

• Chrome plugin: XPath Helper
• Firefox plugin: XPath Checker

XPath syntax. The most common path expressions are: a node name (selects all child nodes of the named node), / (selects from the root node), // (selects matching nodes anywhere in the document), . (the current node), .. (the parent of the current node), and @ (selects attributes).

Predicates. A predicate is used to find a specific node, or a node that contains a specific value, and is embedded in square brackets, e.g. book[1], book[last()], title[@lang], book[price>35.00]. XPath also supports selecting unknown nodes with wildcards (*, @*) and selecting several paths at once with the | operator.
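A minimal sketch of these expressions run with lxml against the bookstore document above (again assuming it is saved as bookstore.xml):

```python
from lxml import etree

tree = etree.parse("bookstore.xml")   # file name assumed, as above

# nodename and /: all <book> children of the root <bookstore>
print(tree.xpath("/bookstore/book"))

# predicate [1]: the title of the first <book>
print(tree.xpath("/bookstore/book[1]/title/text()"))        # ['Everyday Italian']

# predicate [last()]: the title of the last <book>
print(tree.xpath("/bookstore/book[last()]/title/text()"))   # ['Learning XML']

# // and @: the lang attribute of every <title> in the document
print(tree.xpath("//title/@lang"))                          # ['en', 'en', 'en', 'en']

# value predicate: titles of the books priced above 35.00
print(tree.xpath("/bookstore/book[price>35.00]/title/text()"))
```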

lxml

Install the library with: pip install lxml

lxml is an HTML/XML parser whose main job is parsing and extracting HTML/XML data. Like the regex engine, lxml is implemented in C, making it a high-performance Python HTML/XML parser; it understands XPath syntax, so specific elements and node information can be located quickly.

Simple usage:

```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree

text = '''
<div>
    <ul>
        <li>11</li>
        <li>22</li>
        <li>33</li>
        <li>44</li>
'''

# Use etree.HTML to parse the string into an HTML document
html = etree.HTML(text)

# Serialize the HTML document back to a string
result = etree.tostring(html)
print(result)
```

Result: lxml repairs the fragment, adding the missing closing tags and wrapping everything in html/body tags.

Example: crawling every image from Tieba posts

1. First find the set of URLs of the posts on the list page
2. Then find the complete URL of every image inside each post
3. Use the lxml module to parse the HTML

```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib
import urllib2
from lxml import etree


def loadPage(url):
    """
    Send a request to url and fetch the server's response.
    url: the page to crawl
    """
    request = urllib2.Request(url)
    html = urllib2.urlopen(request).read()
    # Parse the HTML document into an HTML DOM model
    content = etree.HTML(html)
    # Return the list of all matching results
    link_list = content.xpath('//div[@class="t_con cleafix"]/div/div/div/a/@href')
    for link in link_list:
        # Build the full link of each post (the site's base URL
        # was lost in the source text, hence the empty prefix)
        fulllink = "" + link
        loadImage(fulllink)


# Pull out every image link inside a post
def loadImage(link):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}
    request = urllib2.Request(link, headers=headers)
    html = urllib2.urlopen(request).read()
    # Parse
    content = etree.HTML(html)
    # Collect the image links posted on every floor of the thread
    link_list = content.xpath('//img[@class="BDE_Image"]/@src')
    # Save each image
    for link in link_list:
        writeImage(link)


def writeImage(link):
    """
    Write the image content to a local file.
    link: the image URL
    """
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
    request = urllib2.Request(link, headers=headers)
    # Raw image data
    image = urllib2.urlopen(request).read()
    # Use the last 10 characters of the link as the file name
    filename = link[-10:]
    # Write to a file on local disk
    with open(filename, "wb") as f:
        f.write(image)
    print "Downloaded " + filename


def tiebaSpider(url, beginPage, endPage):
    """
    Tieba spider scheduler: builds and processes the URL of each page.
    url: the fixed front part of the Tieba URL
    beginPage: first page
    endPage: last page
    """
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50
        fullurl = url + "&pn=" + str(pn)
        loadPage(fullurl)
    print "Thanks for using the spider"


if __name__ == "__main__":
    kw = raw_input("Name of the Tieba forum to crawl: ")
    beginPage = int(raw_input("First page: "))
    endPage = int(raw_input("Last page: "))

    # The host part of the URL was lost in the source text
    url = "/f?"
    key = urllib.urlencode({"kw": kw})
    fullurl = url + key
    tiebaSpider(fullurl, beginPage, endPage)
```

4. All the crawled images are saved on the local machine.

Selector: BeautifulSoup4

Like lxml, Beautiful Soup is an HTML/XML parser whose main job is also parsing and extracting HTML/XML data. lxml only traverses the document locally, while Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the complete DOM tree, so its time and memory overhead are much larger and its performance is lower than lxml's. BeautifulSoup is quite simple for parsing HTML and has a very human-friendly API; it supports the HTML parser in the Python standard library as well as lxml's XML parser.

Beautiful Soup 3 is no longer developed; new projects should use Beautiful Soup 4. Install it with pip: pip install beautifulsoup4
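As a minimal sketch of that API, here is the same list fragment from the lxml example parsed with BeautifulSoup4:

```python
from bs4 import BeautifulSoup

text = '''
<div>
    <ul>
        <li>11</li>
        <li>22</li>
        <li>33</li>
        <li>44</li>
    </ul>
</div>
'''

# Build the DOM tree; 'lxml' selects the lxml parser backend
soup = BeautifulSoup(text, 'lxml')

print(soup.li)                                         # the first <li> tag
print(soup.find_all('li')[1].get_text())               # 22
print([li.get_text() for li in soup.select('ul li')])  # ['11', '22', '33', '44']
```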

Using BeautifulSoup4 to crawl Tencent job postings

```python
from bs4 import BeautifulSoup
import urllib2
import urllib
import json  # the results are stored in JSON format


def tencent():
    # The host part of the URL was lost in the source text
    url = '/'
    request = urllib2.Request(url + '?&start=10#a')
    response = urllib2.urlopen(request)
    resHtml = response.read()
    # Output file name assumed; it was lost in the source text
    output = open('tencent.json', 'w')

    html = BeautifulSoup(resHtml, 'lxml')

    # CSS selectors: the even and odd table rows hold the postings
    result = html.select('tr[class="even"]')
    result2 = html.select('tr[class="odd"]')
    result += result2

    items = []
    for site in result:
        item = {}
        name = site.select('td a')[0].get_text()
        detailLink = site.select('td a')[0].attrs['href']
        catalog = site.select('td')[1].get_text()
        recruitNumber = site.select('td')[2].get_text()
        workLocation = site.select('td')[3].get_text()
        publishTime = site.select('td')[4].get_text()

        item['name'] = name
        item['detailLink'] = url + detailLink
        item['catalog'] = catalog
        item['recruitNumber'] = recruitNumber
        item['workLocation'] = workLocation
        item['publishTime'] = publishTime
        items.append(item)

    # Disable ascii escaping so the output is encoded as utf-8
    line = json.dumps(items, ensure_ascii=False)
    output.write(line.encode('utf-8'))
    output.close()


if __name__ == "__main__":
    tencent()
```

JSON and JsonPath

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and also convenient for machines to parse and generate. It suits data-exchange scenarios such as the traffic between a site's front end and back end.

JsonPath is an information-extraction library for pulling specified pieces of information out of JSON documents, with implementations in several languages, including JavaScript, Python, PHP and Java. JsonPath is to JSON what XPath is to XML.

JsonPath vs. XPath syntax: JSON has a clear structure, high readability and low complexity, and is very easy to match. The corresponding operators are:

• $ (JsonPath) ↔ / (XPath): the root object/element
• @ ↔ . : the current object/element
• . or [] ↔ / : the child operator
• .. ↔ // : recursive descent
• * ↔ * : wildcard
• [,] ↔ | : union of several paths
• ?() ↔ [] : filter expression

JSONPath example: crawling all the cities listed on Lagou

```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib2
# json parsing library, the counterpart of lxml
import json
# jsonpath: a parsing syntax for json, the counterpart of xpath
import jsonpath

# The host part of the URL was lost in the source text
url = "/lbs/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}

request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
# Read the json content; it comes back as a string
html = response.read()

# Convert the json string into a Python unicode object
unicodestr = json.loads(html)

# A Python list of all city names
city_list = jsonpath.jsonpath(unicodestr, "$..name")

# dumps() escapes non-ascii characters by default (ensure_ascii defaults
# to True); disable it to get a readable unicode string back
array = json.dumps(city_list, ensure_ascii=False)

# Output file name assumed; it was lost in the source text
with open("city.json", "w") as f:
    f.write(array.encode("utf-8"))
```

Result: the full city list is written to the output file.

Crawling Qiushibaike

1. Use XPath's fuzzy matching
2. Extract the content of each post
3. Save the results to a json file

```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib2
import json
from lxml import etree

# The host part of the URL was lost in the source text
url = "/8hr/page/2/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}

request = urllib2.Request(url, headers=headers)
html = urllib2.urlopen(request).read()

# The response is a string; parse it into an HTML DOM model
text = etree.HTML(html)

# Return the node of every post; contains() is the fuzzy-matching method:
# the first argument is the attribute to match, the second is a substring
# of the attribute's value
node_list = text.xpath('//div[contains(@id, "qiushi_tag")]')

items = {}
for node in node_list:
    # xpath returns a list; this one holds a single item, taken out by
    # index -- the username
    username = node.xpath('./div/a/@title')[0]
    # The text inside the tag -- the post content
    content = node.xpath('.//div[@class="content"]/span')[0].text
    # Upvotes
    zan = node.xpath('.//i')[0].text
    # Comments
    comments = node.xpath('.//i')[1].text

    items = {
        "username": username,
        "content": content,
        "zan": zan,
        "comments": comments
    }

    # Output file name assumed; it was lost in the source text
    with open("qiushi.json", "a") as f:
        f.write(json.dumps(items, ensure_ascii=False).encode("utf-8") + "\n")
```
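Since each record is appended as one JSON document per line, a short companion sketch (using the same assumed file name) can read the results back:

```python
import json

# Read back the newline-delimited JSON written by the spider above
# (file name assumed, as in the spider)
with open("qiushi.json") as f:
    for line in f:
        item = json.loads(line)
        print("%s  zan=%s  comments=%s" % (item["username"], item["zan"], item["comments"]))
```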
