Simple Usage of Python's pyppeteer Library for Web Scraping


Published June 29, 2023

pyppeteer introduction

Before introducing pyppeteer, a word about Puppeteer. Puppeteer is a tool from Google, built on Node.js, that exposes an API for driving the Chrome browser: JavaScript code steers Chrome to carry out data scraping, automated testing of web applications, and similar tasks.

pyppeteer is the unofficial Python port of Puppeteer, a browser-automation library originally written by a Japanese engineer. Like Puppeteer, it calls Chrome's API to drive the browser for web crawling and automated web testing. pyppeteer is built on Python's asyncio coroutine library and can be combined with Scrapy for distributed crawling. (A puppet is the figure on strings; the puppeteer is the one pulling them.)

Differences between pyppeteer and Puppeteer

1. pyppeteer accepts both dictionary and keyword arguments; Puppeteer accepts only an options dictionary:

```python
# Puppeteer only supports dictionary (options-object) arguments
browser = await launch({'headless': True})
# pyppeteer supports both dictionaries and keyword arguments
browser = await launch({'headless': True})
browser = await launch(headless=True)
```

2. The `$` element-selector methods are renamed to `querySelector`:

```python
# Puppeteer uses the $ shorthand
page.$() / page.$$() / page.$x()
# pyppeteer uses Python-style method names
page.querySelector() / page.querySelectorAll() / page.xpath()
# short forms
page.J() / page.JJ() / page.Jx()
```

3. The arguments of page.evaluate() and page.querySelectorEval():

Puppeteer's evaluate() takes a native JavaScript function or a JavaScript expression string. pyppeteer's evaluate() takes only a JavaScript string, which may contain either a function or an expression, and pyppeteer guesses which one it is. The guess is sometimes wrong; if a string is misread as a function and an error results, pass force_expr=True to force pyppeteer to treat it as an expression.

Getting the page text:

```python
content = await page.evaluate('document.body.textContent', force_expr=True)
```

Getting the inner text of an element:

```python
element = await page.querySelector('h1')
title = await page.evaluate('(element) => element.textContent', element)
```

Installation

1. Install pyppeteer:

```
pip install pyppeteer
```

2. Install Chromium:

```
pyppeteer-install
```
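Because every pyppeteer call is a coroutine, all of the examples below run inside async functions driven by an event loop. As a browser-free warm-up, here is the same asyncio pattern with plain coroutines (`fetch_one` and `crawl` are illustrative names, not pyppeteer APIs):

```python
import asyncio

async def fetch_one(name, delay):
    # stand-in for an awaited pyppeteer call such as page.goto()
    await asyncio.sleep(delay)
    return f"{name} done"

async def crawl():
    # run several coroutines concurrently, as a crawler would
    return await asyncio.gather(fetch_one("a", 0.01), fetch_one("b", 0.02))

print(asyncio.run(crawl()))  # → ['a done', 'b done']
```

asyncio.gather returns the results in the order the coroutines were passed, which is why a multi-page crawl built this way keeps its pages in order.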

Simple usage

The fragments below walk through one script step by step.

```python
import asyncio
from pyppeteer import launch

async def main():
    url = '/'  # the target URL was elided in the source post
    # with headless=False the browser runs in headed (visible) mode
    browser = await launch(headless=False, ignoreDefaultArgs=['--enable-automation'])
    page = await browser.newPage()

    # set the page viewport size
    await page.setViewport(viewport={'width': 1600, 'height': 900})
```

```python
    # toggle JavaScript; with enabled=False the page is not rendered
    await page.setJavaScriptEnabled(enabled=True)
```

```python
    # navigate with a 1000 ms timeout
    res = await page.goto(url, options={'timeout': 1000})
    resp_headers = res.headers  # response headers
    resp_status = res.status    # response status
```
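If navigation exceeds the timeout, pyppeteer raises a `TimeoutError` (from `pyppeteer.errors`), which is worth catching. Since demonstrating that needs a live browser, here is the same guard pattern sketched with `asyncio.wait_for` on a plain coroutine (`slow_goto` and `guarded_load` are illustrative names):

```python
import asyncio

async def slow_goto():
    # stand-in for page.goto(url) against a slow site
    await asyncio.sleep(0.2)
    return "loaded"

async def guarded_load():
    try:
        # a 0.05 s budget, analogous to options={'timeout': 1000}
        return await asyncio.wait_for(slow_goto(), timeout=0.05)
    except asyncio.TimeoutError:
        return "timed out"

print(asyncio.run(guarded_load()))  # → timed out
```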

```python
    # wait
    await asyncio.sleep(2)
    await page.waitFor(1000)
    # second method: force a wait by polling for an element in a while loop
    while not await page.querySelector('.t'):
        pass
```
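pyppeteer also provides `page.waitForSelector()` for exactly this job, and note that a bare `while` loop spins without yielding between failed checks. A browser-free sketch of a gentler poll-and-sleep helper (all names here are illustrative, not pyppeteer APIs):

```python
import asyncio

async def poll_until(predicate, interval=0.05, timeout=1.0):
    # check predicate() repeatedly, sleeping between attempts instead of spinning
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while not predicate():
        if loop.time() > deadline:
            raise asyncio.TimeoutError("condition not met in time")
        await asyncio.sleep(interval)
    return True

async def demo():
    state = {"ready": False}

    async def appear():
        # stand-in for an element eventually appearing in the page
        await asyncio.sleep(0.1)
        state["ready"] = True

    asyncio.create_task(appear())
    return await poll_until(lambda: state["ready"])

print(asyncio.run(demo()))  # → True
```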

```python
    # scroll to the bottom of the page
    await page.evaluate('window.scrollBy(0, document.body.scrollHeight)')
```

```python
    # take a screenshot (the file path was left empty in the source post)
    await page.screenshot({'path': ''})
```

```python
    # print the page cookies
    print(await page.cookies())

    # get the complete HTML content
    print(await page.content())
```

```python
    # run a JavaScript function in the page; force_expr=False treats the string as a function
    dimensions = await page.evaluate(pageFunction='''() => {
        return {
            width: document.documentElement.clientWidth,    // page width
            height: document.documentElement.clientHeight,  // page height
            deviceScaleFactor: window.devicePixelRatio,     // pixel ratio
        }
    }''', force_expr=False)
    print(dimensions)
```

```python
    # run a JavaScript expression; force_expr=True treats the string as an expression
    # this returns only the page text
    content = await page.evaluate(pageFunction='document.body.textContent', force_expr=True)
    print(content)
```

```python
    # print the title of the current page
    print(await page.title())
```

```python
    # scrape the news items; XPath expressions also work
    # pyppeteer's three selector methods:
    #   page.querySelector(), page.querySelectorAll(), page.xpath()
    # short forms: page.J(), page.JJ(), page.Jx()
    element = await page.querySelector(".feed-infinite-wrapper > ul > li")
    print(element)
```

```python
    elements = await page.querySelectorAll(".title-box a")
    for item in elements:
        # get the text content of the link
        title_str = await (await item.getProperty('textContent')).jsonValue()
```

```python
        # get the href attribute of the link
        title_link = await (await item.getProperty('href')).jsonValue()
```

```python
        # getting an attribute value:
        # title = await (await item.getProperty('class')).jsonValue()
        print(title_str, title_link)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
```

Simulating text input and clicks

```python
# type an account and password; the delay option waits between keystrokes (ms)
await page.type('#kw', "百度", delay=100)
await page.type('#TPL_username_1', "asdasd")
await page.waitFor(1000)
await page.click('#su')
```

Removing the "Chrome is being controlled by automated test software" banner

```python
# pass ignoreDefaultArgs=['--enable-automation'] when launching
browser = await launch(headless=False, ignoreDefaultArgs=['--enable-automation'])
```

Scraping JD.com

```python
import asyncio
from pyppeteer import launch


def screen_size():
    return 1600, 900


async def main(url):
    browser = await launch({"args": ['--no-sandbox']})  # add "headless": False to watch it run
    page = await browser.newPage()
    width, height = screen_size()
    await page.setViewport(viewport={'width': width, 'height': height})
    await page.setJavaScriptEnabled(enabled=True)
    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                            '(KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36')
    await page.goto(url)
    # scroll to the bottom so lazy-loaded items render
    await page.evaluate('window.scrollBy(0, document.body.scrollHeight)')
    await asyncio.sleep(1)
    li_list = await page.xpath('//*[@id="J_goodsList"]/ul/li')
    item_list = []
    for li in li_list:
        a = await li.xpath('.//div[@class="p-img"]/a')
        detail_url = await (await a[0].getProperty('href')).jsonValue()
        promo_words = await (await a[0].getProperty('title')).jsonValue()
        a_ = await li.xpath('.//div[@class="p-commit"]/strong/a')
        p_commit = await (await a_[0].getProperty('textContent')).jsonValue()
        i = await li.xpath('./div/div[3]/strong/i')
        price = await (await i[0].getProperty('textContent')).jsonValue()
        em = await li.xpath('./div/div[4]/a/em')
        title = await (await em[0].getProperty('textContent')).jsonValue()
        item_list.append({
            "title": title,
            "detail_url": detail_url,
            "promo_words": promo_words,
            "p_commit": p_commit,
            "price": price,
        })
    await page_close(browser)
    return item_list


async def page_close(browser):
    for _page in await browser.pages():
        await _page.close()
    await browser.close()


# the URL host was elided in the source post
url = ('/Search?keyword=%E6%89%8B%E6%9C%BA&wq=%E6%89%8B%E6%9C%BA'
       '&pvid=e07184578b8442c58ddd65b221020e99&page={}&s=56&click=0')

task_list = []
for i in range(1, 4):
    page = i * 2 - 1  # JD search pages are numbered 1, 3, 5, ...
    task_list.append(main(url.format(page)))

results = asyncio.get_event_loop().run_until_complete(asyncio.gather(*task_list))
for i in results:
    print(i, len(i))
print('*' * 100)
```

This concludes this look at Python web scraping with the pyppeteer library. For more on scraping with pyppeteer, search the site's earlier articles; thank you for your support!
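As a quick browser-free check of the pagination in the JD example: logical page i maps to `page = i * 2 - 1` in the query string. The template below is a stand-in, since the real host is not shown in the source:

```python
# illustrative template; JD's real search URL host is elided in the source
url = '/Search?keyword=phone&page={}&s=56'

def build_urls(n):
    # logical pages 1..n map to JD page values 1, 3, 5, ...
    return [url.format(i * 2 - 1) for i in range(1, n + 1)]

print(build_urls(3))  # page values 1, 3, 5
```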

Publisher: admin. Please cite the source when reposting: http://www.yc00.com/news/1687982429a63507.html
