Published June 29, 2023
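Puppeteer Usage Examples Explained

PhantomJS was once the king of headless browsers, used for testing, crawling, and more. With the arrival of Google Chrome Headless, the author of PhantomJS has made it clear that it will no longer be updated. Chrome Headless looks set to be the future of crawling, while testing will keep using the WebDriver stack. Chrome Headless can be driven through WebDriver or through its companion API, Puppeteer (literally "the person who works the puppets"). It is as powerful as its name suggests and gives you full control over Chrome or Chromium. The drawback is that it only offers a Node API.

Puppeteer is a Node library that controls headless Chrome over the DevTools protocol. It requires Node 6.4 or later, and the async/await syntax used in the examples needs Node 7.6 or later. I only started learning Node when I picked up this tool, and I still find async/await remarkably powerful. Puppeteer relies heavily on asynchronous calls to get its work done.

Puppeteer can be installed with npm, Node's package manager:

npm i puppeteer

The install downloads Chromium automatically. If you do not need it, you can configure npm to skip the download. As a crawler engineer I will not cover testing here, so let's look at how to use it. As with WebDriver, the first step is to instantiate a browser:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.baidu.com'); // the Baidu home page
  await browser.close();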
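})();

When this code finishes you may not notice anything, because it started a Chromium process in the background, opened the Baidu home page, and then closed it. You can also run Chromium in the foreground by passing options to launch(). Commonly used options:

headless: whether to run without a visible browser window; defaults to true
ignoreHTTPSErrors: whether to ignore HTTPS errors; defaults to false
executablePath: path of the browser executable to run; defaults to the Chromium installed with Puppeteer
slowMo: slows down Puppeteer operations by the given number of milliseconds
args: extra browser flags, such as "--no-sandbox" to disable the sandbox or "--proxy-server" to route traffic through a proxy

A usage example:

const browser = await puppeteer.launch({ headless: false, args: ["--no-sandbox"] }); // opens a visible browser

Open a new page:

const page = await browser.newPage();

Set the viewport size:

await page.setViewport({ width: 1920, height: 1080 });

Filter out unwanted requests:

await page.setRequestInterception(true);
page.on('request', interceptedRequest => {
  if (interceptedRequest.url().endsWith('.png') || interceptedRequest.url().endsWith('.jpg'))
    interceptedRequest.abort();
  else
    interceptedRequest.continue();
});

Set the browser's userAgent:

await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299");

Set a cookie:

const data = {
  name: "smidB2",
  domain: ".",
  value: "283cf43902aa8991a248f9c605204f92530032f23ef22c16270"
};
await page.setCookie(data);

This is only a demonstration. Real cookies come as a list, so they have to be added in a loop:

for (let data of cookies) {
  await page.setCookie(data);
}

Request a URL:

const url = "";
await page.goto(url, { waitUntil: "networkidle2" });

Pause for a fixed time:

await page.waitFor(1000); // in milliseconds

Wait for an element on the page to finish loading:

await page.waitForSelector("input[class='usrname']");

Click an element:

await page.click("input[class='submit']");

Use page.evaluate() to scroll to the bottom of the page. It works by injecting JavaScript into the page:

let scrollEnable = true; // must start as true, otherwise the loop never runs
let scrollStep = 500; // distance scrolled on each step
while (scrollEnable) {
  scrollEnable = await page.evaluate((scrollStep) => {
    let scrollTop = document.scrollingElement.scrollTop;
    document.scrollingElement.scrollTop = scrollTop + scrollStep;
    // keep scrolling while there is still content below the 1080px viewport
    return document.body.clientHeight > scrollTop + 1080 ? true : false;
  }, scrollStep);
  await page.waitFor(600);
}

Get the page HTML:

const frame = await page.mainFrame();
const bodyHandle = await frame.$('html');
const html = await frame.evaluate(body => body.outerHTML, bodyHandle);
await bodyHandle.dispose(); // release the handle
console.log(html);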
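The request filter above can be factored into a small pure helper, so the blocking rule can be checked without launching a browser. This is only a minimal sketch under assumptions: `BLOCKED_EXTENSIONS`, `shouldBlock`, and `handleRequest` are hypothetical names that are not part of Puppeteer, and the extension list is illustrative. The only Puppeteer calls it relies on are `url()`, `abort()`, and `continue()` on the intercepted request, the same ones used in the filter above.

```javascript
// Hypothetical helper names; only url(), abort() and continue()
// are taken from Puppeteer's intercepted request object.
const BLOCKED_EXTENSIONS = ['.png', '.jpg', '.jpeg', '.gif'];

// Returns true when the URL path ends with a blocked extension.
// The query string and fragment are dropped first, so "a.png?v=2" still matches.
function shouldBlock(url) {
  const path = url.split(/[?#]/)[0].toLowerCase();
  return BLOCKED_EXTENSIONS.some((ext) => path.endsWith(ext));
}

// Handler intended for page.on('request', handleRequest),
// registered after page.setRequestInterception(true).
function handleRequest(interceptedRequest) {
  if (shouldBlock(interceptedRequest.url())) {
    interceptedRequest.abort();
  } else {
    interceptedRequest.continue();
  }
}
```

With this in place, the registration would shrink to `page.on('request', handleRequest)`, and new file types can be blocked by extending the list rather than the condition.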
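Those are roughly the operations a crawler needs. Below is code that crawls the basic information and ratings of Douban's popular movies. I only half understood Node when I wrote it, so if anything is wrong, please leave a comment.

First, a small wrapper module around Puppeteer:

const puppeteer = require("puppeteer")

class BasePuppeteer {
  puppConfig() {
    const config = {
      headless: false
    }
    return config
  }
  async openBrower(setting) {
    const browser = await puppeteer.launch(setting)
    return browser
  }
  async openPage(browser) {
    const page = await browser.newPage()
    return page
  }
  async closeBrower(browser) {
    await browser.close()
  }
  async closePage(page) {
    await page.close()
  }
}

const pupp = new BasePuppeteer()
module.exports = pupp

The crawler script, in a separate file, imports that module:

const pupp = require("./")
const cheerio = require("cheerio")
const mongo = require("mongodb")
const assert = require("assert")

const MongoClient = mongo.MongoClient
const Urls = "mongodb://10.4.251.129:27017/douban"

MongoClient.connect(Urls, function (err, db) {
  if (err) throw err;
  console.log('Database created');
  var dbase = db.db("runoob");
  dbase.createCollection('detail', function (err, res) {
    if (err) throw err;
    console.log("Collection created!");
    db.close();
  });
});

async function getList() {
  const brower = await pupp.openBrower()
  const page = await pupp.openPage(brower)
  const url = "/explore#!type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0"
  await page.goto(url);
  while (true) { // keep clicking until the element can no longer be found
    try {
      await page.waitFor(1000);
      await page.waitForSelector('a[class=more]'); // wait for the element to load; the timeout is 30000 ms
      await page.click("a[class=more]")
      // break
    } catch (err) {
      console.log(err)
      console.log("stop click ")
      break
    }
  }
  await page.waitFor(1000); // wait one second
  const links = await page.evaluate(() => { // collect the movie detail URLs
    let movies = [...document.querySelectorAll('.list a[class=item]')];
    return movies.map((movie) => {
      return {
        href: movie.href.trim(),
      }
    });
  });
  console.log(links.length)
  for (var i = 0; i < links.length; i++) {
    const a = links[i];
    await page.waitFor(2000);
    await getDetail(brower, a.href)
    // break
  }
  await pupp.closePage(page)
  await pupp.closeBrower(brower)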
}

async function getDetail(brower, url) {
  const page = await pupp.openPage(brower)
  await page.goto(url);
  await page.waitFor(1000);
  try {
    await page.click(".more-actor", { delay: 20 })
  } catch (err) {
    console.log(err)
  }
  const frame = await page.mainFrame()
  const bodyHandle = await frame.$('html');
  const html = await frame.evaluate(body => body.outerHTML, bodyHandle);
  await bodyHandle.dispose(); // release the handle
  const $ = cheerio.load(html)
  const title = $("h1 span").text().trim()
  const rating_num = $(".rating_num").text().trim()
  const data = {}
  data["title"] = title
  data["rating_num"] = rating_num
  let info = $("#info").text()
  const keyword = ["director", "screenplay", "lead", "type", "website", "location", "language", "playdate", "playtime", "byname", "imdb"]
  if (info.indexOf("www.") > 0) {
    // strip URL schemes first so their colons do not split the fields
    info = info.replace(/https:|http:/g, "").replace(/\t/g, " ").replace(/\r/g, " ").split(":")
    for (var i = 1; i < info.length; i++) {
      data[keyword[i - 1]] = info[i].split(/\n/g)[0].replace(/ /g, ",").trim()
    }
  } else {
    info = info.replace(/\t/g, " ").replace(/\r/g, " ").split(":")
    info.splice(4, 1)
    for (var i = 1; i < info.length - 1; i++) {
      data[keyword[i - 1]] = info[i].split(/\n/g)[0].replace(/ /g, ",").trim()
    }
    data["website"] = ""
  }
  // console.log(data)
  MongoClient.connect(Urls, function (err, db) { // get a connection
    assert.equal(null, err); // use the assert module instead of the earlier if check
    var dbo = db.db("douban");
    dbo.collection("detail").insert(data, function (err, result) { // insert the record into the collection
      assert.equal(null, err);
      console.log(result);
      db.close();
    });
  });
  await pupp.closePage(page)
}

getList()

The code above crawls all of Douban's popular movies, in the following steps:

1. Click "load more" repeatedly until the element is no longer available and an exception is thrown.
2. Once the full list of popular movies has loaded, parse out each movie's detail-page URL and request them one by one.
3. Parse the required data from each detail page.
4. Store the scraped data, here in MongoDB.

I also optimized the browser instantiation above by rewriting it as a singleton. The configuration module:

module.exports = {
  browserOptions: {
    headless: false,
    // args: ['--no-sandbox', '--proxy-server=proxy:abc100@:8995'],
    args: ['--no-sandbox'],
  }
};

The singleton module:

const puppeteer = require("puppeteer");
const config = require('./config');
const deasync = require('deasync'); // needed by wait4Lunch() below

const BROWSER_KEY = Symbol.for('browser');
const BROWSER_STATUS_KEY = Symbol.for('browser_status');

launch(config.browserOptions)
wait4Lunch();

/**
 * Launch the browser and store the instance on the global object
 * @param {*} options launch options passed to puppeteer.launch()
 */
function launch(options = {}) {
  if (!global[BROWSER_STATUS_KEY]) {
    global[BROWSER_STATUS_KEY] = 'lunching';
    puppeteer.launch(options)
      .then((browser) => {
        global[BROWSER_KEY] = browser;
        global[BROWSER_STATUS_KEY] = 'lunched';
      })
      .catch((err) => {
        global[BROWSER_STATUS_KEY] = 'error';
        throw err;
      });
  }
}

function wait4Lunch() {
  while (!global[BROWSER_KEY] && global[BROWSER_STATUS_KEY] == 'lunching') {
    // block until the launch has finished
    deasync.runLoopOnce();
  }
}

module.exports = global[BROWSER_KEY];

That is all for this article. I hope it helps with your learning, and thank you for your support.
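As a footnote to the singleton module, blocking the event loop with deasync can be avoided by caching the launch promise itself, so that every caller awaits the same browser. This is a different technique from the one in the article, sketched under assumptions: `createBrowserSingleton` and `getBrowser` are hypothetical names, and the launch function is injected. In real use it might be something like `() => puppeteer.launch(config.browserOptions)`; the stand-in launcher below exists only so the sketch runs without Puppeteer installed.

```javascript
// Hypothetical helper: wraps any async launch function in a lazy singleton.
function createBrowserSingleton(launchFn) {
  let browserPromise = null; // cached promise shared by all callers

  return function getBrowser() {
    if (!browserPromise) {
      browserPromise = Promise.resolve()
        .then(launchFn)
        .catch((err) => {
          browserPromise = null; // allow a retry after a failed launch
          throw err;
        });
    }
    return browserPromise;
  };
}

// Demo with a stand-in launcher (an assumption, not a real browser).
let launchCount = 0;
const getBrowser = createBrowserSingleton(async () => {
  launchCount += 1;
  return { id: launchCount };
});

// Both calls receive the same promise, so the launcher runs only once.
const first = getBrowser();
const second = getBrowser();
first.then((browser) => {
  console.log('browser ready, launches:', launchCount, 'id:', browser.id);
});
```

The trade-off is that callers must `await getBrowser()` instead of reading a ready-made export, but nothing busy-waits and a failed launch can be retried.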
Publisher: admin. When reposting, please credit the source: http://www.yc00.com/news/1687986000a63978.html