Published on June 27, 2023 (author: )
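Solving GeeTest CAPTCHAs in a Python crawler

Using the 2captcha service to solve the GeeTest CAPTCHA at login

When a crawler scrapes data from websites, a long fight with CAPTCHAs is unavoidable. Getting around them is best when that is possible, but some cannot be avoided. Simple ones you can try to work around; for the difficult ones, hook into a CAPTCHA-solving platform. CAPTCHAs now come in many forms: click-to-select, sliders, combinations of English letters, and so on, so scraping data from some sites keeps getting more troublesome.

This article cracks the CAPTCHA on Jianshu (简书) that asks you to click characters in order.

1. Jianshu's verification is provided by GeeTest (极验).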
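2. It verifies you by having you click the characters in the image in order.
3. We use the third-party service 2captcha.

2captcha is a paid service; paying is what makes breaking the CAPTCHA quick and convenient. Convolutional neural networks (CNNs) have already learned to get past the simplest types of CAPTCHA. Because of that, CAPTCHAs keep being updated and keep getting more complex; in fact, this race between CAPTCHAs and machine learning will never end. Given that, online anti-CAPTCHA services backed by real human solvers are, for now, still ahead of the machine-learning solutions. This article relies only on 2captcha. 2captcha specializes in this kind of recognition work: it puts substantial manpower and resources into all kinds of CAPTCHAs, and its recognition rate is very high, generally above 90%. The price is acceptable too, roughly 3 US dollars for a few hundred solves.

Analyzing the 2captcha parameters

First, a rough look at how 2captcha works. Open the official site. It is a foreign site, so if you cannot read it, just use a page translator. Click sign in at the top right and log in; once logged in, you are taken to the home page automatically. Focus on the circled areas: $15 is the account balance, which is enough for many solves. The API entry at the top is the part we mainly care about, and the circled area below it is the unique key needed later when calling the 2captcha API. Next, read the documentation to see the exact procedure: go into the API section and find the GeeTest entry.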
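Click GeeTest and, if needed, right-click to translate the page. It looks fairly simple: there are five main steps, and the documentation spells them out clearly, so just follow along. Let's try it.

Open the Jianshu site and click Log in to reach the login page. Press F12 and look for three things: gt, challenge, and api_server. Click Network, refresh the page so that all requests reload, press Ctrl+F, and search for challenge. The endpoint named new returns it. The 2captcha documentation, however, says these values can usually be found in initGeetest, so let's try that. Click Elements, press Ctrl+Shift+F for a global search, and search for initGeetest. It is indeed there. Set a breakpoint, refresh again, and check that the values match the ones in Network. With gt and challenge known, api_server comes next. Enter any account and password and click Log in to trigger GeeTest, then search for api_server in Elements.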
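api_server is easy to find. It is a fixed parameter, the value circled in the screenshot. OK, let's request the endpoint:

    import time
    import requests

    commbine_header = {
        "Accept": "application/json, text/plain, */*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Connection": "keep-alive",
        "Host": "",  # host name missing in the source
        "Referer": "/sign_in",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"
    }
    # The site origin is missing in the source; only the path survives.
    commbine_url = "/captchas/new?t=43-mba "
    response_commbine = requests.get(url=commbine_url, headers=commbine_header)
    print(response_commbine.text)

The output is the endpoint's response. Then define the parameters. Note: challenge is dynamic and the others are static; API_KEY is the key from the 2captcha platform.

Request code:

    def getCaptchaResult(challenge):
        # Submit the task; captcha_url is the module-level URL defined below.
        r = requests.get(captcha_url)
        print(r.json())
        rid = r.json().get("request")
        # print(rid, type(rid))
        time.sleep(15)  # give the service time before the first poll
        while True:
            # The 2captcha host and script path are missing in the source.
            re_cpatcha_url = f"/?key={API_KEY}&action=get&id={int(rid)}&json=1"
            # print(re_cpatcha_url)
            r2 = requests.get(re_cpatcha_url)
            print(r2.json())
            if r2.json().get("status") == 1:
                geetest_challenge = r2.json().get("request").get("geetest_challenge")
                geetest_validate = r2.json().get("request").get("geetest_validate")
                geetest_seccode = r2.json().get("request").get("geetest_seccode")
                return geetest_challenge, geetest_validate, geetest_seccode
            time.sleep(5)  # pause between polls instead of looping without delay

    # The end of this URL is truncated in the source (it stops at "&js").
    captcha_url = f"/?key={API_KEY}&method={method}&gt={gt}&challenge={challenge}&pageurl={pageurl}&api_server={api_server}&js"

The run succeeds and returns the values from the 2captcha request.

With every parameter in hand, all that remains is to call Jianshu's login endpoint with them. After some searching in Network, the endpoint below turns up. Entering a wrong password and then solving the CAPTCHA correctly triggers a request to this endpoint, although the login fails because the password is wrong. From this we can conclude that this is the method called back once the CAPTCHA has been solved correctly.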
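The submit URL above is assembled by hand with an f-string, which leaves values such as pageurl unescaped. As a minimal sketch, the query string could instead be built with the standard library's urlencode. The base URL, the helper name build_query_url, and every value below are placeholders invented for illustration; the json entry is an assumption modeled on the polling URL, which does carry json=1.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base_url, params):
    """Append params to base_url as a percent-encoded query string."""
    return f"{base_url}?{urlencode(params)}"


# Placeholder values only: not a real key, site, or endpoint.
example_params = {
    "key": "YOUR_API_KEY",
    "method": "geetest",
    "gt": "example-gt",
    "challenge": "example-challenge",
    "pageurl": "https://site.example/sign_in",
    "api_server": "api.example",
    "json": 1,  # assumption: ask for a JSON reply, as the polling URL does
}

example_url = build_query_url("https://solver.example/submit", example_params)
print(example_url)

# Round trip: the encoded pageurl decodes back to the original value.
print(parse_qs(urlsplit(example_url).query)["pageurl"])
```

Building the query from a dict also avoids the kind of damage visible in the source, where "&gt=" can be mangled when the text passes through HTML.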
This looks like a plain form submission, and inspecting the page's element code confirms that it is. Now that every parameter is available, send a request with them and see what happens:

    # Request function
    def login_v2():
        # Request headers copied from the browser
        login_v2_header = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive",
            "Host": "",  # host name missing in the source
            "Referer": "/sessions",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "same-origin",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"
        }
        login_v2_url = "/sessions"  # site origin missing in the source
        # The login form is submitted as a POST with form-encoded data.
        r1 = requests.post(login_v2_url, headers=login_v2_header, data=login_v2_dict)

    if __name__ == '__main__':
        username = "177*****72"  # your user name
        password = "******"  # your password
        # All request parameters, taken from Form Data in the browser
        login_v2_dict = {
            "authenticity_token": "c3bzFTdK9fogAI8v=k9A9+WKK45iRh67CplbMqweXLvVHor4U=ShNIvNOOlShByTXkgcqfsIg==",
            "session[email_or_mobile_number]": username,
            "session[password]": password,
            "session[oversea]": False,
            "session[remember_me]": True,
            # Filled in from 2captcha
            "captcha[validation][challenge]": "",
            "captcha[validation][gt]": gt,
            "captcha[validation][validate]": "",
            "captcha[validation][seccode]": ""
        }
        v2_gt, challenge = getChallengeAndGt()
        geetest_challenge, geetest_validate, geetest_seccode = getCaptchaResult(challenge)
        login_v2_dict["captcha[validation][challenge]"] = geetest_challenge
        login_v2_dict["captcha[validation][validate]"] = geetest_validate
        login_v2_dict["captcha[validation][seccode]"] = geetest_seccode
        print(login_v2_dict)
        login_v2()

    # Header entry also captured from the browser (cut off in the source):
    # "Cookie": "__yadk_uid=vQPJojw7T; read_mode=day; default_font=font2; locale=zh-CN; web_login_version=MTU3ODcxOTQ5OA%3D%3D--5bde69ffd82246
Execution result
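The run above uses an authenticity_token pasted by hand from the browser, which goes stale quickly. As a rough sketch, the token could be read out of the login page with the standard-library HTML parser. This assumes the token sits in an input element whose name attribute is authenticity_token, which matches the form field name but is not confirmed by the source; the class and function names and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenInputFinder(HTMLParser):
    """Record the value of the first <input> whose name matches field_name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        # Only inspect input tags, and stop looking after the first match.
        if tag != "input" or self.value is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == self.field_name:
            self.value = attributes.get("value")


def extract_input_value(html, field_name):
    """Return the matching input's value, or None when no such input exists."""
    finder = HiddenInputFinder(field_name)
    finder.feed(html)
    return finder.value


# Made-up page fragment standing in for the real login page.
sample_html = (
    '<form action="/sessions" method="post">'
    '<input type="hidden" name="authenticity_token" value="abc123==">'
    '</form>'
)
print(extract_input_value(sample_html, "authenticity_token"))
```

In real use the html argument would be the text of the login page response, and a None result should be handled before the login request is sent.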
As for what to do once the CAPTCHA is broken and what it is good for: first of all, scraping data, since many sites now require a login; beyond that, automatic login, bulk registration, and so on. You can try it for yourself.

Complete code

    import re
    import time

    import requests

    API_KEY = "841545********741577a93e1d0b"
    method = "geetest"
    gt = "ec476419********681a247db3c92e"
    # challenge = "4ad9c6d********d8584f9b35769b0e"
    pageurl = "/sign_in"  # site origin missing in the source
    api_server = ""  # value missing in the source


    def getCsrfToken():
        headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
        }
        id_name = "authenticity_token"
        response = requests.get(pageurl, headers=headers, verify=False)
        # Match the whole tag that carries name="authenticity_token".
        patt_id_tag = """<[^>]*name=['"]?""" + id_name + """['" ][^>]*>"""
        id_tag = re.findall(patt_id_tag, response.text, re.DOTALL | re.IGNORECASE)
        if id_tag:
            id_tag = id_tag[0]
            # Take the text between the quotes that follow value=.
            one = id_tag.split("value=")[1].split('"')
            return one[1]
        # Alternative left commented out in the source: parse the page with BeautifulSoup.


    def getChallengeAndGt():
        commbine_header = {
            "Accept": "application/json, text/plain, */*",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive",
            "Host": "",  # host name missing in the source
            "Referer": "/sign_in",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-origin",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"
        }
        commbine_url = "/captchas/new?t=43-mba "
        response_commbine = requests.get(url=commbine_url, headers=commbine_header)
        print(response_commbine.text)
        gt = response_commbine.json().get("gt")
        challenge = response_commbine.json().get("challenge")
        return gt, challenge


    def getCaptchaResult(challenge):
        # Submit the task; captcha_url is built in the main block before this call.
        r = requests.get(captcha_url)
        rid = r.json().get("request")
        time.sleep(15)  # give the service time before the first poll
        while True:
            # The 2captcha host and script path are missing in the source.
            re_cpatcha_url = f"/?key={API_KEY}&action=get&id={int(rid)}&json=1"
            r2 = requests.get(re_cpatcha_url)
            print(r2.json())
            if r2.json().get("status") == 1:
                geetest_challenge = r2.json().get("request").get("geetest_challenge")
                geetest_validate = r2.json().get("request").get("geetest_validate")
                geetest_seccode = r2.json().get("request").get("geetest_seccode")
                return geetest_challenge, geetest_validate, geetest_seccode
            time.sleep(5)  # pause between polls instead of looping without delay


    def login_v2():
        login_v2_header = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive",
            "Host": "",  # host name missing in the source
            "Referer": "/sessions",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "same-origin",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"
        }
        login_v2_url = "/sessions"
        rust = requests.post(login_v2_url, headers=login_v2_header, data=login_v2_dict)
        print(rust)
        print(rust.cookies)
        print(rust.cookies.get_dict())
        # print(rust.text)


    if __name__ == '__main__':
        username = "177******72"
        password = "z2********9"
        login_v2_dict = {
            "utf8": "✓",
            "authenticity_token": "",
            "session[email_or_mobile_number]": username,
            "session[password]": password,
            "session[oversea]": False,
            # Filled in from 2captcha
            "captcha[validation][challenge]": "",
            "captcha[validation][gt]": gt,
            "captcha[validation][validate]": "",
            "captcha[validation][seccode]": "",
            "session[remember_me]": True
        }
        v2_gt, challenge = getChallengeAndGt()
        # Build the submit URL before getCaptchaResult reads it, and solve only once.
        # The end of this URL is truncated in the source (it stops at "&js").
        captcha_url = f"/?key={API_KEY}&method={method}&gt={gt}&challenge={challenge}&pageurl={pageurl}&api_server={api_server}&js"
        geetest_challenge, geetest_validate, geetest_seccode = getCaptchaResult(challenge)
        token = getCsrfToken()
        login_v2_dict["authenticity_token"] = token
        login_v2_dict["captcha[validation][challenge]"] = geetest_challenge
        login_v2_dict["captcha[validation][validate]"] = geetest_validate
        login_v2_dict["captcha[validation][seccode]"] = geetest_seccode
        print(login_v2_dict)
        login_v2()
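The polling loop in getCaptchaResult runs forever if the task never completes. A bounded variant might look like the sketch below, which separates the waiting logic from the HTTP call so it can be tried without a network. The function name poll_until_ready, its parameters, and the stand-in replies are all hypothetical; only the convention that a reply with status equal to 1 carries the answer under request is taken from the code above.

```python
import time


def poll_until_ready(fetch, interval=5.0, timeout=120.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until its reply has status == 1, then return reply["request"].

    fetch is any zero-argument callable returning a dict, for example a
    function that requests the polling URL and decodes the JSON body.
    Raises TimeoutError once timeout seconds have passed without success.
    """
    deadline = clock() + timeout
    while True:
        reply = fetch()
        if reply.get("status") == 1:
            return reply["request"]
        if clock() >= deadline:
            raise TimeoutError("no solved reply before the timeout expired")
        sleep(interval)


# Stand-in replies instead of real HTTP calls: one pending, then one solved.
fake_replies = iter([
    {"status": 0, "request": "NOT_READY"},
    {"status": 1, "request": {"geetest_challenge": "c",
                              "geetest_validate": "v",
                              "geetest_seccode": "s"}},
])

# sleep is replaced with a no-op so the demonstration returns immediately.
solved = poll_until_ready(lambda: next(fake_replies), sleep=lambda seconds: None)
print(solved["geetest_validate"])
```

Passing sleep and clock as parameters is a design choice for testability; in real use the defaults apply and fetch would wrap the request to the polling URL.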
Publisher: admin. Please credit the source when reposting: http://www.yc00.com/web/1687866486a52118.html