10 - CAPTCHA: Chinese Character Recognition and Click Selection

Date: 2024-02-11 09:33:05
Tags: click-select, 10, Chinese, bg, word, driver, tag, import, div

1. Capture the images

# @Course    : Web Scraping and Reverse Engineering in Practice (爬虫逆向实战课)
# @Instructor: 武沛齐
# @Materials : wupeiqi666

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait

service = Service("driver/chromedriver.exe")
driver = webdriver.Chrome(service=service)

# 1. Open the demo page
driver.get('https://www.geetest.com/adaptive-captcha-demo')

# 2. Select the text-click CAPTCHA type ("文字点选验证")
tag = WebDriverWait(driver, 30, 0.5).until(lambda dv: dv.find_element(
    By.XPATH,
    '//*[@id="gt-showZh-mobile"]/div/section/div/div[2]/div[1]/div[2]/div[3]/div[4]'
))
tag.click()

# 3. Click the button to start verification
tag = WebDriverWait(driver, 30, 0.5).until(lambda dv: dv.find_element(
    By.CLASS_NAME,
    'geetest_btn_click'
))
tag.click()

# 4. Wait for the CAPTCHA to load
time.sleep(5)

# Prompt image: the characters that must be clicked, in order
target_tag = driver.find_element(
    By.CLASS_NAME,
    'geetest_ques_back'
)
target_tag.screenshot("target.png")

# Background image that contains the characters
bg_tag = driver.find_element(
    By.CLASS_NAME,
    'geetest_bg'
)
bg_tag.screenshot("bg.png")

time.sleep(2000)
driver.close()

2. Recognize the target characters

Take a screenshot of each prompt character and recognize it with ddddocr.

import time
import ddddocr
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait

service = Service("driver/chromedriver.exe")
driver = webdriver.Chrome(service=service)

# 1. Open the demo page
driver.get('https://www.geetest.com/adaptive-captcha-demo')

# 2. Select the text-click CAPTCHA type ("文字点选验证")
tag = WebDriverWait(driver, 30, 0.5).until(lambda dv: dv.find_element(
    By.XPATH,
    '//*[@id="gt-showZh-mobile"]/div/section/div/div[2]/div[1]/div[2]/div[3]/div[4]'
))
tag.click()

# 3. Click the button to start verification
tag = WebDriverWait(driver, 30, 0.5).until(lambda dv: dv.find_element(
    By.CLASS_NAME,
    'geetest_btn_click'
))
tag.click()

# 4. Wait for the CAPTCHA to load
time.sleep(5)

# 5. Recognize the prompt characters
target_word_list = []
parent = driver.find_element(By.CLASS_NAME, 'geetest_ques_back')
tag_list = parent.find_elements(By.TAG_NAME, "img")

# Create the OCR engine once and reuse it for every character
ocr = ddddocr.DdddOcr(show_ad=False)
for tag in tag_list:
    word = ocr.classification(tag.screenshot_as_png)
    target_word_list.append(word)

print("Characters to find:", target_word_list)

time.sleep(2000)
driver.close()

3. Locate the characters in the background

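Recognize the characters in the background image and get the coordinates of each one (they must later be clicked in the order given by the prompt).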
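
The ordering requirement can be sketched as a small pure function. This is a minimal sketch rather than the author's code: `ordered_click_points` is a hypothetical helper, and it assumes the recognition results have already been collected into a dict that maps each character to its (x, y) position, as the later steps do.

```python
def ordered_click_points(target_words, word_positions):
    """Return the click points in prompt order.

    Returns None if any prompt character was not found in the background,
    so the caller can decide to retry instead of clicking a partial answer.
    """
    points = []
    for word in target_words:
        if word not in word_positions:
            return None
        points.append(word_positions[word])
    return points


# The dict order does not matter; the prompt order decides the click order.
positions = {'的': (86, 73), '粉': (111, 38), '菜': (40, 49), '香': (198, 101)}
print(ordered_click_points(['粉', '菜', '香'], positions))
```

Returning `None` when a character is missing is a design choice of this sketch; skipping the missing character is another option, but it yields an incomplete answer.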

3.1 ddddocr

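ddddocr can recognize the characters, but the default model's recognition rate is fairly low. To improve it, you can set up a PyTorch environment and train your own model; see https://github.com/sml2h3/dddd_trainer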

import time
import ddddocr
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait
from PIL import Image
from io import BytesIO

service = Service("driver/chromedriver.exe")
driver = webdriver.Chrome(service=service)

# 1. Open the demo page
driver.get('https://www.geetest.com/adaptive-captcha-demo')

# 2. Select the text-click CAPTCHA type ("文字点选验证")
tag = WebDriverWait(driver, 30, 0.5).until(lambda dv: dv.find_element(
    By.XPATH,
    '//*[@id="gt-showZh-mobile"]/div/section/div/div[2]/div[1]/div[2]/div[3]/div[4]'
))
tag.click()

# 3. Click the button to start verification
tag = WebDriverWait(driver, 30, 0.5).until(lambda dv: dv.find_element(
    By.CLASS_NAME,
    'geetest_btn_click'
))
tag.click()

# 4. Wait for the CAPTCHA to load
time.sleep(5)

# 5. Recognize the prompt characters
target_word_list = []
parent = driver.find_element(By.CLASS_NAME, 'geetest_ques_back')
tag_list = parent.find_elements(By.TAG_NAME, "img")
# Create the OCR engine once and reuse it for every character
ocr = ddddocr.DdddOcr(show_ad=False)
for tag in tag_list:
    word = ocr.classification(tag.screenshot_as_png)
    target_word_list.append(word)

print("Characters to find:", target_word_list)

# 6. Screenshot the background image
bg_tag = driver.find_element(
    By.CLASS_NAME,
    'geetest_bg'
)
content = bg_tag.screenshot_as_png

# 7. Detect every character in the background and get its bounding box
det = ddddocr.DdddOcr(show_ad=False, det=True)
poses = det.detection(content)  # [[x1, y1, x2, y2], [x1, y1, x2, y2], ...]

# 8. Crop each detected box and recognize the character inside it
bg_word_dict = {}
img = Image.open(BytesIO(content))
ocr = ddddocr.DdddOcr(show_ad=False)  # created once, reused for every box

for box in poses:
    x1, y1, x2, y2 = box
    # Crop the single character out of the background
    cropped = img.crop(box)
    img_byte = BytesIO()
    cropped.save(img_byte, 'png')
    # Recognize the character (the default model's accuracy is low here)
    word = ocr.classification(img_byte.getvalue())

    # Store the center of each character's box, e.g. {"鸭": [x, y]}
    bg_word_dict[word] = [int((x1 + x2) / 2), int((y1 + y2) / 2)]

print(bg_word_dict)

time.sleep(1000)
driver.close()

3.2 CAPTCHA-solving platform

https://www.chaojiying.com/

import base64
import requests
from hashlib import md5

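# Read the image to recognize; the with-block closes the file afterwards
with open('5.jpg', 'rb') as f:
    file_bytes = f.read()

res = requests.post(
    url='http://upload.chaojiying.net/Upload/Processing.php',
    data={
        'user': "wupeiqi",  # replace with your own account name
        'pass2': md5("your-password".encode('utf-8')).hexdigest(),  # MD5 of your account password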
        'codetype': "9501",
        'file_base64': base64.b64encode(file_bytes)
    },
    headers={
        'Connection': 'Keep-Alive',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
    }
)

res_dict = res.json()
print(res_dict)
# {'err_no': 0, 'err_str': 'OK', 'pic_id': '1234612060701120002', 'pic_str': '的,86,73|粉,111,38|菜,40,49|香,198,101', 'md5': 'faac71fc832b2ead01ffb4e813f3be60'}
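
The `pic_str` field in the sample response above packs each character and its coordinates as `character,x,y`, separated by `|`. A minimal sketch of turning it into a lookup table follows. The function name `parse_pic_str` is hypothetical, and the format is inferred only from that sample response, not from the platform's documentation.

```python
def parse_pic_str(pic_str):
    """Parse 'char,x,y|char,x,y|...' into {char: (x, y)} with integer coordinates."""
    positions = {}
    for item in pic_str.split("|"):
        if not item:
            # Skip empty segments, e.g. when pic_str is an empty string
            continue
        word, x, y = item.split(",")
        positions[word] = (int(x), int(y))
    return positions


sample = '的,86,73|粉,111,38|菜,40,49|香,198,101'
print(parse_pic_str(sample))
```

Converting the coordinates to `int` at parse time avoids repeating the conversion wherever they are used later.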

Combined with the GeeTest example (screenshot + recognition):

import time
import base64
import ddddocr
import requests
from hashlib import md5
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait

service = Service("driver/chromedriver.exe")
driver = webdriver.Chrome(service=service)

# 1. Open the demo page
driver.get('https://www.geetest.com/adaptive-captcha-demo')

# 2. Select the text-click CAPTCHA type ("文字点选验证")
tag = WebDriverWait(driver, 30, 0.5).until(lambda dv: dv.find_element(
    By.XPATH,
    '//*[@id="gt-showZh-mobile"]/div/section/div/div[2]/div[1]/div[2]/div[3]/div[4]'
))
tag.click()

# 3. Click the button to start verification
tag = WebDriverWait(driver, 30, 0.5).until(lambda dv: dv.find_element(
    By.CLASS_NAME,
    'geetest_btn_click'
))
tag.click()

# 4. Wait for the CAPTCHA to load
time.sleep(5)

# 5. Recognize the prompt characters
target_word_list = []
parent = driver.find_element(By.CLASS_NAME, 'geetest_ques_back')
tag_list = parent.find_elements(By.TAG_NAME, "img")
# Create the OCR engine once and reuse it for every character
ocr = ddddocr.DdddOcr(show_ad=False)
for tag in tag_list:
    word = ocr.classification(tag.screenshot_as_png)
    target_word_list.append(word)

print("Characters to find:", target_word_list)

# 6. Screenshot the background image
bg_tag = driver.find_element(
    By.CLASS_NAME,
    'geetest_bg'
)
content = bg_tag.screenshot_as_png
bg_tag.screenshot("bg.png")

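# 7. Send the background image to the platform to get every character and its coordinates
res = requests.post(
    url='http://upload.chaojiying.net/Upload/Processing.php',
    data={
        'user': "wupeiqi",  # replace with your own account name
        'pass2': md5("your-password".encode('utf-8')).hexdigest(),  # MD5 of your account password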
        'codetype': "9501",
        'file_base64': base64.b64encode(content)
    },
    headers={
        'Connection': 'Keep-Alive',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
    }
)

res_dict = res.json()
print(res_dict)

# 8. Map each character to its coordinates, e.g. {"鸭": (196, 85), ...}
#    (target_word_list would be e.g. ["花", "鸭", "字"])
bg_word_dict = {}
for item in res_dict["pic_str"].split("|"):
    word, x, y = item.split(",")
    bg_word_dict[word] = (int(x), int(y))
    
print(bg_word_dict)

time.sleep(1000)
driver.close()

4. Click by coordinates

Click on the CAPTCHA image at the recognized coordinates.

ActionChains(driver).move_to_element_with_offset(element, xoffset=x, yoffset=y).click().perform()
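
The script below subtracts half the element's width and height from each coordinate before clicking. This matches Selenium versions where `move_to_element_with_offset` measures the offset from the element's center, while the recognition step reports positions from the image's top-left corner. The conversion can be sketched on its own as follows. `to_center_offset` is a hypothetical helper, and the `scale` parameter is an assumption for displays where the screenshot has more pixels than the element's reported size; it is not part of the original code.

```python
def to_center_offset(x, y, width, height, scale=1.0):
    """Convert a top-left-based image coordinate to an offset from the element's center.

    x, y          : position in the screenshot, measured from its top-left corner
    width, height : element size as reported by the browser (element.size)
    scale         : screenshot pixels per element pixel (assumed 1.0 unless the
                    screenshot is larger than the element size)
    """
    element_x = x / scale
    element_y = y / scale
    return round(element_x - width / 2), round(element_y - height / 2)


# A point at the exact middle of a 300x200 element maps to offset (0, 0).
print(to_center_offset(150, 100, 300, 200))
# A point near the top-left corner maps to a negative offset.
print(to_center_offset(10, 10, 300, 200))
```

If clicks land consistently off-target, comparing the saved screenshot's pixel size with `element.size` is a quick way to check whether a `scale` other than 1.0 is needed.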

import time
import base64
import ddddocr
import requests
from hashlib import md5
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver import ActionChains

service = Service("driver/chromedriver.exe")
driver = webdriver.Chrome(service=service)

# 1. Open the demo page
driver.get('https://www.geetest.com/adaptive-captcha-demo')

# 2. Select the text-click CAPTCHA type ("文字点选验证")
tag = WebDriverWait(driver, 30, 0.5).until(lambda dv: dv.find_element(
    By.XPATH,
    '//*[@id="gt-showZh-mobile"]/div/section/div/div[2]/div[1]/div[2]/div[3]/div[4]'
))
tag.click()

# 3. Click the button to start verification
tag = WebDriverWait(driver, 30, 0.5).until(lambda dv: dv.find_element(
    By.CLASS_NAME,
    'geetest_btn_click'
))
tag.click()

# 4. Wait for the CAPTCHA to load
time.sleep(5)

# 5. Recognize the prompt characters
target_word_list = []
parent = driver.find_element(By.CLASS_NAME, 'geetest_ques_back')
tag_list = parent.find_elements(By.TAG_NAME, "img")
# Create the OCR engine once and reuse it for every character
ocr = ddddocr.DdddOcr(show_ad=False)
for tag in tag_list:
    word = ocr.classification(tag.screenshot_as_png)
    target_word_list.append(word)

print("Characters to find:", target_word_list)

# 6. Screenshot the background image
bg_tag = driver.find_element(
    By.CLASS_NAME,
    'geetest_bg'
)
content = bg_tag.screenshot_as_png

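# bg_tag.screenshot("bg.png")  # optionally save the image for debugging

# 7. Send the background image to the platform to get every character and its coordinates
res = requests.post(
    url='http://upload.chaojiying.net/Upload/Processing.php',
    data={
        'user': "wupeiqi",  # replace with your own account name
        'pass2': md5("your-password".encode('utf-8')).hexdigest(),  # MD5 of your account password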
        'codetype': "9501",
        'file_base64': base64.b64encode(content)
    },
    headers={
        'Connection': 'Keep-Alive',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
    }
)

res_dict = res.json()

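# Map each character to its (x, y) coordinates, measured from the image's top-left corner
bg_word_dict = {}
for item in res_dict["pic_str"].split("|"):
    word, x, y = item.split(",")
    bg_word_dict[word] = (int(x), int(y))

print(bg_word_dict)
# Example values:
# target_word_list = ['粉', '菜', '香']
# bg_word_dict = {'粉': (10, 10), '菜': (50, 50), '香': (100, 93)}

# 8. Click the characters in prompt order
for word in target_word_list:
    time.sleep(2)
    group = bg_word_dict.get(word)
    if not group:
        # Character not found in the background; skip it
        continue
    x, y = group
    # The offset is measured from the element's center here,
    # so shift the top-left-based coordinates by half the element size
    x = x - int(bg_tag.size['width'] / 2)
    y = y - int(bg_tag.size['height'] / 2)
    ActionChains(driver).move_to_element_with_offset(bg_tag, xoffset=x, yoffset=y).click().perform()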

time.sleep(1000)
driver.close()

From: https://www.cnblogs.com/fuminer/p/18013176
