
Scraping phone wallpapers with Python

Posted: 2023-09-27 15:24:30  Views: 43

Tags: img, python, scraping, url, urls, time, wallpaper, page, out

Just playing around out of boredom. This is still a long way from a fully working scraper, and the code is messy and could be tidied up; I'm writing it down now that it exists, and I'll revise it when I get the chance.

import requests
import os
from bs4 import BeautifulSoup
import urllib3  # requests.packages.urllib3 is deprecated; import urllib3 directly
import random
import threading
import time

urllib3.disable_warnings()


start_page = 1
end_page = 1

if not os.path.exists("gq_sjbz"):
    os.makedirs("gq_sjbz")

base_url = "https://www.3gbizhi.com/sjbz/index_{}.html"


# URLs whose download failed or timed out; they are retried later.
# (list.append/pop are atomic in CPython, so the worker threads can share this list)
time_out_urls = []

def crawl_page(page):
    url = base_url.format(page)
    try:
        user_agent = random.choice(
            ['Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
             'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36',
             'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36'])
        # If you want a proxy IP pool, buy your own
        headers = {'User-Agent': user_agent}
        resp = requests.get(url, headers=headers, verify=False, timeout=20)
    except requests.exceptions.RequestException as e:  # base class; also catches timeouts and connection errors
        print(f"HTTP request failed: {e}")
        return

    soup = BeautifulSoup(resp.text, 'html.parser')
    ul_element = soup.select("div.contlistw ul.cl")

    for ul in ul_element:
        a_href_s = ul.find_all('a', href=True)
        for a_href in a_href_s:
            href = a_href['href']
            resp2 = requests.get(href, headers=headers, verify=False, timeout=20)
            soup2 = BeautifulSoup(resp2.text, 'html.parser')

            # TODO: this only downloads the preview image shown on the page, not the
            #  full-resolution original; fetching the original is left for another day
            # TODO: the hard part is that the original has to be fetched through an API:
            #  {
            #       "file": "/api/user/imageDownload?downconfig=e03UPLG76erry5Fo6ZT7Zw%3D%3D3gbizhiComgV1S3%2BO8DlxWKbNOuZ7BLw%3D%3D&op=file&picnum=1&captcha=267",
            #        "zip": "/api/user/imageDownload?downconfig=e03UPLG76erry5Fo6ZT7Zw%3D%3D3gbizhiComgV1S3%2BO8DlxWKbNOuZ7BLw%3D%3D&op=zip&picnum=1&captcha=267"
            #  }
            #  and the captcha / human-verification check has to be bypassed

            ul_element2 = soup2.select("div.img-table img#contpic")
            for ul2 in ul_element2:
                img_url = ul2['src']
                img_name = os.path.basename(img_url)
                try:
                    img_resp = requests.get(img_url, verify=False, timeout=3)
                    if img_resp.status_code == 200:
                        with open("gq_sjbz/" + img_name, "wb") as img_file:
                            img_file.write(img_resp.content)
                        print(f"Downloaded image: {img_url}    timed out so far: {len(time_out_urls)}")
                    else:
                        time_out_urls.append(img_url)
                except requests.exceptions.RequestException:  # don't use a bare except
                    time_out_urls.append(img_url)
    print(f"Finished crawling page {page}")


# Crawl a consecutive range of pages
def crawl_batch(start, end):
    for page in range(start, end + 1):
        crawl_page(page)


# Number of pages handled per batch (one thread per batch)
batch_size = 20


# Crawl the pages in batches
def run_threads():
    threads = []
    for batch_start in range(start_page, end_page + 1, batch_size):
        batch_end = min(batch_start + batch_size - 1, end_page)
        thread = threading.Thread(target=crawl_batch, args=(batch_start, batch_end))
        threads.append(thread)
        thread.start()

    # Wait for all threads to finish
    for thread in threads:
        thread.join()


# Start the worker threads
run_threads()


def time_out_urls_download(max_download_time):
    start_time = time.time()  # record when retrying started
    while len(time_out_urls) > 0:
        img_url = time_out_urls.pop(0)  # take the first URL off the retry list
        img_name = os.path.basename(img_url)
        try:
            img_resp = requests.get(img_url, verify=False, timeout=3)
            if img_resp.status_code == 200:
                with open("gq_sjbz/" + img_name, "wb") as img_file:
                    img_file.write(img_resp.content)
                print(f"Downloaded image: {img_url}    remaining: {len(time_out_urls)}")
            else:
                time_out_urls.append(img_url)
        except requests.exceptions.RequestException:
            time_out_urls.append(img_url)

        elapsed_time = time.time() - start_time  # time spent retrying so far
        if elapsed_time > max_download_time:
            print(time_out_urls)
            break  # stop retrying once the time limit is exceeded

# Maximum time to spend retrying, in seconds
max_download_time = 1200  # i.e. 20 minutes
# Kick off the retry downloads
time_out_urls_download(max_download_time)

print("Crawling finished ------------------------------------------------------------------------------")
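As an aside (not part of the original script): instead of hand-rolling the `time_out_urls` retry list, `requests` can retry transient failures by itself if you mount urllib3's `Retry` on a `Session`. A minimal sketch:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Session whose transport adapter retries transient failures automatically
session = requests.Session()
retry = Retry(
    total=3,                              # up to 3 retries per request
    backoff_factor=0.5,                   # exponential backoff between attempts
    status_forcelist=[500, 502, 503, 504],  # also retry these server errors
)
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
session.mount("http://", adapter)

# session.get(img_url, timeout=3) would now retry on its own,
# shrinking the need for a manual retry pass afterwards
```

This only covers per-request retries with backoff; a URL that fails all attempts would still need to be recorded somewhere if you want a final sweep like the one above.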

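Likewise, the manual batch bookkeeping in `run_threads` could be replaced with `concurrent.futures.ThreadPoolExecutor` from the standard library, which handles distributing the pages and joining the workers. A sketch with `crawl_page` stubbed out (the real one from the script would be dropped in):

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_page(page):
    # stand-in for the real crawl_page above; just echoes the page number
    return page

start_page, end_page = 1, 40

# map() spreads the pages across a fixed pool of workers, and the
# with-block joins them all, replacing the batch/start/join logic
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(crawl_page, range(start_page, end_page + 1)))

print(f"crawled {len(results)} pages")
```

A fixed worker count also caps how many concurrent requests hit the site, which the batch-per-thread scheme does not.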

From: https://www.cnblogs.com/Airgity/p/17732794.html
