Without further ado, here is the code!
import os
import random
import threading
import time
import uuid
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty

import requests
from lxml import etree
from tqdm import tqdm

# Queue shared between the page-parsing producers and the download consumers
q = Queue(maxsize=300)
# Thread pool used to parse the search-result pages
pool = ThreadPoolExecutor(max_workers=10)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/105.0.0.0 Safari/537.36'
}


# Producer (runs in the thread pool): parse a search-result page and put
# each wallpaper's image URL and title into the queue.
def utl_image(url, cursor):
    # 1. Fetch the search-result page
    resp = requests.get(url, headers=headers)
    resp.encoding = resp.apparent_encoding
    page_content = resp.text
    # 2. Parse it and collect the links to the wallpaper detail pages
    tree = etree.HTML(page_content)
    hrefs = tree.xpath('//a[@class="preview"]/@href')
    print(f'Found {len(hrefs)} wallpapers on this page.')
    for a_href in tqdm(hrefs, desc='hrefs -> queue'):
        time.sleep(random.randint(1, 5))  # throttle the requests
        resp_ah = requests.get(str(a_href), headers=headers)
        resp_ah.encoding = resp_ah.apparent_encoding
        # Parse the detail page and pull out the full-size image element
        tre = etree.HTML(resp_ah.text)
        img_list = tre.xpath('//img[@id="wallpaper"]')
        print(f'Worker {cursor}, a_href: {a_href}, img_list: {img_list}')
        for img in img_list:
            img_src = img.xpath('./@src')[0]
            img_alt = img.xpath('./@alt')[0]
            q.put([img_src, img_alt])


# Consumer (runs in plain threads): take an image URL and title from the queue and save the file.
def image_save():
    thread_name = uuid.uuid1()
    count = 1
    path = 'wallhaven'
    os.makedirs(path, exist_ok=True)
    while True:
        print(f'Queue [consume], messages currently queued: {q.qsize()}')
        print(f'Thread {thread_name}, consuming message #{count}')
        try:
            # Stop once the queue has stayed empty for a while (the producers are done)
            img_src, img_alt = q.get(timeout=180)
        except Empty:
            break
        # Download the image itself
        r = requests.get(url=str(img_src), headers=headers)
        if r.status_code == 200:
            print(f'{img_alt} download [start]')
            with open(f'{path}/{img_alt[:40]}.jpg', 'wb') as f:
                f.write(r.content)
            print(f'{img_alt} download [ok]')
        time.sleep(random.randint(1, 3))  # throttle the downloads
        count += 1


if __name__ == '__main__':
    # 1. Build the list of search-result pages to crawl
    url = 'https://wallhaven.cc/search?categories=001&purity=100&ratios=landscape&topRange=1y&sorting=toplist&order=desc&ai_art_filter=1&page='
    url_list = [f'{url}{i}' for i in range(1, 2)]  # page range: start inclusive, end exclusive
    i = 1
    for ul in tqdm(url_list, desc='thread pool'):
        print(f'ul: {ul}')
        # Parse each page in the thread pool
        pool.submit(utl_image, ul, i)
        time.sleep(1)
        i = i + 1
    # Download with plain threads; the range value is the number of consumer threads
    for _ in range(1):
        t = threading.Thread(target=image_save)
        t.start()
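The script is a small producer-consumer pipeline: the ThreadPoolExecutor workers parse the search pages and push (image URL, title) pairs into the Queue, while plain threads pop from it and write the files to disk. One detail worth knowing is that the consumers only stop when q.get(timeout=180) times out. If you prefer an explicit shutdown, a common alternative is to push a sentinel object into the queue once the producers are done. The snippet below is only a minimal sketch of that idea, separate from the script above; STOP, consumer, and the sample item list are made-up names used purely for illustration.

import threading
from queue import Queue

STOP = object()  # sentinel marking "no more work" (illustrative name)

def consumer(q: Queue):
    while True:
        item = q.get()
        if item is STOP:
            q.put(STOP)  # pass the sentinel on so sibling consumers also exit
            break
        print(f'downloading {item}')  # placeholder for the real download/save logic

if __name__ == '__main__':
    q = Queue(maxsize=300)
    workers = [threading.Thread(target=consumer, args=(q,)) for _ in range(3)]
    for w in workers:
        w.start()
    for item in ['img_1.jpg', 'img_2.jpg', 'img_3.jpg']:  # stands in for the parsed image URLs
        q.put(item)
    q.put(STOP)  # signal shutdown once all producers have finished
    for w in workers:
        w.join()

Because each consumer puts the sentinel back before exiting, a single STOP value is enough to shut down any number of consumer threads.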
By the way:
4K-8K resources are shared here: https://www.cnblogs.com/kukuDF/p/15989961.html