- Requirement: scrape the movie box-office rankings for 1996-2023.
- First we scrape a single year's data, then loop to fetch every year in turn. In testing this took 32 seconds. The code is as follows:
import requests
from lxml import etree
import time

# Clean up a list of text fragments: some ranking cells end with a blank line, some do not
def str_tool(lst):
    if lst:
        s = ''.join(lst)
        return s.strip()
    else:
        return ""

# Fetch the data for one year
def get_info(year):
    url = f"http://www.boxofficecn.com/boxoffice{year}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
    }
    res = requests.get(url=url, headers=headers)
    tree = etree.HTML(res.text)
    trs = tree.xpath("//table/tbody/tr")[1:]
    f = open(f'data/{year}.csv', 'w', encoding='utf-8')
    for tr in trs:
        num = tr.xpath("./td[1]//text()")
        year = tr.xpath("./td[2]//text()")
        name = tr.xpath("./td[3]//text()")
        money = tr.xpath("./td[4]//text()")
        num = str_tool(num)
        year = str_tool(year)
        name = str_tool(name)
        money = str_tool(money)
        f.write(f'{num},{year},{name},{money}\n')
    f.close()

if __name__ == '__main__':
    start = time.time()
    for i in range(1996, 2024):
        get_info(year=i)
    end = time.time()
    print(f"Time taken with a single-threaded loop: {end - start}")  # 32.553248167037964
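As a quick sanity check, str_tool's handling of the cases the comment describes (a cell with a trailing blank line, a clean cell, and an empty XPath match) can be exercised on its own, with no scraping required:

```python
def str_tool(lst):
    # Join the XPath text fragments and strip surrounding whitespace;
    # an empty match list maps to an empty string.
    if lst:
        return ''.join(lst).strip()
    return ""

print(str_tool(["1", "\n"]))   # fragment plus trailing blank line -> "1"
print(str_tool(["Titanic"]))   # clean single fragment -> "Titanic"
print(repr(str_tool([])))      # empty XPath result -> ''
```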
- Next we scrape the same data using a thread pool. The code is as follows:
import requests
from lxml import etree
import time
from multiprocessing.dummy import Pool

# Clean up a list of text fragments: some ranking cells end with a blank line, some do not
def str_tool(lst):
    if lst:
        s = ''.join(lst)
        return s.strip()
    else:
        return ""

# Fetch the data for one year
def get_info(year):
    url = f"http://www.boxofficecn.com/boxoffice{year}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
    }
    res = requests.get(url=url, headers=headers)
    tree = etree.HTML(res.text)
    trs = tree.xpath("//table/tbody/tr")[1:]
    f = open(f'data/{year}.csv', 'w', encoding='utf-8')
    for tr in trs:
        num = tr.xpath("./td[1]//text()")
        year = tr.xpath("./td[2]//text()")
        name = tr.xpath("./td[3]//text()")
        money = tr.xpath("./td[4]//text()")
        num = str_tool(num)
        year = str_tool(year)
        name = str_tool(name)
        money = str_tool(money)
        f.write(f'{num},{year},{name},{money}\n')
    f.close()

if __name__ == '__main__':
    start = time.time()
    years = [year for year in range(1996, 2024)]
    # Create a thread pool with 5 threads
    pool = Pool(5)
    # Use the pool's 5 threads to process the tasks; there is one task per element in the years list
    pool.map(get_info, years)
    end = time.time()
    print(f"Time taken with the thread pool: {end - start}")  # 20.008408784866333
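The pool.map call above works like the built-in map, but distributes the calls across the worker threads and returns the results in input order. Its behavior can be seen with a trivial, network-free stand-in function (slow_square here is purely illustrative):

```python
import time
from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing.Pool API

def slow_square(n):
    time.sleep(0.1)  # simulate an I/O wait such as an HTTP request
    return n * n

with Pool(5) as pool:
    start = time.time()
    results = pool.map(slow_square, range(10))  # blocks until all 10 tasks finish
    elapsed = time.time() - start

print(results)  # results preserve input order: [0, 1, 4, 9, ..., 81]
print(elapsed)  # roughly 0.2s: 10 tasks of 0.1s each, overlapped across 5 threads
```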
- Comparing the two approaches on the same data, the thread pool is about 12 s faster. With larger data volumes the gap widens further, especially for I/O-bound work such as downloading images or video. To sum up: for the thousands or even tens of thousands of client requests a real workload may generate, a thread pool or connection pool can relieve some of the pressure, but it cannot solve everything. Multithreading handles small-scale request loads conveniently and efficiently, yet it too hits a bottleneck under large-scale loads.
- There are two ways to write thread-pool code. The first uses Pool, as in the code above; the other is as follows:
from concurrent.futures import ThreadPoolExecutor

start = time.time()
with ThreadPoolExecutor(5) as t:
    for y in range(1996, 2024):
        t.submit(get_info, y)
end = time.time()
print(f"Time taken with the thread pool: {end - start}")
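Unlike the Pool version, submit returns a Future, so each worker's return value (or exception) can also be collected. A minimal, network-free sketch, with fake_fetch standing in for get_info (the function name and its payload are invented for illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_fetch(year):
    time.sleep(0.05)            # stand-in for the HTTP request
    return year, f"{year}.csv"  # pretend we produced this file

results = {}
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fake_fetch, y) for y in range(1996, 2000)]
    for fut in as_completed(futures):  # yields each future as it finishes
        year, filename = fut.result()  # .result() re-raises any worker exception
        results[year] = filename

print(sorted(results))  # all four submitted years are accounted for
```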
From: https://www.cnblogs.com/xwltest/p/17060845.html