Implementing a Crawler with Python Multithreading

Date: 2022-12-19 10:37:21

Tags: __, source code, python, self, crawler, queue, thread, multithreading, out

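Python is a popular language for web crawlers, and many internet companies write their crawlers in it. When a large number of pages has to be crawled, fetching them one at a time is slow; running the downloads in multiple threads lets the crawler work on several pages at once. The multithreaded example below shows one way to do this.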

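The program uses two queues: queue holds the URLs to fetch, and out_queue holds the source code of the downloaded pages.

Each ThreadUrl thread takes a URL from queue, downloads the page source with urlopen, and puts it on out_queue.

Each DatamineThread thread takes page source from out_queue, uses the BeautifulSoup module to extract the desired content, and prints it.

This is only a basic framework; it can be extended as needed.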
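The two-queue hand-off described above can be tried without any network access. The following is a minimal Python 3 sketch, not the article's program: run_pipeline, stage_one, and stage_two are hypothetical names, and the two stage functions stand in for "download a page" and "extract content".

```python
import queue
import threading


def run_pipeline(items, stage_one, stage_two, workers=2):
    """Push items through two worker stages linked by queues; return the results."""
    in_q = queue.Queue()   # plays the role of `queue` (URLs)
    mid_q = queue.Queue()  # plays the role of `out_queue` (page source)
    results = []
    lock = threading.Lock()

    def first_worker():
        while True:
            item = in_q.get()  # blocks until an item is available
            try:
                mid_q.put(stage_one(item))
            finally:
                # Always mark the item done so in_q.join() cannot hang.
                in_q.task_done()

    def second_worker():
        while True:
            data = mid_q.get()
            try:
                value = stage_two(data)
                with lock:
                    results.append(value)
            finally:
                mid_q.task_done()

    for _ in range(workers):
        threading.Thread(target=first_worker, daemon=True).start()
        threading.Thread(target=second_worker, daemon=True).start()

    for item in items:
        in_q.put(item)

    in_q.join()   # every item has passed stage one and been put on mid_q
    mid_q.join()  # every intermediate result has passed stage two
    return results


# Stand-in stages: double each number, then add one.
print(sorted(run_pipeline([1, 2, 3], lambda x: x * 2, lambda y: y + 1)))
```

The results arrive in whatever order the threads finish, which is why the usage line sorts them before printing.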

The code is commented in detail. Corrections are welcome.

The code, written for Python 2:

import Queue  
import threading  
import urllib2  
import time  
from BeautifulSoup import BeautifulSoup

hosts = ["http://yahoo.com","http://taobao.com","http://apple.com",  
         "http://ibm.com","http://www.amazon.cn"]

queue = Queue.Queue()  # queue of URLs to fetch
out_queue = Queue.Queue()  # queue of downloaded page source

class ThreadUrl(threading.Thread):  
    def __init__(self,queue,out_queue):  
        threading.Thread.__init__(self)  
        self.queue = queue  
        self.out_queue = out_queue

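    def run(self):
        while True:
            host = self.queue.get()  # block until a URL is available
            try:
                url = urllib2.urlopen(host)
                chunk = url.read()
                self.out_queue.put(chunk)  # pass the page source on to out_queue
            except Exception as e:
                # a failed download must not kill the thread
                print "Failed to fetch %s: %s" % (host, e)
            finally:
                self.queue.task_done()  # always mark the URL done, or queue.join() hangs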

class DatamineThread(threading.Thread):  
    def __init__(self,out_queue):  
        threading.Thread.__init__(self)  
        self.out_queue = out_queue

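    def run(self):
        while True:
            chunk = self.out_queue.get()  # block until page source is available
            try:
                soup = BeautifulSoup(chunk)  # parse the page source
                print soup.findAll(['title'])  # print the <title> tags found
            except Exception as e:
                print "Failed to parse page: %s" % e
            finally:
                self.out_queue.task_done()  # always mark the page done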

start = time.time()
def main():
    for i in range(5):
        t = ThreadUrl(queue, out_queue)  # downloads page source into out_queue
        t.setDaemon(True)  # daemon thread: exits when the main thread exits
        t.start()

    # put every URL on the queue
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)  # extracts the <title> tag from page source
        dt.setDaemon(True)
        dt.start()

    queue.join()  # block until every URL has been marked done
    out_queue.join()  # block until every page has been parsed

main()
print "Total time :%s" % (time.time() - start)
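The listing above targets Python 2 (Queue, urllib2, the print statement, and the old BeautifulSoup import). A rough Python 3 sketch of the same structure follows. It is an adaptation under assumptions, not the author's code: it swaps BeautifulSoup for the standard-library html.parser so that it needs no third-party package, assumes pages decode as UTF-8, and the names TitleParser, extract_title, fetch, and crawl are all hypothetical. The download function is passed in as a parameter so the pipeline can be exercised without network access.

```python
import queue
import threading
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":  # HTMLParser lowercases tag names
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def fetch(url):
    """Download a page; decoding as UTF-8 is an assumption, not a detected charset."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(hosts, fetch_page=fetch, workers=5):
    """Return a dict mapping each successfully fetched URL to its page title."""
    url_queue = queue.Queue()   # URLs waiting to be downloaded
    page_queue = queue.Queue()  # (url, html) pairs waiting to be parsed
    titles = {}
    lock = threading.Lock()

    def download():
        while True:
            url = url_queue.get()
            try:
                page_queue.put((url, fetch_page(url)))
            except Exception as exc:
                print("Failed to fetch %s: %s" % (url, exc))
            finally:
                url_queue.task_done()

    def datamine():
        while True:
            url, html = page_queue.get()
            try:
                title = extract_title(html)
                with lock:
                    titles[url] = title
            finally:
                page_queue.task_done()

    for _ in range(workers):
        threading.Thread(target=download, daemon=True).start()
        threading.Thread(target=datamine, daemon=True).start()

    for host in hosts:
        url_queue.put(host)

    url_queue.join()   # all URLs downloaded (or failed)
    page_queue.join()  # all downloaded pages parsed
    return titles
```

Calling crawl(hosts) with the hosts list from the listing would perform real downloads; URLs that fail to download are reported and left out of the returned dict.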

From: https://blog.51cto.com/u_13488918/5951459
