
Installing Twisted and the Scrapy Crawling Framework on Ubuntu 10.04

Date: 2023-09-15 14:03:43
Tags: python, Scrapy, Twisted, Ubuntu





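Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It is used to crawl websites and extract structured data from their pages. Scrapy has a wide range of uses, including data mining, monitoring, and automated testing. Part of its appeal is that it is a framework, so anyone can adapt it to their own needs. It also provides base classes for several kinds of spiders, such as BaseSpider and sitemap spiders, and the latest version adds support for crawling web 2.0 sites.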


 



Preparation


Requirements


Python 2.5, 2.6, 2.7 (3.x is not yet supported)


Twisted 2.5.0, 8.0 or above (Windows users: you’ll need to install Zope.Interface and maybe pywin32 because of this Twisted bug)



w3lib



lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)



simplejson (not required if using Python 2.6 or above)



pyopenssl (for HTTPS support. Optional, but highly recommended)
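The simplejson entry in the list above exists because Python 2.6 and later ship a json module in the standard library, while Python 2.5 does not. A minimal sketch of the usual import fallback follows; the sample JSON string is made up for illustration and is not part of the original instructions.

```python
# Prefer the standard-library json module (available from Python 2.6 on) and
# fall back to simplejson only on interpreters that lack it, such as 2.5.
try:
    import json
except ImportError:
    # Assumption: simplejson was installed, e.g. via the python-simplejson package.
    import simplejson as json

# Illustrative data only; any JSON document works the same way.
record = json.loads('{"framework": "Scrapy", "version": "0.14.3"}')
print(record["framework"])
```

Code written this way does not care which of the two packages is actually present.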



---------------------------------------------



Installing Twisted



sudo apt-get install python-twisted python-libxml2 python-simplejson



After the installation finishes, start python and test whether Twisted was installed successfully.
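One way to run that test is to try importing each package and report which ones are missing. The sketch below is only an illustration: the helper name check_imports is hypothetical, and the module names in the usage section are assumed to correspond to the three apt packages installed above.

```python
def check_imports(module_names):
    """Return a dict mapping each module name to True if it can be imported."""
    results = {}
    for name in module_names:
        try:
            __import__(name)
            results[name] = True
        except ImportError:
            # The module is not installed, or not visible to this interpreter.
            results[name] = False
    return results


if __name__ == "__main__":
    # Assumed module names for python-twisted, python-libxml2, python-simplejson.
    status = check_imports(["twisted", "libxml2", "simplejson"])
    for name in sorted(status):
        print("%-12s %s" % (name, "OK" if status[name] else "MISSING"))
```

A MISSING entry means the package is not importable by the interpreter that ran the script, which may differ from the one apt installed packages for.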



pycrypto



wget http://pypi.python.org/packages/source/p/pycrypto/pycrypto-2.5.tar.gz#md5=783e45d4a1a309e03ab378b00f97b291



tar -zxvf pycrypto-2.5.tar.gz



cd pycrypto-2.5



sudo python setup.py install
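The download URL above carries an `#md5=` fragment, which gives the expected checksum of the archive. Checking it before building is optional; a minimal sketch using the standard hashlib module could look like the following. The helper names md5_of_file and matches_md5 are hypothetical, and the archive is assumed to be in the current directory.

```python
import hashlib


def md5_of_file(path, chunk_size=8192):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Return True if the file's MD5 digest equals expected_hex (case-insensitive)."""
    return md5_of_file(path) == expected_hex.lower()


# Value copied from the #md5= fragment of the pycrypto URL in this article.
EXPECTED_PYCRYPTO_MD5 = "783e45d4a1a309e03ab378b00f97b291"
```

Calling `matches_md5("pycrypto-2.5.tar.gz", EXPECTED_PYCRYPTO_MD5)` after the wget step returns False if the download is incomplete or corrupted.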



 



The host name in /etc/hosts and /etc/hostname must match; otherwise an error is reported.
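A rough way to check this is to confirm that the name stored in /etc/hostname appears as a host name on some line of /etc/hosts. The sketch below assumes the usual file formats (a single name in /etc/hostname; an address followed by names, with `#` comments, in /etc/hosts). The function names are hypothetical.

```python
def hostname_is_consistent(hostname_text, hosts_text):
    """Return True if the name from /etc/hostname is listed in /etc/hosts."""
    hostname = hostname_text.strip()
    if not hostname:
        return False
    for line in hosts_text.splitlines():
        # Drop comments, then split into "address name1 name2 ...".
        fields = line.split("#", 1)[0].split()
        # fields[0] is the IP address; the remaining fields are host names.
        if hostname in fields[1:]:
            return True
    return False


def check_local_files(hostname_path="/etc/hostname", hosts_path="/etc/hosts"):
    """Read both files from their usual locations and compare them."""
    with open(hostname_path) as hostname_file:
        hostname_text = hostname_file.read()
    with open(hosts_path) as hosts_file:
        hosts_text = hosts_file.read()
    return hostname_is_consistent(hostname_text, hosts_text)
```

If `check_local_files()` returns False, add the machine's name to a line in /etc/hosts or correct /etc/hostname so that the two agree.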



 



Python version: 2.6.5. Install the Python development package first; otherwise the build fails with a gcc exit-status error:

sudo apt-get install python-dev






When pycrypto is built under Python 2.6.5, the build prints this warning:

warning: GMP or MPIR library not found; Not building Crypto.PublicKey._fastmath.

After upgrading Python to 2.7, the warning disappeared:

wget -c http://www.python.org/ftp/python/2.7/Python-2.7.tar.bz2

tar -xvjpf Python-2.7.tar.bz2

 

cd Python-2.7

./configure

make

sudo make altinstall

cd /usr/bin

sudo mv python python.bak

sudo mv python-config python-config.bak

sudo mv python2 python2.bak

cd /usr/local/bin

sudo ln -s python2.7 python

sudo ln -s python2.7-config python-config

 

pyOpenSSL
 
 
 

  wget http://pypi.python.org/packages/source/p/pyOpenSSL/pyOpenSSL-0.13.tar.gz#md5=767bca18a71178ca353dff9e10941929
 
 
 

  tar -zxvf pyOpenSSL-0.13.tar.gz
 
 
 

  cd pyOpenSSL-0.13
 
 
 

  sudo python setup.py install

Test whether the installation succeeded:


$ python
 
 
 

  >>> import Crypto
 
 
 

  >>> import twisted.conch.ssh.transport
 
 
 

  >>> print Crypto.PublicKey.RSA
 
 
 

  <module 'Crypto.PublicKey.RSA' from '/usr/python/lib/python2.5/site-packages/Crypto/PublicKey/RSA.pyc'>
 
 
 

  >>> import OpenSSL 
 
 
 

  >>> import twisted.internet.ssl
 
 
 

  >>> twisted.internet.ssl
 
 
 

  <module 'twisted.internet.ssl' from '/usr/python/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/ssl.pyc'>


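If you see output like the above, the pyOpenSSL module has been installed successfully. Otherwise, recheck the installation steps above (OpenSSL requires pycrypto).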


 



> # python
> Python 2.6.6 (r266:84292, Dec 7 2011, 20:38:36)
> [GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import OpenSSL
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "OpenSSL/__init__.py", line 40, in <module>
>     from OpenSSL import crypto
> ImportError: cannot import name crypto

Notice that the complaint is about "OpenSSL/__init__.py" instead of something more sensible like "/usr/lib/python2.6/site-packages/OpenSSL/__init__.py". You're probably testing this using a working directory inside the pyOpenSSL source directory, and thus getting the wrong version of the OpenSSL package (one that does not include the built extension modules). Try testing in a different directory - or build the extension modules in-place using the -i option to distutils' build_ext command.



cd pyOpenSSL-0.13



cd ..


Once you leave the pyOpenSSL-0.13 directory, the error no longer occurs.
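The cause is that an interactive python session searches the current directory first, so a local OpenSSL/ source directory hides the installed package. A small sketch for detecting this before importing is shown below. The helper name find_local_shadow is hypothetical, and it only checks the two most common layouts (a package directory or a single .py file).

```python
import os


def find_local_shadow(package_name, directory):
    """Return the path in directory that would shadow package_name, or None."""
    candidates = [
        # A package directory such as OpenSSL/__init__.py
        os.path.join(directory, package_name, "__init__.py"),
        # A single-file module such as OpenSSL.py
        os.path.join(directory, package_name + ".py"),
    ]
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    shadow = find_local_shadow("OpenSSL", os.getcwd())
    if shadow:
        print("Local copy would be imported instead: %s" % shadow)
    else:
        print("No local OpenSSL package in the current directory")
```

Run from inside the unpacked source tree, the script should report the local OpenSSL/__init__.py named in the traceback above; run from any other directory it should find nothing.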



If this doesn't solve the problem, consider asking in a forum dedicated to CentOS 6 or pyOpenSSL, since the issue isn't really based on any software or other materials from the Twisted project. Also, include more information when you do so, for example a full installation transcript and a manifest of installed files, otherwise it's not likely anyone will be able to provide a better answer.



Installing the easy_install tool


sudo apt-get install python-setuptools
 
 
 

  w3lib
 
 
 

  sudo easy_install -U w3lib
 
 
 

   
 
 
 

  Scrapy
 
 
 

  wget http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.3.tar.gz#md5=59f1225f7692f28fa0f78db3d34b3850
 
 
 

  tar -zxvf Scrapy-0.14.3.tar.gz
 
 
 

  cd Scrapy-0.14.3
 
 
 

  sudo python setup.py install


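Verifying the Scrapy installation



With the installation and configuration steps above complete, Scrapy is installed. We can verify it from the command line: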


$ scrapy
Scrapy 0.14.3 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
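The same check can be scripted, for example as part of a provisioning script. The sketch below runs a command and captures its output using only the standard subprocess module. The helper name command_output is hypothetical; `scrapy version` is one of the commands listed in the help output above.

```python
import subprocess


def command_output(argv):
    """Run argv and return (exit_code, output_text), or None if it cannot start."""
    try:
        process = subprocess.Popen(
            argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
        )
    except OSError:
        # Raised when the executable is not found on PATH or cannot be run.
        return None
    output, _ = process.communicate()
    return process.returncode, output.decode("utf-8", "replace")


if __name__ == "__main__":
    result = command_output(["scrapy", "version"])
    if result is None:
        print("scrapy is not on PATH")
    else:
        print("exit code %d: %s" % (result[0], result[1].strip()))
```

A None result points to a PATH problem rather than a broken Scrapy install, which narrows down where to look.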


The output above lists a fetch command, which downloads a given web page. Start by looking at its help text:


$ scrapy fetch --help
Usage
=====
  scrapy fetch [options] <url>

Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging

Options
=======
--help, -h              show this help message and exit
--spider=SPIDER         use this spider
--headers               print response HTTP headers instead of body

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--lsprof=FILE           write lsprof profiling stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)



Following the help text, pass a URL to fetch the contents of a web page:


ubuntu[/home/ioslabs/scrapy]scrapy fetch http://doc.scrapy.org/en/latest/intro/install.html > install.html
2012-07-19 11:11:34+0800 [scrapy] INFO: Scrapy 0.14.3 started (bot: scrapybot)
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Enabled item pipelines:
2012-07-19 11:11:35+0800 [default] INFO: Spider opened
2012-07-19 11:11:35+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-19 11:11:35+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/intro/install.html> (referer: None)
2012-07-19 11:11:35+0800 [default] INFO: Closing spider (finished)
2012-07-19 11:11:35+0800 [default] INFO: Dumping spider stats:
        {'downloader/request_bytes': 227,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 21943,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2012, 7, 19, 3, 11, 35, 902943),
         'scheduler/memory_enqueued': 1,
         'start_time': datetime.datetime(2012, 7, 19, 3, 11, 35, 559084)}
2012-07-19 11:11:35+0800 [default] INFO: Spider closed (finished)
2012-07-19 11:11:35+0800 [scrapy] INFO: Dumping global stats:
        {'memusage/max': 23015424, 'memusage/startup': 23015424}

As the log shows, we have successfully fetched a web page.
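The saved install.html is ordinary HTML, so a quick sanity check is to pull out its title. The sketch below uses the standard-library HTML parser, not Scrapy's own selectors, and is only meant to confirm that the file contains a real page. The names TitleParser and extract_title are hypothetical.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self.in_title:
            self.title_parts.append(data)


def extract_title(html_text):
    """Return the page title with surrounding whitespace removed."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.title_parts).strip()
```

To use it on the fetched page, read install.html into a string and pass that string to `extract_title`. An empty result suggests the redirect captured something other than the page body.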


To go further with the Scrapy framework, follow the guide on the official Scrapy site.



The tutorial is at http://doc.scrapy.org/en/latest/intro/tutorial.html



http://media.readthedocs.org/pdf/scrapy/0.14/scrapy.pdf

http://baike.baidu.com/view/6687996.htm

From: https://blog.51cto.com/u_2650279/7480470
