Installing Twisted and the Scrapy crawler framework on Ubuntu 10.04
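Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing. Part of Scrapy's appeal is that it is a framework, so anyone can easily adapt it to their own needs. It also provides base classes for several kinds of spiders, such as BaseSpider and sitemap spiders, and the latest version adds support for crawling Web 2.0 sites.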
Python 2.5, 2.6, 2.7 (3.x is not yet supported)
Twisted 2.5.0, 8.0 or above (Windows users: you’ll need to install Zope.Interface and maybe pywin32 because of this Twisted bug)
lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
simplejson (not required if using Python 2.6 or above)
pyopenssl (for HTTPS support. Optional, but highly recommended)
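A quick way to see which of these requirements are already present is to try importing each one. The following is only a minimal sketch: the helper name check_modules and the REQUIRED list are made up for illustration, and the import names (for example OpenSSL for pyopenssl) are the usual ones for these packages rather than something stated in the list above.

```python
def check_modules(names):
    """Map each module name to True if it can be imported, else False."""
    results = {}
    for name in names:
        try:
            __import__(name)
            results[name] = True
        except Exception:
            # A module that raises while importing is treated as unavailable.
            results[name] = False
    return results


# Assumed import names for the requirements (pyopenssl imports as OpenSSL).
REQUIRED = ["twisted", "lxml", "libxml2", "simplejson", "OpenSSL"]

if __name__ == "__main__":
    for name, ok in sorted(check_modules(REQUIRED).items()):
        print("%-12s %s" % (name, "ok" if ok else "missing"))
```

Only one of lxml and libxml2 is needed, and simplejson can be missing on Python 2.6 or above, so a "missing" line is not necessarily a problem.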
sudo apt-get install python-twisted python-libxml2 python-simplejson
wget http://pypi.python.org/packages/source/p/pycrypto/pycrypto-2.5.tar.gz#md5=783e45d4a1a309e03ab378b00f97b291
tar -zxvf pycrypto-2.5.tar.gz
cd pycrypto-2.5
sudo python setup.py install
/etc/hosts and /etc/hostname must be consistent; otherwise an error is reported.
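The post does not say which error appears, so the following is only a sketch of one way to check the consistency it asks for: does the machine's hostname show up as a name in the hosts file? The function name hostname_in_hosts is hypothetical, and the check works on text passed in so it can be tried without touching the real files.

```python
def hostname_in_hosts(hostname, hosts_text):
    """Return True if `hostname` appears as a name on any hosts-file line."""
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0]      # drop trailing comments
        fields = line.split()
        if hostname in fields[1:]:        # fields[0] is the IP address
            return True
    return False


# Example with made-up file contents.
sample = "127.0.0.1 localhost\n127.0.1.1 ubuntu\n"
print(hostname_in_hosts("ubuntu", sample))
print(hostname_in_hosts("otherbox", sample))
```

On a real machine it could be called as hostname_in_hosts(socket.gethostname(), open("/etc/hosts").read()) after importing socket.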
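Python version: 2.6.5. Update the installation with the development package below; otherwise the build fails with a bad gcc exit status.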
sudo apt-get install python-dev
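Extension builds such as pycrypto need the interpreter's C headers, which is what python-dev supplies. As a rough check, assuming the missing header is the cause of the gcc failure (the post does not show the full error), one can look for Python.h where the running interpreter expects it. The function names are illustrative; the sysconfig module used here exists from Python 2.7 on, and on 2.6 distutils.sysconfig.get_python_inc() gives the same directory.

```python
import os
import sysconfig


def python_header_path():
    """Return the expected location of Python.h for the running interpreter."""
    return os.path.join(sysconfig.get_paths()["include"], "Python.h")


def has_python_headers():
    """True if Python.h is present, i.e. C extensions can be compiled."""
    return os.path.isfile(python_header_path())


if __name__ == "__main__":
    print("%s present: %s" % (python_header_path(), has_python_headers()))
```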
wget http://pypi.python.org/packages/source/p/pycrypto/pycrypto-2.5.tar.gz#md5=783e45d4a1a309e03ab378b00f97b291
tar -zxvf pycrypto-2.5.tar.gz
cd pycrypto-2.5
sudo python setup.py install
When installing under Python 2.6.5, this warning appears:
warning: GMP or MPIR library not found; Not building Crypto.PublicKey._fastmath.
After upgrading Python to 2.7, the warning disappears.
wget -c http://www.python.org/ftp/python/2.7/Python-2.7.tar.bz2
tar -xvjpf Python-2.7.tar.bz2
cd Python-2.7
./configure
make
sudo make altinstall
cd /usr/bin
sudo mv python python.bak
sudo mv python-config python-config.bak
sudo mv python2 python2.bak
cd /usr/local/bin
sudo ln -s python2.7 python
sudo ln -s python2.7-config python-config
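After swapping the links it is worth confirming which version the python command now starts. A small sketch, with the hypothetical helper interpreter_version asking an interpreter to report its own version (subprocess.check_output requires Python 2.7 or later):

```python
import subprocess
import sys


def interpreter_version(executable="python"):
    """Run `executable` and return its (major, minor) version as a tuple."""
    out = subprocess.check_output(
        [executable, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"]
    )
    major, minor = out.decode("ascii").strip().split(".")
    return int(major), int(minor)


# Sanity check against the interpreter running this script.
print(interpreter_version(sys.executable) == tuple(sys.version_info[:2]))
```

Calling interpreter_version("python") should return (2, 7) once the new links are in place and /usr/local/bin is on the PATH.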
wget http://pypi.python.org/packages/source/p/pyOpenSSL/pyOpenSSL-0.13.tar.gz#md5=767bca18a71178ca353dff9e10941929
tar -zxvf pyOpenSSL-0.13.tar.gz
cd pyOpenSSL-0.13
sudo python setup.py install
>>> import Crypto
>>> import twisted.conch.ssh.transport
>>> print Crypto.PublicKey.RSA
<module 'Crypto.PublicKey.RSA' from '/usr/python/lib/python2.5/site-packages/Crypto/PublicKey/RSA.pyc'>
>>> import OpenSSL
>>> import twisted.internet.ssl
>>> twisted.internet.ssl
<module 'twisted.internet.ssl' from '/usr/python/lib/python2.5/site-packages/Twisted-10.1.0-py2.5-linux-i686.egg/twisted/internet/ssl.pyc'>
> # python
> Python 2.6.6 (r266:84292, Dec 7 2011, 20:38:36)
> [GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import OpenSSL
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "OpenSSL/__init__.py", line 40, in <module>
>     from OpenSSL import crypto
> ImportError: cannot import name crypto
Notice that the complaint is about "OpenSSL/__init__.py" instead of something more sensible like "/usr/lib/python2.6/site-packages/OpenSSL/__init__.py". You're probably testing this using a working directory inside the pyOpenSSL source directory, and thus getting the wrong version of the OpenSSL package (one that does not include the built extension modules). Try testing in a different directory - or build the extension modules in-place using the -i option to distutils' build_ext command.
cd pyOpenSSL-0.13
cd ..
Once you leave the pyOpenSSL-0.13 directory, the error no longer occurs.
If this doesn't solve the problem, consider asking in a forum dedicated to CentOS 6 or pyOpenSSL, since the issue isn't really based on any software or other materials from the Twisted project. Also, include more information when you do so, for example a full installation transcript and a manifest of installed files, otherwise it's not likely anyone will be able to provide a better answer.
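The shadowing described in the quoted answer can be detected without triggering the failing import, by asking Python where it would load the package from. This is a sketch under assumptions: the helper names find_module_path and is_under_cwd are invented here, and the lookup uses importlib.util.find_spec on Python 3.4 or later and falls back to imp.find_module on the Python 2.x versions used in this post.

```python
import os


def find_module_path(name):
    """Return the path a top-level module would be imported from,
    without executing the module itself."""
    try:
        from importlib.util import find_spec    # Python 3.4 and later
    except ImportError:
        import imp                               # Python 2.x
        return os.path.abspath(imp.find_module(name)[1])
    spec = find_spec(name)
    if spec is None or not spec.has_location:
        raise ImportError("cannot locate %s on disk" % name)
    path = os.path.abspath(spec.origin)
    # For a package, origin points at __init__.py; report the package directory.
    if spec.submodule_search_locations is not None:
        path = os.path.dirname(path)
    return path


def is_under_cwd(path):
    """True if `path` is the current working directory or lies below it."""
    cwd = os.path.abspath(os.getcwd())
    path = os.path.abspath(path)
    return path == cwd or path.startswith(cwd.rstrip(os.sep) + os.sep)
```

In an interactive session started inside pyOpenSSL-0.13, is_under_cwd(find_module_path("OpenSSL")) would be expected to return True, which signals that the source tree rather than the installed package is being picked up.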
Install the easy_install tool:
sudo apt-get install python-setuptools
sudo easy_install -U w3lib
wget http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.3.tar.gz#md5=59f1225f7692f28fa0f78db3d34b3850
tar -zxvf Scrapy-0.14.3.tar.gz
cd Scrapy-0.14.3
sudo python setup.py install
$ scrapy
Scrapy 0.14.3 - no active project
scrapy <command> [options] [args]
Available commands:
fetch Fetch a URL using the Scrapy downloader
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Use "scrapy <command> -h" to see more info about a command
$ scrapy fetch --help
scrapy fetch [options] <url>
Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging
--help, -h show this help message and exit
--spider=SPIDER use this spider
--headers print response HTTP headers instead of body
Global Options
--logfile=FILE log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
log level (default: DEBUG)
--nolog disable logging completely
--profile=FILE write python cProfile stats to FILE
--lsprof=FILE write lsprof profiling stats to FILE
--pidfile=FILE write process ID to FILE
set/override setting (may be repeated)
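If the fetch command is to be driven from a script, the options listed in this help text can be assembled into an argument list and passed to subprocess. The sketch below uses only flags that appear in the help output (--nolog, --headers, --spider); the function names build_fetch_command and fetch_to_file are hypothetical, and fetch_to_file assumes the scrapy executable is on the PATH.

```python
import subprocess


def build_fetch_command(url, nolog=True, headers=False, spider=None):
    """Build the argument list for `scrapy fetch` from the documented options."""
    cmd = ["scrapy", "fetch"]
    if nolog:
        cmd.append("--nolog")              # disable logging completely
    if headers:
        cmd.append("--headers")            # print response headers, not the body
    if spider:
        cmd.append("--spider=%s" % spider)
    cmd.append(url)
    return cmd


def fetch_to_file(url, path, **options):
    """Run `scrapy fetch` and write its stdout to `path`; returns the exit code."""
    with open(path, "wb") as out:
        return subprocess.call(build_fetch_command(url, **options), stdout=out)


print(build_fetch_command("http://doc.scrapy.org/en/latest/intro/install.html"))
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as & or ?.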
ubuntu[/home/ioslabs/scrapy]scrapy fetch http://doc.scrapy.org/en/latest/intro/install.html > install.html
2012-07-19 11:11:34+0800 [scrapy] INFO: Scrapy 0.14.3 started (bot: scrapybot)
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Enabled item pipelines:
2012-07-19 11:11:35+0800 [default] INFO: Spider opened
2012-07-19 11:11:35+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Telnet console listening on
2012-07-19 11:11:35+0800 [scrapy] DEBUG: Web service listening on
2012-07-19 11:11:35+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/intro/install.html> (referer: None)
2012-07-19 11:11:35+0800 [default] INFO: Closing spider (finished)
2012-07-19 11:11:35+0800 [default] INFO: Dumping spider stats:
{'downloader/request_bytes': 227,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 21943,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 7, 19, 3, 11, 35, 902943),
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 7, 19, 3, 11, 35, 559084)}
2012-07-19 11:11:35+0800 [default] INFO: Spider closed (finished)
2012-07-19 11:11:35+0800 [scrapy] INFO: Dumping global stats:
{'memusage/max': 23015424, 'memusage/startup': 23015424}
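The spider stats in this log are ordinary Python values, so they are easy to work with. As a small worked example using the numbers copied from the dump above, the duration of the crawl is the difference between finish_time and start_time. Note that these timestamps appear to be in UTC (03:11), while the log lines carry the local +0800 offset (11:11).

```python
from datetime import datetime

# Values copied from the "Dumping spider stats" block in the log.
stats = {
    'downloader/response_bytes': 21943,
    'start_time': datetime(2012, 7, 19, 3, 11, 35, 559084),
    'finish_time': datetime(2012, 7, 19, 3, 11, 35, 902943),
}

elapsed = stats['finish_time'] - stats['start_time']
# timedelta.total_seconds() is available from Python 2.7 on.
elapsed_seconds = elapsed.total_seconds()
print("crawl took %.6f s" % elapsed_seconds)
```

For this run the single page was fetched in roughly 0.34 seconds.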
The tutorial is at http://doc.scrapy.org/en/latest/intro/tutorial.html
From: https://blog.51cto.com/u_2650279/7480470