Create a Scrapy project
scrapy startproject <project-name>
Create a spider (the crawler file)
scrapy genspider <spider-name> <domain>
Crawl a page (using Baidu as the example)
- Write the spider file

import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        print(response.text)
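In a real spider you would usually extract data in parse with Scrapy's selectors, e.g. response.xpath('//title/text()').get(), rather than just printing the raw HTML. As a standalone sketch of the same idea using only the standard library (the helper names here are illustrative, not part of Scrapy), you could pull the page title out of the HTML like this:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html):
    # Feed the document through the parser and return the collected title text
    parser = TitleParser()
    parser.feed(html)
    return parser.title

print(extract_title('<html><head><title>百度一下</title></head></html>'))
```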
You also need to change a couple of options in settings.py:

# UA spoofing (makes the crawler's requests look like they come from a browser)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.42'
# Stop honoring robots.txt (the "gentlemen's agreement")
ROBOTSTXT_OBEY = False
- Run the spider
1. From the terminal
scrapy crawl baidu
2. From a script
main.py
from scrapy.cmdline import execute

execute('scrapy crawl baidu'.split())
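The reason for calling .split() on the command string is that execute() expects an argv-style list of arguments, not a single string:

```python
# .split() turns the command string into the argv-style list that execute() expects
args = 'scrapy crawl baidu'.split()
print(args)  # ['scrapy', 'crawl', 'baidu']
```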
The red text is not an error; it is Scrapy's log output.
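If that log output is too noisy, you can raise the log level in settings.py. This is an optional tweak using Scrapy's standard LOG_LEVEL setting, not something the steps above require:

```python
# settings.py — only show warnings and errors, hiding the INFO/DEBUG log lines
LOG_LEVEL = 'WARNING'
```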