Take Qidian (起点小说网) as the example.
URL:
https://www.qidian.com/rank/yuepiao/
This assumes you have already created a Scrapy project; if not, see my earlier article scrapy框架之创建项目运行爬虫 (on creating a Scrapy project and running a spider).
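If you need the setup commands, a minimal sketch looks like this (the project name qidian is an assumed placeholder; the spider name matches the code below):

scrapy startproject qidian
cd qidian
scrapy genspider qidianspider www.qidian.com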
Crawl the page
- Locate the target elements
- Write the spider file
1. First print the results to the console to check whether the crawl succeeded
import scrapy

class QidianspiderSpider(scrapy.Spider):
    name = 'qidianspider'
    allowed_domains = ['www.qidian.com']
    start_urls = ['https://www.qidian.com/rank/yuepiao/']

    def parse(self, response):
        # Book titles: the <a> inside the <h2> of the first list item
        names = response.xpath('//*[@id="book-img-text"]/ul/li[1]/div[2]/h2/a/text()').extract()
        # Authors: the text of the second <p> in the same item
        authors = response.xpath('//*[@id="book-img-text"]/ul/li[1]/div[2]/p[2]/text()').extract()
        print(names)
        print(authors)
Code notes: pass an XPath expression to response.xpath().
extract() pulls the matched results out as a list of plain strings, stripping away the surrounding selector objects and tags.
Note: in //*[@id="book-img-text"]/ul/li[1]/div[2]/h2/a/text(), deleting h2 means you no longer fetch only the <a> tag under h2, but every <a> tag under div[2] (in XPath, // matches any number of intermediate levels, so it can be used to omit part of the path). So change it to //*[@id="book-img-text"]/ul/li[1]/div[2]//a/text(); the second expression works the same way.
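To make the // shorthand and extract() concrete, here is a minimal sketch on a made-up HTML fragment (the fragment is illustrative only, not the real Qidian markup):

from scrapy import Selector

# Toy fragment mimicking one ranking entry (assumed structure, for illustration)
html = '''
<div class="info">
  <h2><a href="/book/1">Book Title</a></h2>
  <p><a href="/author/1">Author Name</a></p>
</div>
'''
sel = Selector(text=html)

# /h2/a matches only the <a> directly under <h2>
print(sel.xpath('//div/h2/a/text()').extract())  # ['Book Title']

# //a matches <a> tags at any depth below <div>
print(sel.xpath('//div//a/text()').extract())    # ['Book Title', 'Author Name']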
2. Loop over the extracted data and save it to a JSON file
import scrapy

class QidianspiderSpider(scrapy.Spider):
    name = 'qidianspider'
    allowed_domains = ['www.qidian.com']
    start_urls = ['https://www.qidian.com/rank/yuepiao/']

    def parse(self, response):
        # Every <a> text under div[2] of the first list item (see the note above)
        names = response.xpath('//*[@id="book-img-text"]/ul/li[1]/div[2]//a/text()').extract()
        # Every text node under the same div[2]
        authors = response.xpath('//*[@id="book-img-text"]/ul/li[1]/div[2]//text()').extract()
        book = []
        for name, author in zip(names, authors):
            book.append({'name': name, 'author': author})
        return book  # Scrapy's feed export serializes the returned list
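If you want exactly one item per book instead of pairing two flat lists, a common alternative (a sketch under the same assumed page structure, not from the original post) iterates over each <li> and uses relative XPath:

def parse(self, response):
    # One iteration per ranking entry; './' makes the paths relative to each <li>
    for li in response.xpath('//*[@id="book-img-text"]/ul/li'):
        yield {
            'name': li.xpath('./div[2]/h2/a/text()').extract_first(),
            'author': li.xpath('./div[2]/p[2]/text()').extract_first(),
        }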
Run the spider
scrapy crawl qidianspider -o yy.xml
Code notes: -o means write the output to a file and is followed by the filename; json, xml, and csv formats are supported.
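Scrapy picks the format from the file extension, so for example (filenames here are arbitrary):

scrapy crawl qidianspider -o books.json
scrapy crawl qidianspider -o books.csv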