1. Get the novel title and the chapter (detail-page) links
url = 'https://www.bqg99.com/book/109323/'
list_html = requests.get(url=url, headers=headers)
selector = etree.HTML(list_html.text)
lis = selector.xpath('/html/body/div[@class="listmain"]//dd/a/@href')  # hrefs of all chapter pages
title = selector.xpath('/html/body/div/span/text()')[0]  # novel title
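One thing to watch in this step: `xpath()` returns a list, so indexing it with `[0]` raises `IndexError` whenever the expression matches nothing (for example if the page layout changes). A minimal sketch of a guard follows; the helper name `first_or_default` is hypothetical and not part of lxml.

```python
def first_or_default(results, default=""):
    """Return the first item of an XPath result list, or `default` if the list is empty."""
    # xpath() on text()/@href expressions yields a plain list of strings,
    # so an empty match is simply an empty list.
    return results[0] if results else default


# Hypothetical usage with the selector from this step:
# title = first_or_default(selector.xpath('/html/body/div/span/text()'), "untitled")
```

With this, a missing title falls back to a placeholder instead of stopping the script.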
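2. Build the full chapter URLs and collect them in a list
emptylist = []
for i in lis:
    href_list = "https://www.bqg99.com" + i  # the hrefs are site-relative, so prepend the domain
    # print(href_list)
    emptylist.append(href_list)
del emptylist[10]  # the 11th link is not a chapter page, so drop it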
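Dropping index 10 works only while the unwanted link stays in that position. As a sketch of a less position-dependent approach, the links can be filtered by content and joined with `urllib.parse.urljoin`. This assumes the unwanted entry is a `javascript:` pseudo-link, which the original post does not state; `build_chapter_urls` and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bqg99.com"


def build_chapter_urls(hrefs, base_url=BASE_URL):
    """Turn site-relative hrefs into absolute URLs, skipping non-page links."""
    urls = []
    for href in hrefs:
        # Assumption: the link that is not a chapter is a javascript: pseudo-link.
        if href.lower().startswith("javascript:"):
            continue
        urls.append(urljoin(base_url, href))
    return urls


# Hypothetical sample input, shaped like the hrefs collected in step 1.
sample = ["/book/109323/1.html", "javascript:dd_show()", "/book/109323/2.html"]
chapter_urls = build_chapter_urls(sample)
```

If the unwanted link turns out to be an ordinary URL, the filter condition needs to be adapted accordingly.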
3. Request each chapter, extract its text, and save it
for li in emptylist:
    req = requests.get(url=li, headers=headers)
    sel = etree.HTML(req.text)
    content = sel.xpath('//*[@id="chaptercontent"]/text()')  # list of text nodes in the chapter body
    chapter = sel.xpath('//*[@id="read"]/div/span/text()')[0]  # chapter title
    content = '\n'.join(content)  # join the list with newlines
    content = content.replace('请收藏本站:https://www.bqg99.com。笔趣阁手机版:https://m.bqg99.com ','')  # strip the site's promo line
    this_chapter = f'\n{chapter}\n{content}'
    # file_name is defined in the full code in section 4
    with open(file=file_name, mode='a', encoding='UTF-8') as f:
        f.write(this_chapter)
    print(f'{chapter} -- download complete!')  # progress message
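The exact-match `replace` above only removes the promo line if its wording and trailing space match character for character. A slightly more tolerant sketch is to clean the text nodes before joining them: trim each line, drop blank ones, and skip any line that starts with the promo prefix. The helper name `clean_chapter` and the prefix-based check are assumptions, not part of the original code.

```python
def clean_chapter(lines, promo_prefix="请收藏本站"):
    """Join chapter text nodes into one string, dropping blank and promo lines."""
    cleaned = []
    for line in lines:
        text = line.strip()  # remove leading/trailing whitespace
        if not text:
            continue  # skip empty text nodes
        if text.startswith(promo_prefix):
            continue  # skip the site's "bookmark this site" promo line
        cleaned.append(text)
    return "\n".join(cleaned)


# Hypothetical sample, shaped like the result of the chaptercontent XPath.
sample_nodes = [
    "  First paragraph. ",
    "",
    "Second paragraph.",
    "请收藏本站:https://www.bqg99.com。笔趣阁手机版:https://m.bqg99.com ",
]
chapter_text = clean_chapter(sample_nodes)
```

In the loop, this would stand in for the `'\n'.join(content)` and `replace` lines.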
4. Full code
import os

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://www.bqg99.com/book/109323/'
list_html = requests.get(url=url, headers=headers)
selector = etree.HTML(list_html.text)
lis = selector.xpath('/html/body/div[@class="listmain"]//dd/a/@href')  # hrefs of all chapter pages
title = selector.xpath('/html/body/div/span/text()')[0]  # novel title
os.makedirs('小说', exist_ok=True)  # open() does not create the output folder, so make sure it exists
file_name = f'小说/{title}.txt'  # local output file

emptylist = []
for i in lis:
    href_list = "https://www.bqg99.com" + i  # the hrefs are site-relative, so prepend the domain
    # print(href_list)
    emptylist.append(href_list)
del emptylist[10]  # the 11th link is not a chapter page, so drop it
# print(emptylist)

for li in emptylist:
    req = requests.get(url=li, headers=headers)
    sel = etree.HTML(req.text)
    content = sel.xpath('//*[@id="chaptercontent"]/text()')  # list of text nodes in the chapter body
    chapter = sel.xpath('//*[@id="read"]/div/span/text()')[0]  # chapter title
    content = '\n'.join(content)  # join the list with newlines
    content = content.replace('请收藏本站:https://www.bqg99.com。笔趣阁手机版:https://m.bqg99.com ','')  # strip the site's promo line
    this_chapter = f'\n{chapter}\n{content}'
    with open(file=file_name, mode='a', encoding='UTF-8') as f:  # append so chapters accumulate in order
        f.write(this_chapter)
    print(f'{chapter} -- download complete!')  # progress message
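The script above stops at the first failed request. One possible way to make a long download more robust is a small retry wrapper with a pause between attempts. This is only a sketch under assumptions: `fetch_with_retry` is a hypothetical helper, and the fetch function is passed in as a parameter so the logic does not depend on any particular HTTP library.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying up to `retries` times with `delay` seconds in between."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # in real use, narrow this to the library's network errors
            last_error = exc
            if attempt < retries:
                time.sleep(delay)  # wait before the next attempt
    raise last_error


# Hypothetical usage inside the chapter loop:
# req = fetch_with_retry(lambda u: requests.get(url=u, headers=headers), li)
```

A short delay between requests also keeps the load on the site down.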
5. Summary
a. Not collecting the detail-page links into a list caused an error.
b. Receiving the data with etree caused an error.
From: https://blog.51cto.com/u_15698082/5871301