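Prologue
Lately, with IPs being bought up or strong-armed in bulk by tx and others, web fiction rarely produces high-quality work. So I plan to scrape 100 of the better novels from past years off a certain site (any more and things would blow up) to ease the spiritual famine.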
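Design
- A download function for novels, rar_down
- A function for picking out good novels, score_select: decides whether to scrape a novel based on reader votes and my own criteria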
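Implementation
- Download: just scrape the download link with xpath, then write the file into the directory
To protect the site's server, I won't give the real URL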
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}
def rar_down(id, g: int, d: int):
    r = requests.get("http://baidu.php?id=" + str(id), headers=headers)
    html = etree.HTML(r.text)
    # Name handling: drop '《' and stop at '》' in the page title
    nd = html.xpath('/html/head/title/text()')[0]
    name = ""
    for x in nd:
        if x == '《':
            continue
        if x == '》':
            break
        name += x
    name = str(g) + '_' + str(d) + '_' + name + '.rar'
    # rar link: the slice offsets are specific to this site's markup
    dw = html.xpath('//span[@class="downfile"]')
    durl = etree.tostring(dw[0]).decode('utf-8')[32: -53]
    rar = requests.get(durl, headers=headers)
    with open('./' + name, 'wb') as code:
        code.write(rar.content)
    print(name + " down_complete!!!\n")
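The character loop and the fixed-offset slice in rar_down can also be written with regular expressions. What follows is only a sketch under assumptions: the helper names `extract_title`, `extract_href` and `build_filename` are hypothetical, and it assumes the `downfile` span wraps an `<a href="...">` link, which this post does not show. Unlike the loop, `extract_title` also drops any text that comes before 《.

```python
import re


def extract_title(page_title: str) -> str:
    """Return the text between 《 and 》, or the whole title if absent."""
    m = re.search(r'《(.+?)》', page_title)
    return m.group(1) if m else page_title.strip()


def extract_href(fragment: str):
    """Return the first href value in an HTML fragment, or None.

    Assumption: the download span contains an <a href="..."> element.
    """
    m = re.search(r'href="([^"]+)"', fragment)
    return m.group(1) if m else None


def build_filename(good: int, bad: int, title: str) -> str:
    """Mirror the post's naming scheme: <good>_<bad>_<title>.rar"""
    # Replace characters that are not allowed in Windows file names.
    safe = re.sub(r'[\\/:*?"<>|]', '_', title)
    return f"{good}_{bad}_{safe}.rar"


if __name__ == "__main__":
    title = extract_title('《Example Novel》 full text download')
    print(build_filename(120, 30, title))  # 120_30_Example Novel.rar
```

Matching on `href` keeps working if the site adds or removes an attribute around the link, whereas a fixed slice would silently cut the URL in the wrong place.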
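- Selection: while filtering by score I found that the scores are loaded dynamically by JS, so requests cannot see them; requests-html didn't work either. I asked a classmate for help, then suddenly remembered that selenium can handle dynamically loaded pages (though the run time will certainly go up)
Approach: reuse what I did for the school check-in script, selenium + headless mode + find_element, to fetch and process the scores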
def score_select(id):
    global tol
    # Run Chrome headless (no visible window)
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    # chrome_options.add_argument("--no-sandbox")
    bro = webdriver.Chrome(options=chrome_options)
    try:
        bro.get("http://baidu/post/" + str(id))
        sleep(1)  # give the JS-loaded scores time to appear
        if '不要返回吗?' in bro.page_source:
            return
        good = int(bro.find_element(by=By.ID, value='moodinfo0').text)
        bad = int(bro.find_element(by=By.ID, value='moodinfo4').text)
        if good + bad < 50:
            return
        if good < bad:  # same as good / bad < 1.0, but safe when bad == 0
            return
        rar_down(id, good, bad)
        tol += 1
    finally:
        bro.quit()  # always close the browser, even on an early return
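The selection rule can be pulled out of the browser code so it can be checked without launching Chrome. A minimal sketch, assuming the two thresholds used in score_select (at least 50 votes in total, and a good/bad ratio of at least 1); the names `parse_votes` and `is_worth_downloading` are hypothetical, and treating an empty or non-numeric score as 0 is my assumption, not something the site guarantees.

```python
def parse_votes(text: str) -> int:
    """Convert the text of a score element to an int; 0 if not a number."""
    text = text.strip()
    return int(text) if text.isdecimal() else 0


def is_worth_downloading(good: int, bad: int, min_votes: int = 50) -> bool:
    """Apply the post's rule: enough votes, and good/bad ratio >= 1."""
    if good + bad < min_votes:
        return False
    # Comparing directly avoids dividing by zero when bad == 0.
    return good >= bad


if __name__ == "__main__":
    print(is_worth_downloading(parse_votes("120"), parse_votes("30")))  # True
    print(is_worth_downloading(parse_votes("10"), parse_votes("5")))    # False
    print(is_worth_downloading(parse_votes("60"), parse_votes("")))     # True
```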
- Finally, the imported packages and the main function that drives the selection
# -*- coding: UTF-8 -*-
import random
from time import sleep

import requests
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
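tol = 0  # number of novels downloaded so far
if __name__ == '__main__':
    st = [False] * 20001  # st[i] is True once id i has been tried
    tried = 0
    # Stop at 100 downloads, or when every id has been tried
    while tol < 100 and tried < 20000:
        id = random.randint(1, 20000)
        if st[id]:
            continue
        st[id] = True
        tried += 1
        score_select(id)
    print("-----------------over------------")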
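The main loop draws random ids and skips the ones already tried, which is sampling without replacement. A sketch of the same idea with `random.sample`, which visits every id at most once and therefore always terminates; `collect` and its `accept` callback are hypothetical stand-ins for the loop and for score_select, and the even-id predicate is only a toy.

```python
import random
from typing import Callable, List


def collect(accept: Callable[[int], bool], max_id: int, target: int,
            seed=None) -> List[int]:
    """Visit ids 1..max_id in random order, each once; stop at `target` hits."""
    rng = random.Random(seed)
    picked = []
    for novel_id in rng.sample(range(1, max_id + 1), max_id):
        if len(picked) >= target:
            break
        if accept(novel_id):
            picked.append(novel_id)
    return picked


if __name__ == "__main__":
    # Toy predicate standing in for the real score check: accept even ids.
    result = collect(lambda i: i % 2 == 0, max_id=20, target=5, seed=0)
    print(len(result), all(i % 2 == 0 for i in result))  # 5 True
```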