
Scraping novels (editor's picks and completed-book rankings)

Published: 2022-11-03 15:45:18 · Views: 67
Tags: rankings, url, text, completed books, list, scraping, soup, fan, find


import requests
import bs4
import re
import pandas as pd
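import xlwt  # backs the '.xls' writer in pandas < 2.0; pandas 2.x dropped it, so write '.xlsx' (openpyxl) there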


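def heavy_recommendation():
    """Scrape the books in the first 'BJTJ_CONT Top1' list on the 17k.com
    completed-books page and export them to Excel, sorted by word count
    and by fan count."""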
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
    url = 'https://www.17k.com/quanben/'
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    list_total = []
    list1 = []
    list_word = []
    list_fan = []

    li1 = soup.find('ul', attrs={'class': 'BJTJ_CONT Top1'})
    li1_list = li1.find_all('li')
    for item in li1_list:
        url_book = item.find('a').get('href')
        url_book = url_book.replace('//', 'https://')
        url_book1 = url_book.replace('	', '')
        list1.append(url_book1)
    # Take at most 16 links; slicing avoids an IndexError if the list is shorter
    for url2 in list1[:16]:
        dict1 = {}
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

        response = requests.get(url=url2, headers=headers)
        response.encoding = 'utf-8'
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        reader = soup.find('em', attrs={'class': 'blue'}).text
        word = soup.find('em', attrs={'class': 'red'}).text
        # print(reader.text)
        # print(word.text)
        name1 = soup.find('h1')
        name2 = name1.find('a').text
        writer = soup.find('a', attrs={'class': 'name'}).text

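        fan = soup.find('span', attrs={'id': 'fansScore'}).text.strip()
        # '万' means ten thousand: '1.5万' is 15000 (string replacement would give 150000)
        fan = round(float(fan[:-1]) * 10000) if fan.endswith('万') else int(fan)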
        recommender = soup.find('span', attrs={'id': 'recommentCount'}).text
        # print(fan.text)
        # print(recommender.text)
        # print(writer.text)
        dict1['小说名字'] = name2
        dict1['作者'] = writer
        dict1['粉丝数'] = int(fan)
        dict1['阅读数'] = reader
        dict1['小说字数'] = int(word)
        dict1['推荐票数'] = recommender
        list_word.append(word)
        list_fan.append(fan)
        list_total.append(dict1)
    # Build the DataFrame once, after every book page has been scraped
    df = pd.DataFrame(list_total)

    df2 = df.sort_values(by=["小说字数"], ascending=[False], kind="stable")
    df3 = df.sort_values(by=["粉丝数"], ascending=[False], kind='stable')
    df2.to_excel('heavy_recommendation1.xls')
    df3.to_excel('heavy_recommendation2.xls')

    # print(df.head())
    # # print(df)

    # for i in range(0,15):
    #     flag = i
    #     for j in range(i + 1, 16):
    #         if int(list_fan[i]) < int(list_fan[j]):
    #             flag = j
    #             t = int(list_fan[i])
    #             list_fan[i] = int(list_fan[j])
    #             list_fan[j] = t


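def Great_potential():
    """Scrape the books in the second 'BJTJ_CONT Top1' list on the 17k.com
    completed-books page and export them to Excel, unsorted and sorted by
    word count and by fan count."""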
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
    url = 'https://www.17k.com/quanben/'
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    list_total = []
    list1 = []
    list_word = []
    list_fan = []

    li1 = soup.find_all('ul', attrs={'class': 'BJTJ_CONT Top1'})[1]
    li1_list = li1.find_all('li')
    for item in li1_list:
        url_book = item.find('a').get('href')
        url_book = url_book.replace('//', 'https://')
        url_book1 = url_book.replace('	', '')
        list1.append(url_book1)
    # Take at most 16 links; slicing avoids an IndexError if the list is shorter
    for url2 in list1[:16]:
        dict1 = {}
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

        response = requests.get(url=url2, headers=headers)
        response.encoding = 'utf-8'
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        reader = soup.find('em', attrs={'class': 'blue'}).text
        word = soup.find('em', attrs={'class': 'red'}).text
        name1 = soup.find('h1')
        name2 = name1.find('a').text
        # print(reader.text)
        # print(word.text)
        writer = soup.find('a', attrs={'class': 'name'}).text
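        fan = soup.find('span', attrs={'id': 'fansScore'}).text.strip()
        # '万' means ten thousand: '1.5万' is 15000 (string replacement would give 150000)
        fan2 = round(float(fan[:-1]) * 10000) if fan.endswith('万') else int(fan)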
        recommender = soup.find('span', attrs={'id': 'recommentCount'}).text
        # print(fan.text)
        # print(recommender.text)
        # print(writer.text)
        dict1['小说名字'] = name2
        dict1['阅读数'] = reader
        dict1['小说字数'] = int(word)
        dict1['作者'] = writer
        dict1['粉丝数'] = int(fan2)
        dict1['推荐票数'] = recommender
        list_total.append(dict1)
    df = pd.DataFrame(list_total)
    # print(df)
    # print("over!-----------------------------------------------------------------")
    df.to_excel('Great_potential.xls')

    df2 = df.sort_values(by=["小说字数"], ascending=[False], kind="stable")
    df3 = df.sort_values(by=["粉丝数"], ascending=[False], kind='stable')
    df2.to_excel('Great_potential11.xls')
    df3.to_excel('Great_potential12.xls')


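def Boys_finished_the_book():
    """Scrape the first 100 books linked with class 'red' on the finished-book
    top-100 page and export them to Excel, sorted by word count and by fan count."""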
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
    url = 'https://www.17k.com/top/refactor/top100/18_popularityListScore/18_popularityListScore_finishBook_top_100_pc.html?TabIndex=1&typeIndex=0'
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    list_total = []
    list1 = []
    list_word = []
    list_fan = []

    li1 = soup.find_all('a', attrs={'class': 'red'})

    for item in li1:
        url_book = item.get('href')
        url_book1 = url_book.replace('//', 'https://')
        list1.append(url_book1)
    # Take at most 100 links; slicing avoids an IndexError if fewer were found
    for url2 in list1[:100]:
        dict1 = {}
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

        response = requests.get(url=url2, headers=headers)
        response.encoding = 'utf-8'
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        reader = soup.find('em', attrs={'class': 'blue'}).text
        word = soup.find('em', attrs={'class': 'red'}).text
        # print(reader.text)
        # print(word.text)
        name = soup.find('a', attrs={'class': 'red'}).text
        writer = soup.find('a', attrs={'class': 'name'}).text
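        fan = soup.find('span', attrs={'id': 'fansScore'}).text.strip()
        # '万' means ten thousand: '1.5万' is 15000 (string replacement would give 150000)
        fan = round(float(fan[:-1]) * 10000) if fan.endswith('万') else int(fan)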
        recommender = soup.find('span', attrs={'id': 'recommentCount'}).text
        # print(fan.text)
        # print(recommender.text)
        # print(writer.text)
        dict1['小说名称'] = name
        dict1['阅读数'] = reader
        dict1['小说字数'] = int(word)
        dict1['作者'] = writer
        dict1['粉丝数'] = int(fan)
        dict1['推荐票数'] = recommender
        list_total.append(dict1)
    df = pd.DataFrame(list_total)
    df2 = df.sort_values(by=["小说字数"], ascending=[False], kind="stable")
    df3 = df.sort_values(by=["粉丝数"], ascending=[False], kind='stable')
    df2.to_excel('Boys_finished_the_book1.xls')
    df3.to_excel('Boys_finished_the_book2.xls')


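def Girls_finished_the_book():
    """Scrape the first 100 books linked with class 'red' on the ranking page
    and export them to Excel, sorted by word count and by fan count.

    NOTE: the URL below is identical to the one used in Boys_finished_the_book
    and Finish_this_list, so all three functions read the same ranking page."""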
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
    url = 'https://www.17k.com/top/refactor/top100/18_popularityListScore/18_popularityListScore_finishBook_top_100_pc.html?TabIndex=1&typeIndex=0'
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    list_total = []
    list1 = []
    list_word = []
    list_fan = []

    li1 = soup.find_all('a', attrs={'class': 'red'})

    for item in li1:
        url_book = item.get('href')
        url_book1 = url_book.replace('//', 'https://')
        list1.append(url_book1)
    # Take at most 100 links; slicing avoids an IndexError if fewer were found
    for url2 in list1[:100]:
        dict1 = {}
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

        response = requests.get(url=url2, headers=headers)
        response.encoding = 'utf-8'
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        reader = soup.find('em', attrs={'class': 'blue'}).text
        word = soup.find('em', attrs={'class': 'red'}).text
        # print(reader.text)
        # print(word.text)
        name = soup.find('a', attrs={'class': 'red'}).text
        writer = soup.find('a', attrs={'class': 'name'}).text
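        fan = soup.find('span', attrs={'id': 'fansScore'}).text.strip()
        # '万' means ten thousand: '1.5万' is 15000 (string replacement would give 150000)
        fan = round(float(fan[:-1]) * 10000) if fan.endswith('万') else int(fan)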
        recommender = soup.find('span', attrs={'id': 'recommentCount'}).text
        # print(fan.text)
        # print(recommender.text)
        # print(writer.text)
        dict1['小说名称'] = name
        dict1['阅读数'] = reader
        dict1['小说字数'] = int(word)
        dict1['作者'] = writer
        dict1['粉丝数'] = int(fan)
        dict1['推荐票数'] = recommender
        list_total.append(dict1)
    df = pd.DataFrame(list_total)

    df2 = df.sort_values(by=["小说字数"], ascending=[False], kind="stable")
    df3 = df.sort_values(by=["粉丝数"], ascending=[False], kind='stable')
    # Write to the girls' own files instead of overwriting the boys' output
    df2.to_excel('Girls_finished_the_book1.xls')
    df3.to_excel('Girls_finished_the_book2.xls')


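def Finish_this_list():
    """Scrape the first 100 books linked with class 'red' on the finished-book
    top-100 page and export them to Excel, sorted by word count and by fan count."""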
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
    url = 'https://www.17k.com/top/refactor/top100/18_popularityListScore/18_popularityListScore_finishBook_top_100_pc.html?TabIndex=1&typeIndex=0'
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    list_total = []
    list1 = []
    list_word = []
    list_fan = []

    li1 = soup.find_all('a', attrs={'class': 'red'})

    for item in li1:
        url_book = item.get('href')
        url_book1 = url_book.replace('//', 'https://')
        list1.append(url_book1)
    # Take at most 100 links; slicing avoids an IndexError if fewer were found
    for url2 in list1[:100]:
        dict1 = {}
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}

        response = requests.get(url=url2, headers=headers)
        response.encoding = 'utf-8'
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        reader = soup.find('em', attrs={'class': 'blue'}).text
        word = soup.find('em', attrs={'class': 'red'}).text
        # print(reader.text)
        # print(word.text)
        writer = soup.find('a', attrs={'class': 'name'}).text
        name = soup.find('a', attrs={'class': 'red'}).text
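        fan = soup.find('span', attrs={'id': 'fansScore'}).text.strip()
        # '万' means ten thousand: '1.5万' is 15000 (string replacement would give 150000)
        fan = round(float(fan[:-1]) * 10000) if fan.endswith('万') else int(fan)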
        recommender = soup.find('span', attrs={'id': 'recommentCount'}).text
        # print(fan.text)
        # print(recommender.text)
        # print(writer.text)
        dict1['小说名字'] = name
        dict1['作者'] = writer
        dict1['粉丝数'] = int(fan)
        dict1['阅读数'] = reader
        dict1['小说字数'] = int(word)
        dict1['推荐票数'] = recommender

        list_total.append(dict1)
    df = pd.DataFrame(list_total)
    # print(df)
    # print("over!-----------------------------------------------------------------")
    df2 = df.sort_values(by=["小说字数"], ascending=[False], kind="stable")
    df3 = df.sort_values(by=["粉丝数"], ascending=[False], kind='stable')
    df2.to_excel('Finish_this_list1.xls')
    df3.to_excel('Finish_this_list2.xls')


if __name__ == '__main__':
    li = []
    heavy_recommendation()
    Great_potential()
    Girls_finished_the_book()
    Finish_this_list()
    Boys_finished_the_book()
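
Every function in the script repeats two small conversions: turning a protocol-relative href such as `//www.17k.com/...` into a full URL, and turning a fan count that may end in `万` (ten thousand) into an integer. Below is a minimal sketch of how these could be factored into helpers. The names `parse_count`, `absolute_url` and `BASE_URL` are hypothetical and not part of the original script, and the sketch assumes counts look like `3456` or `1.5万`.

```python
from decimal import Decimal
from urllib.parse import urljoin

# Assumed base for resolving relative links; not defined in the original script.
BASE_URL = 'https://www.17k.com/'


def parse_count(text: str) -> int:
    """Convert a count such as '3456' or '1.5万' to an int."""
    text = text.strip()
    if text.endswith('万'):
        # '万' is a multiplier of 10,000; Decimal avoids float rounding surprises.
        return int(Decimal(text[:-1]) * 10000)
    return int(text)


def absolute_url(href: str, base: str = BASE_URL) -> str:
    """Resolve a relative or protocol-relative href against `base`."""
    # strip() removes stray tabs or newlines around the href before joining.
    return urljoin(base, href.strip())


if __name__ == '__main__':
    print(parse_count('1.5万'))
    print(absolute_url('//www.17k.com/book/1.html'))
```

Using `urljoin` rather than `str.replace('//', 'https://')` leaves a link that already starts with `https://` untouched.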






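After a successful run, the script writes its results as .xls files in the working directory: for each list, one file sorted by word count and one sorted by fan count.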
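
The five functions differ only in the listing page, the number of books taken, and the output file names, so the shared loop could be written once. The following is a rough sketch under stated assumptions, not the author's method: `collect_books` and `sort_desc` are hypothetical names, and `fetch_book` stands for any callable that downloads one book page and returns a dict like `dict1` in the script. It is passed in as a parameter so the loop can be tried without network access.

```python
from typing import Callable, Dict, Iterable, List


def collect_books(
    urls: Iterable[str],
    fetch_book: Callable[[str], Dict[str, object]],
    limit: int,
) -> List[Dict[str, object]]:
    """Fetch at most `limit` book records, skipping pages that fail to parse."""
    records = []
    for url in list(urls)[:limit]:
        try:
            records.append(fetch_book(url))
        except (AttributeError, ValueError) as exc:
            # soup.find(...) returns None when a tag is missing, so `.text`
            # raises AttributeError; int() raises ValueError on unexpected text.
            print(f'skipped {url}: {exc}')
    return records


def sort_desc(records: List[Dict[str, object]], key: str) -> List[Dict[str, object]]:
    """Return the records sorted by `key`, largest first (the sort is stable)."""
    return sorted(records, key=lambda record: record[key], reverse=True)


if __name__ == '__main__':
    # Stand-in fetcher with made-up data, used only to exercise the loop.
    def fake_fetch(url: str) -> Dict[str, object]:
        return {'url': url, '小说字数': len(url)}

    books = collect_books(['a', 'ccc', 'bb'], fake_fetch, limit=2)
    print(sort_desc(books, '小说字数'))
```

With this shape, one bad book page is skipped and reported instead of aborting the whole run, and the resulting list of dicts can still be passed to `pd.DataFrame` as in the script.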

From: https://www.cnblogs.com/JK8395/p/16854674.html
