
Scraping Biquge (笔趣阁) with Rust to generate an EPUB file

Date: 2024-06-20 17:32:45
Tags: chapter, use, unwrap, self, scraping, href, let, Biquge, rust

A quick look at EPUB, since reading plain txt is never pleasant. I will polish the EPUB styling later.

Cargo.toml
[package]
name = "bqg_epub"
version = "0.1.0"
edition = "2021"

[dependencies]
epub-builder = "0.7.4"
reqwest = {version = "0.12.5",features = ["blocking","json"]}
tokio = {version = "1.38.0",features = ["full"]}
scraper ="0.19.0"
rand = { version = "0.8.5", features = ["default"] }
url = "2.5.2"
clap = {version = "4.5.7",features = ["derive"]}
main.rs
use std::cmp::Ordering;

use std::fs::{File, OpenOptions};
use epub_builder::EpubBuilder;
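// One boxed error type so `?` works for reqwest, url, io and epub-builder errors alike.
type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;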
use epub_builder::ZipLibrary;
use epub_builder::EpubContent;
use epub_builder::ReferenceType;

use std::io::Write;
use std::path::Path;
use std::{fs, io};
use std::time::Duration;
use clap::Parser;
use reqwest::{Client, Url};
use scraper::{Html, Selector};
use rand::{Rng};


#[derive(Debug)]
struct Book {
    title: String,
    homepage: String,
    intro: String,
    author: String,

    chapters: Vec<Chapter>,
}

impl Book {
    fn new(homepage: &str) -> Self {
        Self {
            title: String::default(),
            author: String::default(),
            intro: String::default(),
            chapters: Vec::new(),
            homepage: homepage.to_string(),
        }
    }
    async fn get_book_info(&mut self, text: &str) -> Result<()> {
        let document = Html::parse_document(text);
        let chapter_selector = Selector::parse("#list > dl > dd > a").unwrap();
        let author_selector = Selector::parse("#info > p:nth-child(2) > a").unwrap();
        let intro_selector = Selector::parse("#intro").unwrap();
        let title_selector = Selector::parse("#info > h1").unwrap();

        // These unwraps panic if the page layout does not match the selectors above.
        self.author = document.select(&author_selector).next().unwrap().text().collect::<Vec<_>>().join(" ");
        self.intro = document.select(&intro_selector).next().unwrap().text().collect::<Vec<_>>().join(" ");
        self.title = document.select(&title_selector).next().unwrap().text().collect::<Vec<_>>().join(" ");

        for element in document.select(&chapter_selector) {
            if let Some(href) = element.value().attr("href") {
                let title = element.text().collect::<Vec<_>>().join(" ");

                // add_chapter skips hrefs that are already in the list.
                self.add_chapter(Chapter::new(href, &title));
            }
        }
        // Sort in place by the chapter number parsed from each href.
        self.chapters.sort();
        Ok(())
    }

    fn add_chapter(&mut self,chapter: Chapter){

        if !self.chapters.iter().any(|c| c.href == chapter.href){
            self.chapters.push(chapter)
        }
    }


    fn generate_epub(&self) -> Result<()> {
        // Placeholder bytes, not a real PNG; replace with real image data for a usable cover.
        let dummy_image = "Not really a PNG image";
        let mut output = File::create(format!("{}.epub", "test")).unwrap();

        let zip_lib = ZipLibrary::new()?;
        // Create a new EpubBuilder using the zip library
        let mut builder = EpubBuilder::new(zip_lib)?;
        builder
            // Set some metadata
            .metadata("author", &self.author)?
            .metadata("title", &self.title)?
            .add_cover_image("cover.png", dummy_image.as_bytes(), "image/png")?
            // Add a resource that is not part of the linear document structure
            .add_resource("some_image.png", dummy_image.as_bytes(), "image/png")?;

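        for chapter in self.chapters.iter() {
            // Chapters are regular body text, so mark them as Text rather than TitlePage.
            // Note: the content is still plain text here, while EPUB content documents are expected to be XHTML.
            builder.add_content(EpubContent::new(&chapter.href, chapter.content.as_bytes())
                .title(&chapter.title)
                .reftype(ReferenceType::Text))?;
        }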

        builder.inline_toc()
            // Finally, write the EPUB file to a writer. It could be a `Vec<u8>`, a file,
            // `stdout` or whatever you like, it just needs to implement the `std::io::Write` trait.
            .generate(&mut output)?;


        Ok(())
    }
}

// const BASE_URL: &str = "https://www.xbiqugew.com/book/53099/";
#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
#[command(next_line_help = true)]
struct Args {
    /// URL of the book's index page; chapter links are resolved against it
    #[arg(short, long)]
    url: String,
}


#[tokio::main]
async fn main() -> Result<()> {
    let args = Args::parse();
    let client = Client::builder()
        .build()?;
    let html = query_book_homepage(&client, &args.url).await.unwrap();

    let mut book = Book::new(&args.url);
    book.get_book_info(&html).await?;

    for chapter in book.chapters.iter_mut() {
        println!("{}  | {}", chapter.href, chapter.title);
        let delay = random_delay();
        println!("Waiting for {} milliseconds before the next request...", delay.as_millis());
        tokio::time::sleep(delay).await;
        chapter.scraper_chapter_content(&book.homepage, &client).await.unwrap()
    }
    book.generate_epub().unwrap();

    println!("{:?}",book);
    Ok(())
}

/// Fetch the book's index page and return its HTML.
async fn query_book_homepage(client: &Client, homepage: &str) -> Result<String> {
    let html = client.get(homepage).send().await?.text().await?;
    println!("scraper homepage: {} done!", homepage);
    Ok(html)
}

#[derive(Eq,Debug)]
struct Chapter {
    number: usize,
    href: String,
    title: String,
    content: String,
}

impl Chapter {
    fn new(href: &str, title: &str) -> Self {
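        // Derive the chapter number from the file stem, e.g. "12345.html" -> 12345; fall back to 0 if it is not numeric.
        let number = href.split('.').next().unwrap_or("0").parse::<usize>().unwrap_or(0);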
        Self {
            number,
            href: href.to_string(),
            title: title.to_string(),
            content: String::default(),
        }
    }

    async fn scraper_chapter_content(&mut self, base_url: &str, client: &Client) -> Result<()> {
        // let v = (rand::random::<f64>() * 5000.) as u64 ;
        //
        // let sleep_time = std::time::Duration::from_millis(v);
        let base_url = Url::parse(base_url)?;
        let joined_url = base_url.join(&self.href)?;

        println!("now visited: {}", joined_url);

        let page = client.get(joined_url).send().await?.text().await?;
        let document = Html::parse_document(&page);
        let content_selector = Selector::parse("#content").unwrap();

        let content = match document.select(&content_selector).next() {
            Some(e) => {
                e.text().collect::<Vec<_>>().join("\r\n")
            }
            None => { "this chapter may have no content or an error occur".to_string() }
        };

        // Also keep a plain-text copy of the chapter under books/<number>.txt.
        let file_name = format!("books/{}.txt", self.number);
        let dir_path = Path::new(&file_name).parent().unwrap(); // Get the directory part of the file path

        check_and_create_directory(dir_path)?;
        let mut file = OpenOptions::new()
            .write(true)
            .create(true)
            .truncate(true) // overwrite stale content from an earlier run
            .open(file_name).unwrap();

        let cleaned = replace_html_entities(&content);
        // write_all keeps writing until the whole buffer is written; write may stop early.
        file.write_all(cleaned.as_bytes()).unwrap();

        self.content = cleaned;
        Ok(())
    }
}

impl PartialEq for Chapter {
    fn eq(&self, other: &Self) -> bool {
        self.href == other.href
    }
}

impl PartialOrd for Chapter {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for Chapter {
    fn cmp(&self, other: &Self) -> Ordering {
        self.number.cmp(&other.number)
    }
}


fn check_and_create_directory(dir_path: &Path) -> io::Result<()> {
    if !dir_path.exists() {
        println!("Directory does not exist. Creating directory: {:?}", dir_path);
        fs::create_dir_all(dir_path)?; // Create the directory and any missing parent directories
    } else {
        println!("Directory already exists: {:?}", dir_path);
    }
    Ok(())
}

fn random_delay() -> Duration {
    let mut rng = rand::thread_rng();
    let millis = rng.gen_range(500..=2000); // Random delay between 500ms and 2000ms, inclusive
    Duration::from_millis(millis)
}

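fn replace_html_entities(s: &str) -> String {
    // Strip non-breaking spaces, both as a literal entity and as the decoded U+00A0 character.
    s.replace("&nbsp;", "")
        .replace('\u{a0}', "")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        // Decode "&amp;" last so that "&amp;lt;" becomes "&lt;" rather than "<".
        .replace("&amp;", "&")
    // Add more replacements as needed
}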
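
The chapter text above is handed to the EPUB builder as plain text, while EPUB content documents are normally XHTML. The following is a minimal sketch of one way to wrap a chapter before adding it. `escape_xml` and `chapter_to_xhtml` are hypothetical helpers introduced here for illustration; they are not part of epub-builder or of the original program.

```rust
/// Escape the characters that are special in XML text.
/// (Hypothetical helper, not part of any crate used above.)
fn escape_xml(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Wrap plain chapter text in a minimal XHTML page, one <p> per non-empty line.
/// (Hypothetical helper; the page layout is an assumption, adjust as needed.)
fn chapter_to_xhtml(title: &str, text: &str) -> String {
    let paragraphs: String = text
        .lines()
        .map(str::trim) // also strips the indentation spaces novels use
        .filter(|line| !line.is_empty())
        .map(|line| format!("    <p>{}</p>\n", escape_xml(line)))
        .collect();
    format!(
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n\
         <html xmlns=\"http://www.w3.org/1999/xhtml\">\n\
         <head><title>{title}</title></head>\n\
         <body>\n    <h1>{title}</h1>\n{paragraphs}</body>\n</html>\n",
        title = escape_xml(title),
        paragraphs = paragraphs
    )
}
```

In `generate_epub`, the result of `chapter_to_xhtml(&chapter.title, &chapter.content)` could then be passed to `EpubContent::new` as bytes in place of the raw `chapter.content`.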

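`Chapter::new` takes the text before the first `.` of the href as the chapter number, which only works when the link is a bare file name such as `12345.html`. As a sketch, assuming the site may also emit links with a path prefix (for example `/book/53099/12345.html`), a more tolerant parser could look like this. `chapter_number` is a hypothetical helper name, and the href shapes are assumptions rather than something confirmed for a particular site.

```rust
/// Extract a numeric chapter id from hrefs such as "12345.html" or
/// "/book/53099/12345.html". Returns None when the file stem is not a number.
/// (Hypothetical helper for illustration.)
fn chapter_number(href: &str) -> Option<usize> {
    // Last path segment, e.g. "12345.html".
    let file = href.rsplit('/').next()?;
    // File stem before the first dot, e.g. "12345".
    let stem = file.split('.').next()?;
    stem.parse().ok()
}
```

Returning an `Option` leaves the decision to the caller, for example `chapter_number(href).unwrap_or(0)` to sort unnumbered links first, or skipping such links entirely.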
From: https://www.cnblogs.com/itachilee/p/18259103
