
Crawling Beijing Municipal Government Letter Content with WebMagic


I created a Letter class to hold each letter, overrode LetterFilePipeline so that every crawled page is saved to a file named after the letter's ID, crawled with multiple threads, and finally wrote the merged results to the letters directory.
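Condensed, the crawl wiring is a standard WebMagic Spider with the custom processor and pipeline plugged in; the full version, including how the URL list is built, is in LetterMain below.

Spider.create(new LetterProcess())   // extracts the fields of one detail page
        .addUrl(urls)                // one URL per letter, built from the list endpoint
        .thread(50)                  // multithreaded crawl
        .addPipeline(new LetterFilePipeline("D:\\JavaProject\\Lab\\LetterProject\\src\\data\\letter_json"))
        .run();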

Letter

package org.example.crawler_letter;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

// One letter record: the list fields (originalId .. showOrgNames) come from the list endpoint,
// the remaining detail fields are filled in from the crawled detail page.
@Data
@AllArgsConstructor
@NoArgsConstructor
public class Letter {
    private String originalId;
    private String letterType;
    private String letterTypeName;
    private String letterTitle;
    private String showLetterTitle;
    private String writeDate;
    private String orgNames;
    private String showOrgNames;
    private String writeName;
    private String answerDate;
    private String question;
    private String answer;
}

LetterFilePipeline

package org.example.crawler_letter;

import com.alibaba.fastjson.JSON;
import lombok.SneakyThrows;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import java.io.FileWriter;
import java.io.PrintWriter;

/**
 * Extends WebMagic's JsonFilePipeline so that every page's results are written to
 * <path>/<domain>/<originalId>.json instead of the default auto-generated file name.
 */
public class LetterFilePipeline extends JsonFilePipeline {
    public LetterFilePipeline(String path) {
        super(path);
    }

    @SneakyThrows
    @Override
    public void process(ResultItems resultItems, Task task) {
        // task.getUUID() resolves to the crawled domain here, so files land under .../www.beijing.gov.cn/
        String path = this.path + PATH_SEPERATOR + task.getUUID() + PATH_SEPERATOR;
        // "originalId" is set in LetterProcess and doubles as the file name
        String fileName = resultItems.get("originalId");
        PrintWriter printWriter = new PrintWriter(new FileWriter(this.getFile(path + fileName + ".json")));
        printWriter.write(JSON.toJSONString(resultItems.getAll()));
        printWriter.close();
    }
}

LetterProcess

package org.example.crawler_letter;

import org.jsoup.nodes.Document;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import java.util.concurrent.atomic.AtomicInteger;

public class LetterProcess implements PageProcessor {
    // Crawl configuration: page charset and a desktop browser User-Agent.
    private final Site site = Site.me()
            .setCharset("UTF-8")
            .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.188");
    // Progress counter; AtomicInteger because 50 threads call process() concurrently.
    private static final AtomicInteger num = new AtomicInteger();

    @Override
    public void process(Page page) {
        System.out.println(page.getUrl());
        System.out.println(num.incrementAndGet());
        // The letter id is the value of the originalId query parameter in the detail URL.
        String url = String.valueOf(page.getUrl());
        page.putField("originalId", url.substring(url.lastIndexOf("=") + 1));
        Document doc = page.getHtml().getDocument();
        page.putField("letterTitle", doc.select("strong").first().text());
        // substring(4) strips the 4-character label in front of the sender name.
        page.putField("writeName", doc.getElementsByClass("col-xs-10 col-lg-3 col-sm-3 col-md-4 text-muted").text().substring(4));
        page.putField("question", doc.getElementsByClass("col-xs-12 col-md-12 column p-2 text-muted mx-2").text());
        page.putField("answer", doc.getElementsByClass("col-xs-12 col-md-12 column p-4 text-muted my-3").text());
        // substring(5) strips the 5-character label in front of the reply date.
        page.putField("answerDate", doc.getElementsByClass("col-xs-12 col-sm-3 col-md-3 my-2").text().substring(5));
    }

    @Override
    public Site getSite() {
        return site;
    }
}
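The Jsoup lookups above assume every block exists on both detail-page templates; if one is missing, first() or substring() throws and that page's fields are never stored. A small null-safe helper is one way to soften that (hypothetical, not part of the original code):

// Hypothetical helper: text of the first element carrying the given class combination, or "" when absent.
private static String textByClass(Document doc, String className) {
    org.jsoup.select.Elements els = doc.getElementsByClass(className);
    return els.isEmpty() ? "" : els.first().text();
}

The writeName line, for example, would then read textByClass(doc, "col-xs-10 col-lg-3 col-sm-3 col-md-4 text-muted"), with the substring applied only when the result is long enough.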

LetterMain

package org.example.crawler_letter;

import lombok.SneakyThrows;
import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.node.ArrayNode;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import us.codecraft.webmagic.Spider;

import java.io.File;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LetterMain {

    // The list endpoint returns a JavaScript-style object literal (bare keys, single quotes),
    // so patch it into valid JSON before handing it to Jackson.
    private static String get_json(String s){
        s=s.replace("page:","\"page\":");
        s=s.replace("pageNo:","\"pageNo\":");
        s=s.replace("totalCount:","\"totalCount\":");
        s=s.replace("totalPages:","\"totalPages\":");
        s=s.replace("pageSize:","\"pageSize\":");
        s=s.replace("result:","\"result\":");
        s=s.replace("originalId:","\"originalId\":");
        s=s.replace("letterType:","\"letterType\":");
        s=s.replace("letterTypeName:","\"letterTypeName\":");
        s=s.replace("letterTitle:","\"letterTitle\":");
        s=s.replace("showLetterTitle:","\"showLetterTitle\":");
        s=s.replace("writeDate:","\"writeDate\":");
        s=s.replace("orgNames:","\"orgNames\":");
        s=s.replace("showOrgNames:","\"showOrgNames\":");
        s=s.replace("\'","\"");
        return s;
    }
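    // Alternative sketch (not used above): Jackson itself can accept the endpoint's relaxed syntax
    // (bare keys, single quotes), which avoids the risk of the replace() calls touching values that
    // happen to contain a quote or one of the replaced keys. Mapper and method name are illustrative.
    private static JsonNode readRelaxedJson(String s) throws java.io.IOException {
        ObjectMapper lenientMapper = new ObjectMapper();
        lenientMapper.configure(org.codehaus.jackson.JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES, true);
        lenientMapper.configure(org.codehaus.jackson.JsonParser.Feature.ALLOW_SINGLE_QUOTES, true);
        return lenientMapper.readTree(s);
    }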
    @SneakyThrows
    public static void main(String[] args) {
        // First request: pageSize=0 is only used to read page.totalCount from the paging info.
        Document start_page = Jsoup.parse(new URL("https://www.beijing.gov.cn/hudong/hdjl/sindex/bjah-index-hdjl!letterListJson.action?keyword=&startDate=&endDate=&letterType=0&page.pageNo=1&page.pageSize=0&orgtitleLength=26"), 30000);
        String json_start=start_page.text();
        json_start=get_json(json_start);
        ObjectMapper objectMapper=new ObjectMapper();
        JsonNode jsonNode=objectMapper.readTree(json_start);
        String num= String.valueOf(jsonNode.get("page").get("totalCount")).replace("\"","");
        // Second request: pageSize=totalCount pulls the whole letter list in one response (30 s timeout).
        Document end_page=Jsoup.parse(new URL("https://www.beijing.gov.cn/hudong/hdjl/sindex/bjah-index-hdjl!letterListJson.action?keyword=&startDate=&endDate=&letterType=0&page.pageNo=1&page.pageSize="+num+"&orgtitleLength=26"),30000);
        String json_end=end_page.text();
        json_end=get_json(json_end);
        jsonNode=objectMapper.readTree(json_end);
        // Turn every entry of "result" into a Letter; the detail fields are filled in after the crawl.
        List<Letter> letters=new ArrayList<>();
        ArrayNode arrayNode= (ArrayNode) jsonNode.get("result");
        for(JsonNode i:arrayNode){
            Letter letter=new Letter();
            letter.setOriginalId(i.get("originalId").toString().replace("\"",""));
            letter.setLetterType(i.get("letterType").toString().replace("\"",""));
            letter.setLetterTypeName(i.get("letterTypeName").toString().replace("\"",""));
            letter.setLetterTitle(i.get("letterTitle").toString().replace("\"",""));
            letter.setShowLetterTitle(i.get("showLetterTitle").toString().replace("\"",""));
            letter.setWriteDate(i.get("writeDate").toString().replace("\"",""));
            letter.setOrgNames(i.get("orgNames").toString().replace("\"",""));
            letter.setShowOrgNames(i.get("showOrgNames").toString().replace("\"",""));
            letters.add(letter);
        }
        // Consultations ("咨询") use the consultDetail page; every other letter type uses suggesDetail.
        List<String> urlList=new ArrayList<>();
        for(Letter i:letters){
            if(i.getLetterTypeName().equals("咨询")) {
                urlList.add("https://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId="+i.getOriginalId());
            }
            else {
                urlList.add("https://www.beijing.gov.cn/hudong/hdjl/com.web.suggest.suggesDetail.flow?originalId="+i.getOriginalId());
            }
        }
        String[] urls=urlList.toArray(new String[0]);
        // Crawl every detail page with 50 threads; the pipeline writes one <originalId>.json per page.
        Spider spider = Spider.create(new LetterProcess());
        spider.addUrl(urls);
        spider.thread(50);
        spider.addPipeline(new LetterFilePipeline("D:\\JavaProject\\Lab\\LetterProject\\src\\data\\letter_json"));
        spider.run();
        // Sort by originalId so the list lines up with the crawled files on disk.
        letters.sort(Comparator.comparing(Letter::getOriginalId));
        // Merge the crawled detail fields back into the sorted list; listFiles() gives no ordering
        // guarantee, so sort the files by name (originalId) to match the order of the letters.
        File[] files = new File("D:\\JavaProject\\Lab\\LetterProject\\src\\data\\letter_json\\www.beijing.gov.cn").listFiles();
        Arrays.sort(files, Comparator.comparing(File::getName));
        for(int i=0;i<files.length;i++){
            File file=files[i];
            Letter letter=objectMapper.readValue(file,Letter.class);
            letters.get(i).setLetterTitle(letter.getLetterTitle());
            letters.get(i).setWriteName(letter.getWriteName());
            letters.get(i).setQuestion(letter.getQuestion());
            letters.get(i).setAnswer(letter.getAnswer());
            letters.get(i).setAnswerDate(letter.getAnswerDate());
        }
        // Write one merged <originalId>.json per letter into the letters directory.
        for(Letter i:letters){
            File outputFile = new File("D:\\JavaProject\\Lab\\LetterProject\\src\\data\\letters\\"+i.getOriginalId()+".json");
            objectMapper.writeValue(outputFile, i);
        }
    }
}
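One more caveat: the merge loop pairs the i-th sorted letter with the i-th file, so a single detail page that failed to download shifts every later pairing. A more defensive sketch (hypothetical, using the same paths and mapper as above) keys the lookup on the file name instead:

// Sketch: match each letter to its JSON file by originalId rather than by position,
// so a missing file only leaves that one letter without its detail fields.
File dir = new File("D:\\JavaProject\\Lab\\LetterProject\\src\\data\\letter_json\\www.beijing.gov.cn");
for (Letter l : letters) {
    File f = new File(dir, l.getOriginalId() + ".json");
    if (!f.exists()) continue;
    Letter detail = objectMapper.readValue(f, Letter.class);
    l.setLetterTitle(detail.getLetterTitle());
    l.setWriteName(detail.getWriteName());
    l.setQuestion(detail.getQuestion());
    l.setAnswer(detail.getAnswer());
    l.setAnswerDate(detail.getAnswerDate());
}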

 

From: https://www.cnblogs.com/liyiyang/p/17607206.html
