
Crawler for Fetching Province / City / County / Town Data

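The class below crawls the 2021 administrative division codes published by the National Bureau of Statistics (stats.gov.cn). It loads each page with HtmlUnit, parses the rendered HTML with jsoup, and walks the hierarchy level by level: provinces, then cities, then counties, then towns. Every entry is collected into a City object together with its division code and the URL of the page for the next level. jsoup and HtmlUnit need to be on the classpath.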

 

package com.mock.utils;

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebClientOptions;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.justsy.army.mgt.mock.model.City;

public class NationalBureauOfStatics {
    private static final String ADDRESS = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2021/";
    private static final String fix = ".html";

    public static void main(String[] args) {
        List<City> provinceList = new ArrayList<>();
        List<City> cityList = new ArrayList<>();
        List<City> countyList = new ArrayList<>();
        List<City> townList = new ArrayList<>();
        // Crawl level by level: provinces -> cities -> counties -> towns
        provinceList = getTVMall(provinceList, new City(), ADDRESS, 0);
        for (City city : provinceList) {
            cityList = getTVMall(cityList, city, city.getHtmlAddr(), 1);
        }
        for (City city : cityList) {
            countyList = getTVMall(countyList, city, city.getHtmlAddr(), 2);
        }
        for (City city : countyList) {
            townList = getTVMall(townList, city, city.getHtmlAddr(), 3);
        }

        for (City city : townList) {
            System.out.println(city.toString());
        }
    }

    /**
     * Fetches one page and collects its division entries into the given list.
     * type: 0 = province, 1 = city, 2 = county, 3 = town.
     */
    public static List<City> getTVMall(List<City> list, City city, String address, int type) {
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        // Configure the WebClient options
        WebClientOptions clientOptions = webClient.getOptions();
        clientOptions.setJavaScriptEnabled(true);
        clientOptions.setCssEnabled(false);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        clientOptions.setTimeout(35000);
        clientOptions.setThrowExceptionOnScriptError(false);
        try {
            HtmlPage htmlPage = webClient.getPage(address);
            Document dom = Jsoup.parse(htmlPage.asXml());
            Elements ele = null;
            // Each level of the statistics site uses a different table class
            if (type == 0) {
                ele = dom.getElementsByClass("provincetable");
            } else if (type == 1) {
                ele = dom.getElementsByClass("citytable");
            } else if (type == 2) {
                ele = dom.getElementsByClass("countytable");
            } else if (type == 3) {
                ele = dom.getElementsByClass("towntable");
            }
            if (ele != null && !ele.isEmpty()) {
                dom = Jsoup.parse(ele.toString());
                Elements rows = dom.getElementsByTag("tr");
                getList(list, rows, city, type);
            }
        } catch (FailingHttpStatusCodeException e) {
            e.printStackTrace();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the WebClient resources after each page
            webClient.close();
        }
        return list;
    }

    private static List<City> getList(List<City> list, Elements ele, City city, int type) {
        if (type == 0) {
            // Province table: skip the header rows; every <a> is a province link such as "11.html"
            for (int i = 3; i < ele.size(); i++) {
                Element item = ele.get(i);
                Elements aElements = item.getElementsByTag("a");
                for (int j = 0; j < aElements.size(); j++) {
                    City c = new City();
                    String html = aElements.get(j).attr("href");
                    String name = aElements.get(j).text();
                    c.setProvince(name);
                    c.setHtmlAddr(ADDRESS + html);
                    // Pad the two-digit province code to a twelve-digit division code
                    c.setCode(html.replace(fix, "0000000000"));
                    list.add(c);
                }
            }
            return list;
        }
        // City / county / town tables: the first <a> holds the code, the second the name
        for (int i = 0; i < ele.size(); i++) {
            Element item = ele.get(i);
            Elements aElements = item.getElementsByTag("a");
            if (aElements.size() >= 2) {
                City c = new City();
                String html = aElements.get(0).attr("href");
                String code = aElements.get(0).text();
                String name = aElements.get(1).text();
                if (type == 1) {
                    c.setProvince(city.getProvince());
                    c.setCity(name);
                } else if (type == 2) {
                    c.setProvince(city.getProvince());
                    c.setCity(city.getCity());
                    c.setCounty(name);
                } else if (type == 3) {
                    c.setProvince(city.getProvince());
                    c.setCity(city.getCity());
                    c.setCounty(city.getCounty());
                    c.setTown(name);
                }
                c.setCode(code);
                // Links are relative; prefix the province directory when it is missing
                String provinceCode = city.getCode().substring(0, 2);
                if (!html.startsWith(provinceCode + "/")) {
                    html = provinceCode + "/" + html;
                }
                c.setHtmlAddr(ADDRESS + html);
                list.add(c);
                System.out.println(c.toString());
            }
        }
        return list;
    }
}
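The crawler depends on com.justsy.army.mgt.mock.model.City, which is not included in the post. A minimal sketch of that model, assuming it is a plain data holder with exactly the getters and setters used above (the field names and the toString format are assumptions), could look like this:

package com.justsy.army.mgt.mock.model;

// Hypothetical reconstruction of the City model used by the crawler above;
// only the accessors referenced in NationalBureauOfStatics are included.
public class City {
    private String province;   // province name
    private String city;       // city name
    private String county;     // county / district name
    private String town;       // town / sub-district name
    private String code;       // 12-digit statistical division code
    private String htmlAddr;   // URL of the page for the next level

    public String getProvince() { return province; }
    public void setProvince(String province) { this.province = province; }
    public String getCity() { return city; }
    public void setCity(String city) { this.city = city; }
    public String getCounty() { return county; }
    public void setCounty(String county) { this.county = county; }
    public String getTown() { return town; }
    public void setTown(String town) { this.town = town; }
    public String getCode() { return code; }
    public void setCode(String code) { this.code = code; }
    public String getHtmlAddr() { return htmlAddr; }
    public void setHtmlAddr(String htmlAddr) { this.htmlAddr = htmlAddr; }

    @Override
    public String toString() {
        return code + " " + province + " " + city + " " + county + " " + town;
    }
}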

 

From: https://www.cnblogs.com/lixiuming521125/p/16945160.html
