Crawling the article index of my own CSDN blog

Date: 2023-06-30 19:00:43 | Views: 41

Tags: String, random, crawling, csdn, new, 100, linkNode, directory, conn

I won't go into detail on the dependencies; these are the Maven entries:

<!-- https://mvnrepository.com/artifact/net.sourceforge.htmlunit/htmlunit -->
<dependency>
	<groupId>net.sourceforge.htmlunit</groupId>
	<artifactId>htmlunit</artifactId>
	<version>2.35.0</version>
</dependency>
<!-- HTML parsing -->
<dependency>
	<groupId>org.jsoup</groupId>
	<artifactId>jsoup</artifactId>
	<version>1.11.3</version>
</dependency>
<dependency>
	<groupId>fr.opensagres.xdocreport</groupId>
	<artifactId>fr.opensagres.xdocreport.converter.docx.xwpf</artifactId>
	<version>2.0.1</version>
</dependency>

<!-- Alibaba JSON parser -->
<dependency>
	<groupId>com.alibaba</groupId>
	<artifactId>fastjson</artifactId>
	<version>1.2.31</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.commons/commons-text -->
<dependency>
	<groupId>org.apache.commons</groupId>
	<artifactId>commons-text</artifactId>
	<version>1.4</version>
</dependency>

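import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;

    public static void main(String[] args) {
        String name = ""; // CSDN username; the value is cut off in the original post
        String url = "";  // list-page URL prefix (page number is appended); cut off in the original post
        List<String> urlList = new ArrayList<>(); // shared across pages so the links are not discarded

        // Number of list pages to fetch
        for (int i = 0; i < 14; i++) {
            String oneUrl = url + i;
            try {
                getCSDNArticleUrlList2(name, oneUrl, urlList);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }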

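    public static void getCSDNArticleUrlList2(String name, String oneUrl, List<String> urlList)
            throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        // Fetch the list page with browser-like request headers; close the stream when done
        String content;
        try (InputStream inputStream = HttpUtil.doGet(oneUrl, null)) {
            content = StreamUtil.inputStreamToString(inputStream, "UTF-8");
        }
        Document doc = Jsoup.parse(content);
        Element articleList = doc.select("div.article-list").first();
        if (articleList == null) {
            // No article list on this page
            return;
        }
        Elements items = articleList.select("div.article-item-box");
        for (Element e : items) {
            Element linkNode = e.select("h4 a").first();
            if (linkNode == null) {
                continue;
            }
            String href = linkNode.attr("href");
            // For some reason, the first entry of every blog list is ... (the original comment breaks off here),
            // so only links whose href contains the username are kept
            if (href.contains(name)) {
                // The title is the second text node of the link; fall back to the full link text
                List<TextNode> textNodes = linkNode.textNodes();
                String title = textNodes.size() > 1 ? textNodes.get(1).text().trim() : linkNode.text().trim();
                // Print one Markdown link per article
                System.out.println("[" + title + "](" + href + ")");
                urlList.add(href);
            }
        }
    }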
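
The output format used above, one Markdown link per article, can be isolated into a small helper. What follows is a minimal sketch using only the JDK; the class `MarkdownTocSketch`, its method names, and the sample URLs are hypothetical and not part of the original code. It applies the same username check as the loop above and escapes square brackets in titles so that they do not break the link syntax.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MarkdownTocSketch {

    // Formats one entry as "[title](href)", escaping brackets inside the title.
    static String markdownLink(String title, String href) {
        String safeTitle = title.trim().replace("[", "\\[").replace("]", "\\]");
        return "[" + safeTitle + "](" + href + ")";
    }

    // Keeps only links whose href contains the username, mirroring the
    // contains(name) check in the crawler loop.
    static List<String> toToc(Map<String, String> hrefToTitle, String name) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> entry : hrefToTitle.entrySet()) {
            if (entry.getKey().contains(name)) {
                lines.add(markdownLink(entry.getValue(), entry.getKey()));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample data only; these URLs are placeholders, not real articles.
        Map<String, String> links = new LinkedHashMap<>();
        links.put("https://blog.example.com/alice/article/details/1", " First post ");
        links.put("https://blog.example.com/bob/article/details/2", "Someone else's post");
        links.put("https://blog.example.com/alice/article/details/3", "Notes [draft]");
        for (String line : toToc(links, "alice")) {
            System.out.println(line);
        }
    }
}
```

Keeping the formatting separate from the HTML parsing makes it easy to check the output without any network access.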

Utility methods: one from HttpUtil, and one that converts a stream to a string.

    public static InputStream doGet(String urlstr, Map<String, String> headers) throws IOException {
        URL url = new URL(urlstr);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Default headers that make the request look like it comes from a desktop browser
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " +
                "(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36");
        conn.setRequestProperty("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp," +
                "image/apng,*/*;q=0.8");

        // Caller-supplied headers override the defaults
        if (headers != null) {
            for (Map.Entry<String, String> header : headers.entrySet()) {
                conn.setRequestProperty(header.getKey(), header.getValue());
            }
        }
        // Send a random x-forwarded-for address; each of the four parts is in the range 100..199
        Random random = new Random();
        String ip =
                (random.nextInt(100) + 100) + "." + (random.nextInt(100) + 100) + "." + (random.nextInt(100) + 100) + "." + (random.nextInt(100) + 100);
        conn.setRequestProperty("x-forwarded-for", ip);
        return conn.getInputStream();
    }
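
The header handling in `doGet` can be looked at separately from the network call: defaults are set first, caller-supplied headers replace them, and the random `x-forwarded-for` value is applied last. A minimal sketch of that ordering, assuming a hypothetical `HeaderSketch` class (not in the original) and shortened header values:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

class HeaderSketch {

    // Builds a dotted address whose four parts are each in the range 100..199,
    // the same arithmetic as random.nextInt(100) + 100 in doGet.
    static String randomAddress(Random random) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 4; i++) {
            if (i > 0) {
                sb.append('.');
            }
            sb.append(random.nextInt(100) + 100);
        }
        return sb.toString();
    }

    // Merges defaults, caller overrides, and the random address, in that order.
    static Map<String, String> buildHeaders(Map<String, String> extra, Random random) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0"); // shortened for the example
        headers.put("accept", "text/html,*/*;q=0.8"); // shortened for the example
        if (extra != null) {
            headers.putAll(extra);
        }
        headers.put("x-forwarded-for", randomAddress(random));
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> extra = new LinkedHashMap<>();
        extra.put("Referer", "https://blog.example.com/"); // placeholder value
        buildHeaders(extra, new Random())
                .forEach((key, value) -> System.out.println(key + ": " + value));
    }
}
```

Note that a plain map is case-sensitive, so in this sketch an override only replaces a default when it uses the same spelling of the header name.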

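    public static String inputStreamToString(InputStream is, String charset) throws IOException {
        // Collect all bytes first and decode once at the end, so that a multi-byte
        // character straddling a buffer boundary is not corrupted.
        // Requires: import java.io.ByteArrayOutputStream;
        byte[] bytes = new byte[1024];
        int byteLength;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((byteLength = is.read(bytes)) != -1) {
            out.write(bytes, 0, byteLength);
        }
        return out.toString(charset);
    }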
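
One thing to watch when turning a stream into a string: if each buffer is decoded on its own, a multi-byte UTF-8 character that falls across two buffers is decoded as two broken halves. The sketch below compares per-chunk decoding with decoding once at the end. The class `ChunkDecodeSketch` and the tiny chunk size are assumptions made for illustration; with a 1024-byte buffer the same effect only shows up on longer pages.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ChunkDecodeSketch {

    // Decodes every chunk on its own; a character split across chunks is damaged.
    static String decodePerChunk(InputStream is, Charset charset, int chunkSize) throws IOException {
        byte[] buffer = new byte[chunkSize];
        StringBuilder sb = new StringBuilder();
        int n;
        while ((n = is.read(buffer)) != -1) {
            sb.append(new String(buffer, 0, n, charset));
        }
        return sb.toString();
    }

    // Collects all bytes first and decodes once at the end.
    static String decodeOnce(InputStream is, Charset charset, int chunkSize) throws IOException {
        byte[] buffer = new byte[chunkSize];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n;
        while ((n = is.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // Four Chinese characters, written as escapes; each takes 3 bytes in UTF-8.
        String text = "\u722c\u53d6\u76ee\u5f55";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // A chunk size of 4 is not a multiple of 3, so chunks cut through characters.
        String perChunk = decodePerChunk(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8, 4);
        String once = decodeOnce(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8, 4);
        System.out.println("per-chunk matches original: " + text.equals(perChunk));
        System.out.println("decode-once matches original: " + text.equals(once));
    }
}
```

On Java 9 and later, `InputStream.readAllBytes()` can replace the manual loop in `decodeOnce`.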

Crawl result:

[Screenshot of the crawl output; the image is not preserved]

Source: https://blog.51cto.com/u_16174475/6592975
