Java Crawlers: An Introduction to HtmlUnit

Posted: 2022-10-20 17:00:52
Tags: http, HtmlUnit, java, crawler, client, org, apache, import

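Front-end developers sometimes need to collect data from the web at short notice for a project. What approach is both easy to understand and usable over the long term? Building the crawler on a browser test framework is the most convenient option: lightly modify the test programs you already write in day-to-day work, pair them with a crawler proxy, and you can start collecting data right away.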

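I recently shared an article about Java crawlers, so today continues on the same topic. How familiar are Java developers with HtmlUnit? It is a headless browser for Java. Through its API it simulates the HTTP protocol and can request pages, submit forms, follow links and so on, fully simulating a user's client. It supports complex JavaScript and AJAX libraries and can emulate several browsers, including Chrome, Firefox and IE.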
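As a rough, library-free illustration of what "submitting a form" means at the HTTP level, the sketch below builds (but does not send) a form POST with the JDK's own java.net.http classes (Java 11+). This is not HtmlUnit and not the author's code; the class name, the field names, the User-Agent string and the httpbin.org/post URL are all assumptions chosen for illustration.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of HtmlUnit or of the original demo.
class FormRequestSketch {

    // Encode fields as application/x-www-form-urlencoded,
    // the default encoding a browser uses for an HTML form.
    public static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Build a POST request that carries the encoded form and a
    // browser-like User-Agent header. Nothing is sent over the network.
    public static HttpRequest buildFormPost(String url, Map<String, String> fields, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("User-Agent", userAgent)
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("q", "java crawler");   // example field, assumed
        fields.put("page", "1");           // example field, assumed

        HttpRequest req = buildFormPost("https://httpbin.org/post", fields, "ExampleAgent/1.0");
        System.out.println(req.method() + " " + req.uri()); // prints "POST https://httpbin.org/post"
        System.out.println(encodeForm(fields));             // prints "q=java+crawler&page=1"
    }
}
```

A headless browser such as HtmlUnit handles this encoding, along with cookies, redirects and JavaScript, behind its page and form objects, which is why it is convenient for crawling.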

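Below is a simple demo that accesses an IP lookup site through a crawler proxy. Change the target URL to the page you want to collect and you get that page's data; add a data-parsing module and it is basically usable. Note that the demo sends its request with Apache HttpClient (the org.apache.http classes it imports). It was written for a real project's needs, so it looks somewhat more complex than a minimal example: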

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.AuthCache;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpRequestBase;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.LayeredConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class Demo {
    // Proxy server (product site: www.16yun.cn)
    final static String proxyHost = "t.16yun.cn";
    final static Integer proxyPort = 31000;

    // Proxy credentials
    final static String proxyUser = "username";
    final static String proxyPass = "password";

    private static PoolingHttpClientConnectionManager cm = null;
    private static HttpRequestRetryHandler httpRequestRetryHandler = null;
    private static HttpHost proxy = null;

    private static CredentialsProvider credsProvider = null;
    private static RequestConfig reqConfig = null;

    static {
        ConnectionSocketFactory plainsf = PlainConnectionSocketFactory.getSocketFactory();
        LayeredConnectionSocketFactory sslsf = SSLConnectionSocketFactory.getSocketFactory();

        // Register socket factories for plain HTTP and for HTTPS connections
        Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
                .register("http", plainsf)
                .register("https", sslsf)
                .build();

        // Shared connection pool: at most 20 connections in total, 5 per route
        cm = new PoolingHttpClientConnectionManager(registry);
        cm.setMaxTotal(20);
        cm.setDefaultMaxPerRoute(5);

        proxy = new HttpHost(proxyHost, proxyPort, "http");

        // Credentials used to authenticate against the proxy
        credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(proxyUser, proxyPass));

        // 5-second timeouts; every request is routed through the proxy
        reqConfig = RequestConfig.custom()
                .setConnectionRequestTimeout(5000)
                .setConnectTimeout(5000)
                .setSocketTimeout(5000)
                .setExpectContinueEnabled(false)
                .setProxy(proxy)
                .build();
    }

    public static void doRequest(HttpRequestBase httpReq) {
        CloseableHttpResponse httpResp = null;

        try {
            setHeaders(httpReq);

            httpReq.setConfig(reqConfig);

            CloseableHttpClient httpClient = HttpClients.custom()
                    .setConnectionManager(cm)
                    .setDefaultCredentialsProvider(credsProvider)
                    .build();

            // Enable TCP keep-alive so the exit IP does not change while visiting HTTPS sites
            // SocketConfig socketConfig = SocketConfig.custom().setSoKeepAlive(true).setSoTimeout(3600000).build();
            // CloseableHttpClient httpClient = HttpClients.custom()
            //         .setConnectionManager(cm)
            //         .setDefaultCredentialsProvider(credsProvider)
            //         .setDefaultSocketConfig(socketConfig)
            //         .build();

            // Cache the Basic scheme for the proxy so credentials are sent preemptively
            AuthCache authCache = new BasicAuthCache();
            authCache.put(proxy, new BasicScheme());
            // If you get a 407, set up proxy authentication (Proxy-Authenticate) instead
            // authCache.put(proxy, new BasicScheme(ChallengeState.PROXY));

            HttpClientContext localContext = HttpClientContext.create();
            localContext.setAuthCache(authCache);

            httpResp = httpClient.execute(httpReq, localContext);

            // Print the status code, then the response body line by line
            int statusCode = httpResp.getStatusLine().getStatusCode();

            System.out.println(statusCode);

            BufferedReader rd = new BufferedReader(new InputStreamReader(httpResp.getEntity().getContent()));

            String line = "";
            while ((line = rd.readLine()) != null) {
                System.out.println(line);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (httpResp != null) {
                    httpResp.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Sets the request headers.
     *
     * @param httpReq the request to modify
     */
    private static void setHeaders(HttpRequestBase httpReq) {

        // Set Proxy-Tunnel
        // Random random = new Random();
        // int tunnel = random.nextInt(10000);
        // httpReq.setHeader("Proxy-Tunnel", String.valueOf(tunnel));

        httpReq.setHeader("Accept-Encoding", null);

    }


    public static void doGetRequest() {
        // Target page to request
        String targetUrl = "https://httpbin.org/ip";

        try {
            HttpGet httpGet = new HttpGet(targetUrl);

            doRequest(httpGet);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        doGetRequest();
    }
}
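
The "data-parsing module" mentioned before the demo is not shown in the original. As a minimal sketch of that step: https://httpbin.org/ip answers with a small JSON object whose "origin" field holds the caller's IP, so the value can be pulled out of the body that the demo prints. The class and method names here are hypothetical, and the regular expression is only adequate for a flat payload like this one; real pages would call for a proper JSON or HTML parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parsing helper, not part of the original demo.
class OriginParser {

    // Matches "origin": "<value>" and captures the value.
    private static final Pattern ORIGIN =
            Pattern.compile("\"origin\"\\s*:\\s*\"([^\"]*)\"");

    // Returns the origin IP if the body contains one, otherwise an empty Optional.
    public static Optional<String> extractOrigin(String body) {
        Matcher m = ORIGIN.matcher(body);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample body in the shape httpbin.org/ip returns; the address is a documentation IP.
        String sample = "{\n  \"origin\": \"203.0.113.7\"\n}";
        System.out.println(extractOrigin(sample).orElse("not found")); // prints "203.0.113.7"
    }
}
```

To use it with the demo, collect the lines read in doRequest into a single string and pass that to extractOrigin. If the proxy is working, the value should be the proxy's exit IP and not your own.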

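The example is adapted from a sample by 亿牛云. I bought their proxy service for an earlier business need and have kept using it, so I am sharing it along with this article. If you need proxies, their tunnel proxy is worth a try: of the many providers I have used, theirs has had the best IP quality and the best after-sales support.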

From: https://www.cnblogs.com/mmz77-aa/p/16810451.html
