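Splash is a JavaScript rendering service that makes it possible to scrape dynamically rendered pages.

1. Introduction

Features
- Renders multiple pages asynchronously
- Returns the rendered page's HTML source, a screenshot, and page-loading details (HAR, similar to the Network panel in browser developer tools)
- Executes custom JavaScript in the page
- Lets Lua scripts control the rendering process
Prerequisites
- Deploy the Splash service with Docker
  - Pull the image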
```
docker pull scrapinghub/splash
```
  - Run the Splash service
```
docker run --name splash -d -p 8050:8050 scrapinghub/splash --max-timeout 3600
```
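    The trailing --max-timeout argument raises the maximum allowed timeout (here to 3600 seconds) to work around 504 status codes on slow renders
- Test
  - Open http://127.0.0.1:8050/ in a local browser to view the Splash web UI
  - Enter a URL in the input box to the left of the Render me! button, for example this dynamically rendered page: https://www.nmpa.gov.cn/zwfw/sdxx/sdxxyp/yppjfb/20230111151558195.html
  - Change the Lua script on the page to the following: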
```
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
```
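    The wait time is set to 5 seconds to give the page a good chance to finish loading
  - Click Render me! to view the rendered page's HTML source, screenshot, and loading details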
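
Before sending render requests from a script, it can help to confirm that the container is up. The sketch below assumes Splash's `/_ping` endpoint, which is expected to return a small JSON document with a `status` field. The helper names `is_healthy` and `ping_splash` are illustrative, and only the standard library is used so the snippet needs no extra packages.

```python
import json
from urllib.request import urlopen

SPLASH_URL = 'http://127.0.0.1:8050'  # address published by the docker run command


def is_healthy(ping_body):
    """Return True if a /_ping response body reports status "ok"."""
    try:
        return json.loads(ping_body).get('status') == 'ok'
    except (ValueError, AttributeError):
        # Not JSON at all, or JSON that is not an object
        return False


def ping_splash(base_url=SPLASH_URL, timeout=5):
    """Query /_ping and return False if the service cannot be reached."""
    try:
        with urlopen(base_url + '/_ping', timeout=timeout) as resp:
            return is_healthy(resp.read().decode('utf-8'))
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses
        return False
```

Calling `ping_splash()` once at the start of a crawl gives an early, readable failure instead of a connection error in the middle of a run.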
Official reference documentation:
- https://splash.readthedocs.io/en/stable/api.html

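2. Common APIs

Overview
- Splash exposes several HTTP APIs; a Python program only needs to request them with the appropriate parameters to get the rendered result of a page
render.html
- Returns the HTML of the rendered page
- Common parameters:
  - url: the URL to render
  - wait: time in seconds to wait after the page loads (default 0); to make sure the page has fully loaded, set it explicitly, for example to 5; it must not exceed the timeout value
  - timeout: render timeout in seconds, default 30; the maximum is 90 unless a larger limit was set with --max-timeout when starting Splash
  - resource_timeout: timeout for each individual network request
  - http_method: request method, GET by default; POST is also supported
  - proxy: proxy to use, in the form [protocol://][user:password@]proxyhost[:port]; the protocol must be http or socks5
  - images: whether to load images, 1 (load, the default) or 0 (do not load)
  - headers: request headers, as a JSON array or object; in array form each element must be a (header_name, header_value) pair
  - body: request body when http_method is POST, a string such as name=laowang&age=30; the default Content-Type header is application/x-www-form-urlencoded
  - For other parameters see the official documentation: https://splash.readthedocs.io/en/stable/api.html#render-html
- Example: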
```
import requests

api_url = 'http://127.0.0.1:8050/render.html'
args = {
    'url': 'http://www.httpbin.org/post',
    'wait': 5,
    'http_method': 'POST',
    'body': 'name=laowang&age=30'
}
response = requests.get(url=api_url, params=args)
print(response.text)
```
  requests sends a GET request to the HTTP API, but the result returned is that of the POST request Splash made to the target URL
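
The parameter rules above can be checked before a request is sent. The following is a minimal sketch, not taken from the original post: `build_render_args` is a hypothetical helper that enforces the wait/timeout constraint, and it assumes, as the Splash documentation describes, that `headers` is only honored when the arguments are sent as a JSON POST body. The target URL and header value are illustrative.

```python
import json


def build_render_args(url, wait=0, timeout=30, headers=None, **extra):
    """Assemble render.html arguments, rejecting a wait longer than timeout."""
    if wait > timeout:
        raise ValueError('wait (%s) must not exceed timeout (%s)' % (wait, timeout))
    args = {'url': url, 'wait': wait, 'timeout': timeout}
    if headers:
        args['headers'] = headers
    args.update(extra)  # any further Splash parameters, e.g. images=0
    return args


args = build_render_args(
    'http://www.httpbin.org/get',
    wait=5,
    headers={'User-Agent': 'my-crawler/0.1'},  # illustrative header value
    images=0,
)
payload = json.dumps(args)

# Sending it (requires a running Splash instance and the requests package):
# import requests
# response = requests.post('http://127.0.0.1:8050/render.html',
#                          data=payload,
#                          headers={'Content-Type': 'application/json'})
# print(response.text)
```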
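render.png
- Returns a PNG screenshot of the page as binary data
- Parameters:
  - width: resize the screenshot to this width in pixels, keeping the aspect ratio
  - height: crop the screenshot to this height in pixels
  - render_all: whether to render and capture the whole page, 1 (yes; the image can be very tall) or 0 (no, the default); when set to 1, the wait parameter must also be set
  - Others: same as render.html
- Example: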
```
import requests

api_url = 'http://127.0.0.1:8050/render.png'
args = {
    'url': 'https://www.cnblogs.com/eliwang/p/17004910.html',
    'wait': 5,
    'images': 0,
    # 'width': 1000,
    # 'height': 700,
    'render_all': 1
}
response = requests.get(url=api_url, params=args)
with open('test.png', 'wb') as f:
    f.write(response.content)
```
  This renders and captures the whole page without loading the images in it
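
When a render fails, Splash may answer with an error message instead of image data, so writing `response.content` straight to disk can produce an unreadable file. Below is a minimal sketch of a guard that relies only on the standard 8-byte PNG signature; `is_png` and `save_png` are illustrative names.

```python
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # first 8 bytes of every PNG file


def is_png(data):
    """Return True if the byte string starts with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


def save_png(data, path):
    """Write data to path, refusing anything that is not a PNG image."""
    if not is_png(data):
        raise ValueError('response is not a PNG image: %r' % data[:80])
    with open(path, 'wb') as f:
        f.write(data)
```

In the example above, `save_png(response.content, 'test.png')` would replace the plain `open`/`write` pair.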
render.jpeg
- Returns a JPEG screenshot of the page as binary data
- Parameters:
  - quality: image quality from 0 to 100, default 75; values above 95 should be avoided
  - Others: same as render.png
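
No example is given for render.jpeg; a request might look like the sketch below, which reuses the local Splash address from the earlier examples. The helper `jpeg_url` and its cap of 95 on quality are illustrative choices, and the URL is built with the standard library so it can be inspected without a running service.

```python
from urllib.parse import urlencode

API_URL = 'http://127.0.0.1:8050/render.jpeg'


def jpeg_url(page_url, quality=75, wait=5):
    """Build a render.jpeg request URL, keeping quality within 0-95."""
    quality = max(0, min(int(quality), 95))
    query = urlencode({'url': page_url, 'wait': wait, 'quality': quality})
    return API_URL + '?' + query


request_url = jpeg_url('https://www.cnblogs.com/eliwang/p/17004910.html', quality=100)
print(request_url)

# Fetching it (requires a running Splash instance and the requests package):
# import requests
# with open('test.jpg', 'wb') as f:
#     f.write(requests.get(request_url).content)
```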
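render.json
- Returns the requested data as JSON
  - Covers the functionality of all the APIs above
  - Parameters control what the result contains
- Parameters:
  - All render.jpeg parameters
  - html: whether to include the page's HTML source, 0 or 1, default 0
  - png: whether to include a PNG screenshot of the page (base64-encoded), 0 or 1, default 0
  - iframes: whether to include child frames, 0 or 1, default 0
  - For others see the official documentation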
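
Because render.json returns the screenshot as base64 text inside the JSON document, it has to be decoded before it is written to disk. Here is a minimal sketch, assuming the response has already been parsed into a dict (for example with `response.json()`); `split_result` is a hypothetical helper, and the sample dict stands in for a real response.

```python
import base64


def split_result(result):
    """Return (html, png_bytes) from a parsed render.json response."""
    html = result.get('html')
    png_b64 = result.get('png')
    # Keys are only present when html=1 / png=1 were requested
    png_bytes = base64.b64decode(png_b64) if png_b64 else None
    return html, png_bytes


# Stand-in for requests.get(api_url, params={'url': ..., 'html': 1, 'png': 1}).json()
sample = {
    'html': '<html><body>hello</body></html>',
    'png': base64.b64encode(b'\x89PNG\r\n\x1a\n...').decode('ascii'),
}
html, png_bytes = split_result(sample)
```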

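execute

Runs a Lua script, giving full control over how the result is obtained; this is the most powerful of the endpoints
Parameters:
- lua_source: the automation script, as a string
- For others see the official documentation

Example: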

```
import requests
import base64

lua_source = '''
function main(splash, args)
  assert(splash:go("https://www.baidu.com"))
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png()
  }
end
'''

api_url = 'http://127.0.0.1:8050/execute'
args = {
    'lua_source': lua_source
}

response = requests.get(url=api_url, params=args)
result = response.json()

# Print the HTML
print(result.get('html'))

# Save the PNG image (Splash returns it base64-encoded)
with open('test.png', 'wb') as f:
    f.write(base64.b64decode(result.get('png')))
```
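
The script above hard-codes the target URL. Extra request arguments are exposed to the Lua script through `args`, so one script can serve many pages. The sketch below builds such a request as a JSON POST body, which also avoids putting a long script into the query string; `build_execute_payload` is an illustrative helper, not part of Splash.

```python
import json

LUA_SOURCE = '''
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return {
    html = splash:html(),
    png = splash:png()
  }
end
'''


def build_execute_payload(url, wait=5):
    """Return the JSON body for a POST request to the execute endpoint."""
    return json.dumps({'lua_source': LUA_SOURCE, 'url': url, 'wait': wait})


payload = build_execute_payload('https://www.baidu.com')

# Sending it (requires a running Splash instance and the requests package):
# import requests
# response = requests.post('http://127.0.0.1:8050/execute', data=payload,
#                          headers={'Content-Type': 'application/json'})
# result = response.json()
```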

Source: https://www.cnblogs.com/eliwang/p/17087911.html
