Introduction
Hue supports the following services:
HDFS, HBase, S3
Hive, Impala, Presto, Trino, Flink, Phoenix, Druid
Solr, Elasticsearch, ClickHouse
Spark, YARN, Oozie, Knox
Oracle, PostgreSQL, DB2, Vertica, MySQL
The latest 5.x release adds support for Ozone.
Hadoop User Experience (HUE), for databases, data warehouses, and even data lakes.
Hue is an open-source, web-based graphical user interface that combines several different Hadoop ecosystem projects into a single configurable interface.
Building Hue
Note: Hue 4.11 is the last release that supports Python 2.7. Newer releases are based on Python 3.8 and above.
Officially supported Python versions
- Python 3.8
- Python 3.9
Set up a pip mirror
pip install -U pip
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
Note
Building with Python 2.7 went fairly smoothly; with Python 3.8 all sorts of missing-module errors appear and have not been resolved yet. The Python installation itself is suspected, since Python 3.8 is officially supported. To be revisited.
Installing Python 3.8
Install it via Anaconda, which makes it easy to switch between Python versions.
export PYTHON_VER=python3.8
ln -s xxxx/pip3.8 /usr/bin
ln -s xxxx/python3.8 /usr/bin
PREFIX=/srv/dstore/1.0.0.0/ make install # Note: after a successful build, the shebang of the Python scripts in the virtual environment contains this path; if you move the installation elsewhere, update the paths under hue/build/env/bin accordingly
pip install virtualenv
yum -y install rsync
yum -y install ant asciidoc cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-plain gcc gcc-c++ krb5-devel libffi-devel libxml2-devel libxslt-devel make mysql mysql-devel openldap-devel python-devel sqlite-devel gmp-devel
Installing Node.js
wget https://nodejs.org/dist/v16.13.0/node-v16.13.0-linux-x64.tar.xz
tar -xvf node-v16.13.0-linux-x64.tar.xz
export PATH=$PATH:/opt/software/node-v16.13.0/bin
npm config set registry=http://registry.npm.taobao.org
Installing Maven
wget https://dlcdn.apache.org/maven/maven-3/3.8.3/binaries/apache-maven-3.8.3-bin.tar.gz
tar -zxvf apache-maven-3.8.3-bin.tar.gz
Edit settings.xml
<mirror>
<id>alimaven</id>
<name>aliyun maven</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<mirrorOf>central</mirrorOf>
</mirror>
Start the build
export PYTHON_VER=/xxx/xxx/python.xx # If this environment variable is not set, the default Python 2.7 is used
PREFIX=/srv/dstore/1.0.0.0 make install
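Once the build finishes, Hue can be started from the install location to verify it works; a minimal sketch, assuming make install placed Hue under the PREFIX above (the exact layout is an assumption) and the default port 8888:
# Development server, reachable on port 8888
/srv/dstore/1.0.0.0/hue/build/env/bin/hue runserver 0.0.0.0:8888
# Or run it under the built-in supervisor for a longer-lived process
/srv/dstore/1.0.0.0/hue/build/env/bin/supervisor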
While integrating ES and ClickHouse a number of library problems came up, and Ozone support requires Hue 5.x, so the source was pulled again and rebuilt.
Python 3.8
Errors
Issue 1
ImportError: libxslt.so.1: cannot open shared object file: No such file or directory
Solution
yum -y install asciidoc
Integrating HDFS
The HDFS settings differ depending on whether HA is enabled.
Edit the Hue configuration file pseudo-distributed.ini
# Configuration for HDFS NameNode
# ------------------------------------------------------------------------
[[hdfs_clusters]]
# HA support by using HttpFs
[[[default]]]
# Enter the filesystem uri
fs_defaultfs=hdfs://cluster
# NameNode logical name.
logical_name=cluster
# Use WebHdfs/HttpFs as the communication mechanism.
# Domain should be the NameNode or HttpFs host.
# Default port is 14000 for HttpFs.
webhdfs_url=http://sun75:14000/webhdfs/v1
# Change this if your HDFS cluster is Kerberos-secured
## security_enabled=false
# In secure mode (HTTPS), if SSL certificates from YARN Rest APIs
# have to be verified against certificate authority
## ssl_cert_ca_verify=True
# Directory of the Hadoop configuration
## hadoop_conf_dir=$HADOOP_CONF_DIR when set or '/etc/hadoop/conf'
hadoop_conf_dir=/srv/dstore/1.0.0.0/hdfs/etc/hadoop
# Whether Hue should list this HDFS cluster. For historical reason there is no way to disable HDFS.
## is_enabled=true
The config above references HttpFs. Why is it needed at all? ChatGPT summarizes the reasons as follows:
- Cross-platform and cross-language support: HTTPFS exposes a standard HTTP interface, so it can be used from virtually any language or platform, making interaction with HDFS more flexible and convenient.
- Security: HTTPFS makes it easier to implement access control and authentication for HDFS, for example HTTP-based authentication and authorization mechanisms to protect data.
- Firewall traversal: in some network environments, security policy restricts direct access to the Hadoop cluster APIs, whereas HTTP traffic usually passes through firewalls more easily.
- Simplified configuration: because HTTPFS uses plain HTTP, client configuration is simpler than connecting directly to the NameNode API.
Summary:
Even though you can operate on the filesystem by talking directly to the NameNode API, HTTPFS still has clear advantages, especially for cross-platform access, security, and restricted network environments.
As the comments in the official config indicate, Hue can talk to HDFS in two different ways (a quick REST sanity check is sketched after the list):
- WebHDFS: a RESTful HDFS access protocol built into Hadoop. It exposes filesystem operations over HTTP through a REST API. By pointing Hue at a WebHDFS URL, Hue communicates with HDFS over HTTP.
- HttpFS: an HTTP proxy service for Hadoop that exposes the same REST API over HTTP. If WebHDFS is not available, HttpFS can serve as the alternative communication mechanism.
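Both endpoints speak the same WebHDFS-style REST API, so either one can be sanity-checked with curl before configuring Hue. A minimal sketch, assuming the hosts and ports used elsewhere in this document (sun75; WebHDFS on the NameNode HTTP port, HttpFS on 14000) and an unsecured cluster:
# List the root directory through WebHDFS on the active NameNode (50070 on Hadoop 2.x, 9870 on 3.x)
curl "http://sun75:50070/webhdfs/v1/?op=LISTSTATUS&user.name=hue"
# The same call through HttpFS
curl "http://sun75:14000/webhdfs/v1/?op=LISTSTATUS&user.name=hue"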
WebHdfs方式
In an HA setup, this URL can only point to the NameNode that is currently active.
webhdfs_url=http://sun75:50070/webhdfs/v1
Edit core-site.xml
Configure Hue as a proxy user for other users and groups:
<!-- HUE -->
<property>
<name>hadoop.proxyuser.hue.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hue.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name> <!-- Without this, Hue will not start no matter what -->
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name> <!-- Without this, Hue will not start no matter what -->
<value>*</value>
</property>
Edit hdfs-site.xml
<!-- HUE -->
<property>
<!-- HDFS web service (WebHDFS) -->
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
HttpFS approach
The HttpFS service must be started on the Hadoop cluster (see the sketch after the config below).
webhdfs_url=http://sun75:14000/webhdfs/v1
In addition to the WebHDFS setup above, edit the httpfs-site.xml file:
<property>
<name>httpfs.proxyuser.hue.hosts</name>
<value>*</value>
</property>
<property>
<name>httpfs.proxyuser.hue.groups</name>
<value>*</value>
</property>
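The HttpFS server ships with HDFS; a minimal sketch for starting it and checking that it answers, assuming Hadoop 3.x and sun75 as the HttpFS host (older releases use httpfs.sh start instead):
# Start the HttpFS daemon on the HttpFS host
hdfs --daemon start httpfs
# Verify the WebHDFS-compatible endpoint on port 14000
curl "http://sun75:14000/webhdfs/v1/?op=LISTSTATUS&user.name=hue"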
Integrating YARN
The configuration below is for a ResourceManager HA environment.
# Configuration for YARN (MR2)
# ------------------------------------------------------------------------
[[yarn_clusters]]
[[[default]]]
# Enter the host on which you are running the ResourceManager
#resourcemanager_host=sun46 # not needed in an HA environment
# The port where the ResourceManager IPC listens on
#resourcemanager_port=8032 # not needed in an HA environment
# Whether to submit jobs to this cluster
submit_to=True
# Resource Manager logical name (required for HA)
logical_name=rm1
# Change this if your YARN cluster is Kerberos-secured
## security_enabled=false
# URL of the ResourceManager API
resourcemanager_api_url=http://sun46:8088
# URL of the ProxyServer API
proxy_api_url=http://sun46:8088
# URL of the HistoryServer API
history_server_api_url=http://sun54:18888
# URL of the Spark History Server
## spark_history_server_url=http://localhost:18088
# Change this if your Spark History Server is Kerberos-secured
## spark_history_server_security_enabled=false
# In secure mode (HTTPS), if SSL certificates from YARN Rest APIs
# have to be verified against certificate authority
## ssl_cert_ca_verify=True
# HA support by specifying multiple clusters.
# Redefine different properties there.
# e.g.
[[[ha]]]
# Resource Manager logical name (required for HA)
logical_name=rm2
# Un-comment to enable
submit_to=True
# URL of the ResourceManager API
resourcemanager_api_url=http://sun54:8088
history_server_api_url=http://sun54:19888
Non-HA environment
[[yarn_clusters]]
[[[default]]]
# Enter the host on which you are running the ResourceManager
resourcemanager_host=sun46
# The port where the ResourceManager IPC listens on
resourcemanager_port=8032
# Whether to submit jobs to this cluster
submit_to=True
# Resource Manager logical name (required for HA)
logical_name=yarn-cluster
# URL of the ResourceManager API
resourcemanager_api_url=http://sun46:8088
# URL of the ProxyServer API
proxy_api_url=http://sun46:8088
# URL of the HistoryServer API
history_server_api_url=http://sun46:18888
Test command (the pi example takes the number of map tasks and samples as its arguments):
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar pi 10 100
After the YARN job finishes, log in to the Hue web UI to check it.
Users other than hue do not have permission to search the jobs.
Integrating HBase
Note: Hue only supports HBase Thrift 1; using Thrift 2 causes problems.
The default HBase Thrift server port is 9090.
Official configuration: https://docs.gethue.com/administrator/configuration/connectors/#hbase
Edit hbase-site.xml
<property>
<name>hbase.regionserver.thrift.http</name>
<value>true</value>
</property>
<property>
<name>hbase.thrift.support.proxyuser</name>
<value>true</value>
</property>
Edit core-site.xml
<property>
<name>hadoop.proxyuser.hbase.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hbase.groups</name>
<value>*</value>
</property>
Edit the Hue configuration
[hbase]
# Comma-separated list of HBase Thrift servers for clusters in the format of '(name|host:port)'.
# Use full hostname. If hbase.thrift.ssl.enabled in hbase-site is set to true, https will be used otherwise it will use http
# If using Kerberos we assume GSSAPI SASL, not PLAIN.
hbase_clusters=(Cluster|sun54:9090)
# HBase configuration directory, where hbase-site.xml is located.
hbase_conf_dir=/srv/dstore/1.0.0.0/hbase/conf
# Hard limit of rows or columns per row fetched before truncating.
## truncate_limit = 500
# Should come from hbase-site.xml, do not set. 'framed' is used to chunk up responses, used with the nonblocking server in Thrift but is not supported in Hue.
# 'buffered' used to be the default of the HBase Thrift Server. Default is buffered when not set in hbase-site.xml.
thrift_transport=buffered
# Choose whether Hue should validate certificates received from the server.
## ssl_cert_ca_verify=true
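Before testing, the Thrift 1 server referenced in hbase_clusters has to be running on sun54. A minimal sketch, assuming the HBase bin directory is on the PATH:
# Start the HBase Thrift 1 server as a daemon (default port 9090)
hbase-daemon.sh start thrift
# Or run it in the foreground with an explicit port
hbase thrift start -p 9090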
Test
Integrating Hive
[beeswax]
hive_server_host=sun46
hive_server_port=10000
hive_conf_dir=/srv/dstore/1.0.0.0/hive/conf
thrift_version=11
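Before pointing Hue at it, HiveServer2 can be checked directly with Beeline; a minimal sketch, assuming beeline is on the PATH and no authentication is enabled:
# Connect to HiveServer2 and list databases
beeline -u "jdbc:hive2://sun46:10000" -n hue -e "SHOW DATABASES;"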
Integrating S3
[aws]
# Enable the detection of an IAM role providing the credentials automatically. It can take a few seconds.
## has_iam_detection=false
[[aws_accounts]]
# Default AWS account
[[[default]]]
# AWS credentials
access_key_id=AKIAIOSFODNN7EXAMPLE
secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
## security_token=
# Execute this script to produce the AWS access key ID.
## access_key_id_script=/path/access_key_id.sh
# Execute this script to produce the AWS secret access key.
## secret_access_key_script=/path/secret_access_key.sh
# Allow to use either environment variables or
# EC2 InstanceProfile to retrieve AWS credentials.
allow_environment_credentials=false
# AWS region to use, if no region is specified, will attempt to connect to standard s3.amazonaws.com endpoint
region=us-east-1
# Endpoint overrides
# host=
# Endpoint overrides
proxy_address=192.168.2.177
proxy_port=9878
## proxy_user=
## proxy_pass=
# Secure connections are the default, but this can be explicitly overridden:
is_secure=false
# The default calling format uses https://<bucket-name>.s3.amazonaws.com but
# this may not make sense if DNS is not configured in this way for custom endpoints.
# e.g. Use boto.s3.connection.OrdinaryCallingFormat for https://s3.amazonaws.com/<bucket-name>
calling_format=boto.s3.connection.OrdinaryCallingFormat
# The time in seconds before a delegate key is expired. Used when filebrowser/redirect_download is used. Default to 4 Hours.
## key_expiry=14400
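A quick way to validate the credentials and endpoint outside of Hue is the AWS CLI; a sketch, assuming the proxy address above is an S3-compatible gateway (9878 is the Ozone S3 Gateway default) and the AWS CLI is installed:
# Use the example credentials from the config above
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# List buckets through the custom endpoint
aws s3 ls --endpoint-url http://192.168.2.177:9878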
Integrating Spark
Two connectors can be used to communicate with the Spark SQL Thrift Server.
Apache Livy
Apache Livy is a service that makes it easy to interact with a Spark cluster over a REST interface. It supports submitting Spark jobs or snippets of Spark code, retrieving results synchronously or asynchronously, and managing Spark contexts, all through a simple REST interface or an RPC client library.
Installing Livy
wget https://dlcdn.apache.org/incubator/livy/0.8.0-incubating/apache-livy-0.8.0-incubating_2.11-bin.zip --no-check-certificate
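After downloading, unpack the archive, point Livy at the local Spark installation, and start the server. A minimal sketch; the SPARK_HOME and HADOOP_CONF_DIR paths are assumptions and need to match your environment:
unzip apache-livy-0.8.0-incubating_2.11-bin.zip
cd apache-livy-0.8.0-incubating_2.11-bin
# Tell Livy where Spark and the Hadoop configuration live (paths are placeholders)
echo "export SPARK_HOME=/srv/dstore/1.0.0.0/spark" >> conf/livy-env.sh
echo "export HADOOP_CONF_DIR=/srv/dstore/1.0.0.0/hdfs/etc/hadoop" >> conf/livy-env.sh
# Start the Livy server (listens on port 8998 by default)
./bin/livy-server start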
Error
{"msg":"Rejected, Reason: requirement failed: Cannot find Livy REPL jars."}
Livy needs to be rebuilt to support Spark 3.1.
Build prerequisites
- mvn (from package or maven3 tarball)
- java-1.8.0-openjdk (or Oracle JDK 8)
- Python 2.7+
- R 3.x
Notes
Python 3.8
ln -s /usr/bin/python3.8 /usr/bin/python # Livy looks for /usr/bin/python by default
pip uninstall setuptools
pip install setuptools==39.1.0
R environment
yum -y install R
Build
mvn clean package -Pspark3 -Pscala-2.12 -DskipTests
Path of the built package
/usr/local/src/incubator-livy-master/assembly/target
apache-livy-0.9.0-incubating-SNAPSHOT_2.12-bin.zip
After installation, open http://localhost:8998 in a browser (see the screenshot).
One of the rows shows an application ID; it belongs to the test job submitted later, and clicking it opens the Spark web UI.
Livy supports two ways of submitting work:
1. Interactive: essentially, the statements you would normally type into spark-shell are sent to the Livy server over HTTP; Livy opens a spark-shell on the server side and executes whatever you send.
2. Batch: essentially, Livy does the spark-submit for you; the parameters are likewise sent to the Livy server over HTTP.
Using the interactive mode
Using Postman, first create a session (which really just starts a Spark application): send a POST request to livy-ip:8998/sessions.
The request body is JSON, similar to the one below: it can specify the appName, how much memory to request, how many CPUs, and so on.
You then receive a response.
You can see that a session has been created with id=0.
Next, we can send Scala statements to that session for execution: POST to livy-ip:8998/sessions/{id}/statements.
Here id=0 refers to the id of that statement.
Then send another request to view the output: GET livy-ip:8998/sessions/1/statements/0.
In the response, output.data holds the actual output data and progress indicates the progress; if progress is less than 1, the statement has not finished yet. (The same flow with curl is sketched below.)
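A minimal curl sketch of the interactive flow, assuming Livy is running unsecured on localhost:8998:
# 1. Create an interactive Scala session (the response contains its id, e.g. 0)
curl -X POST -H "Content-Type: application/json" \
  -d '{"kind": "spark", "name": "hue-test"}' http://localhost:8998/sessions
# 2. Submit a statement to session 0
curl -X POST -H "Content-Type: application/json" \
  -d '{"code": "sc.parallelize(1 to 10).sum()"}' http://localhost:8998/sessions/0/statements
# 3. Poll the statement result; check output.data and progress
curl http://localhost:8998/sessions/0/statements/0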
All of this assumes Apache Livy is up and running.
Edit the Hue configuration
[[[pyspark]]]
name=PySpark
interface=livy
[[[sparksql]]]
name=SparkSql
interface=livy
[[[spark]]]
name=Scala
interface=livy
[[[r]]]
name=R
interface=livy
[spark]
# The Livy Server URL.
livy_server_url=http://localhost:8998
[desktop]
app_blacklist=
[notebook]
show_notebooks=true
Restart Hue and test:
curl -X POST --data '{"kind": "pyspark", "proxyUser": "bob"}' -H "Content-Type: application/json" localhost:8998/sessions
{"id":0,"state":"starting","kind":"pyspark","proxyUser":"bob","log":[]}
Check the job execution status in the Hue web UI.
Integrating Spark SQL
Option 1: Thrift Server
Apache Livy can submit Spark SQL through Hue. Livy was originally created within the Hue project and provides lightweight submission of interactive or batch PySpark / Scala Spark / SparkSQL statements.
However, a major drawback is that it can look less official than the distributed SQL engine that ships with Spark (also known as the "Thrift Server"). Hue can connect to Spark SQL through two interfaces:
Thrift Server:
It looks like Spark has to be rebuilt for this, so it is skipped for now:
Error: Failed to load class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
Failed to load main class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
You need to build Spark with -Phive and -Phive-thriftserver.
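As the error message says, the Spark distribution in use was built without the Hive Thrift Server. A rebuild sketch from the Spark source tree; the -Phive and -Phive-thriftserver profiles come from the error message, the remaining flags are assumptions:
# Rebuild Spark with Hive and the Thrift Server included
./build/mvn -Phive -Phive-thriftserver -Pyarn -DskipTests clean package
# The Thrift Server then exposes a HiveServer2-compatible endpoint (port 10000 by default)
./sbin/start-thriftserver.sh --master yarn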
Option 2: Apache Livy
Already set up above.
[spark]
# The Livy Server URL.
livy_server_url=http://localhost:8998
[[[sparksql]]]
name=Spark SQL
interface=livy
Integrating Ozone
https://gethue.com/blog/discover-the-power-of-apache-ozone-using-the-hue-file-browser/
Hue 5.0 supports Ozone.
Ozone 1.4 supports HttpFS.
According to the Hue website, HttpFS is used to operate on Ozone, and Ozone only gained HttpFS support in 1.4, so testing this requires building both Hue and Ozone.
Check the latest commits in the Ozone repository.
Ozone needs to be rebuilt:
It is best to build on a Linux server; if the build fails, retrying a few times may help.
git clone https://github.com/apache/ozone.git
mvn clean package -DskipTests
Path of the built package
hadoop-ozone/dist/target/ozone-1.4.0-SNAPSHOT.tar.gz
The HttpFS service additionally needs to be started.
Hue configuration
[[ozone]]
[[[default]]]
fs_defaultfs=ofs://[***SERVICE-ID***]
webhdfs_url=http://[***OZONE-HTTPFS-HOST***]:[***OZONE-HTTPFS-PORT***]/webhdfs/v1
ssl_cert_ca_verify=true
security_enabled=true
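Since Ozone's HttpFS gateway speaks the same WebHDFS-compatible REST API, the endpoint configured in webhdfs_url can be checked with curl once the gateway is up. A sketch using the placeholders from the config above:
# List the root of the Ozone filesystem through HttpFS
curl "http://[***OZONE-HTTPFS-HOST***]:[***OZONE-HTTPFS-PORT***]/webhdfs/v1/?op=LISTSTATUS&user.name=hue"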
Integrating Flink
The Flink SQL Gateway makes it possible to submit SQL queries from the Hue editor.
The Flink SQL Gateway currently only supports Apache Flink 1.10.
https://docs.gethue.com/administrator/configuration/connectors/#apache-flink
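Based on the connector documentation linked above, the editor entry follows the same pattern as the other interpreters in this document; a sketch, with the SQL gateway host and port as placeholders:
[[[flink]]]
name=Flink
interface=flink
options='{"url": "http://FLINK-SQL-GATEWAY-HOST:8083"}'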
Integrating Elasticsearch
pip install elasticsearch-dbapi==0.2.4 # pick this older version for Python 2
[[[es]]]
name = Elastic Search
interface=sqlalchemy
options='{"url": "elasticsearch+http://192.168.2.75:9200"}'
Integrating MySQL
./build/env/bin/pip install mysqlclient
[[[mysql]]]
name=MySQL
interface=sqlalchemy
options='{"url": "mysql://root:root@localhost:3306/hue"}'
Integrating Oracle
./build/env/bin/pip install cx_Oracle
[[[oracle]]]
name = Oracle
interface=sqlalchemy
options='{"url": "oracle://system:[email protected]:1521/ORCL"}'
Issue 1
DatabaseError: DPI-1047: Cannot locate a 64-bit Oracle Client library: “libclntsh.so: cannot open shared object file: No such file or directory”. See https://oracle.github.io/odpi/doc/installation.html#linux for help
mkdir -p /opt/oracle
cd /opt/oracle
wget https://download.oracle.com/otn_software/linux/instantclient/193000/instantclient-basic-linux.x64-19.3.0.0.0dbru.zip
unzip instantclient-basic-linux.x64-19.3.0.0.0dbru.zip
sudo sh -c "echo /opt/oracle/instantclient_19_3 > /etc/ld.so.conf.d/oracle-instantclient.conf"
sudo ldconfig
export LD_LIBRARY_PATH=/opt/oracle/instantclient_19_3:$LD_LIBRARY_PATH
Issue 2
(cx_Oracle.DatabaseError) ORA-12504: TNS:listener was not given the SERVICE_NAME in CONNECT_DATA (Background on this error at: http://sqlalche.me/e/4xp6)
The SERVICE_NAME was not configured in the connection URL (one possible fix is sketched below).
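A hedged fix, assuming the database is addressed by a service name (ORCL, as in the URL above): pass service_name as a query parameter, which is how the SQLAlchemy cx_Oracle dialect expects service names rather than SIDs:
options='{"url": "oracle://system:[email protected]:1521/?service_name=ORCL"}'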
Integrating PostgreSQL
./build/env/bin/pip install psycopg2
or
./build/env/bin/pip install psycopg2-binary
[[[postgresql]]]
name = postgresql
interface=sqlalchemy
options='{"url": "postgresql+psycopg2://postgres:[email protected]:5432/cvs"}' # cvs is the database name
References
https://www.cnblogs.com/Likfees/p/15585532.html
https://docs.gethue.com/releases/release-notes-4.11.0/
https://blog.csdn.net/coderookie_run/article/details/108373170
https://blog.csdn.net/qq_25275355/article/details/124860691
https://blog.csdn.net/Meepoljd/article/details/122554037
https://dandelioncloud.cn/article/details/1419660854806777858
https://blog.csdn.net/weixin_34023982/article/details/86123778
https://blog.csdn.net/liweihope/article/details/122869416 # integration
https://demo.dandelioncloud.cn/article/details/1584360560164851713 # integration