We already installed hail and tried it out briefly in an earlier article:
Data Mining (V) -- a full overview and installation of hail, the open-source storage and compute architecture for a scalable, Spark-based genomic data analysis platform
However, that way of running hail requires entering hail's conda virtual environment first.
Our business code normally runs outside of that environment and carries other logic as well, so ideally we should be able to call hail directly from a Python (.py) program rather than dropping into the conda environment.
This article therefore documents how to use hail from a Python program.
We access the cluster in client mode, so we need to get into the client pod. For details on creating the client pod, see:
Hadoop components -- Spark in practice: Spark on k8s, installing Spark 2.4.4 client mode natively on Kubernetes and using it
Enter the pod with:
kubectl exec -ti spark-client-test -- bash
First, confirm which pip corresponds to the Python version we intend to use:
[root@spark-client-test-zzq spark-2.4.4-bin-hadoop2.7]# pip -V
pip 8.1.2 from /usr/local/lib/python3.6/site-packages/pip-8.1.2-py3.6.egg (python 3.6)
[root@spark-client-test-zzq spark-2.4.4-bin-hadoop2.7]# pip3 -V
pip 8.1.2 from /usr/local/lib/python3.6/site-packages/pip-8.1.2-py3.6.egg (python 3.6)
Install hail with:
pip install hail
or
pip3 install hail
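To confirm that the installation landed in the interpreter you intend to use, a quick check like the following can be run (a minimal sketch, not from the original article; hl.version() simply prints the installed hail version):
import hail as hl
print(hl.version())  # should print something like 0.2.32-a5876a0a2853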
With hail installed in that environment, create the Python program:
vi test.py
with the following content:
import os
import socket
import hail as hl
os.environ['PYSPARK_PYTHON'] = 'python3.6'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'python3.6'
os.environ['PYSPARK_SUBMIT_ARGS'] = f' --master k8s://https://{os.getenv("KUBERNETES_SERVICE_HOST")}:{os.getenv("KUBERNETES_SERVICE_PORT")} --deploy-mode client --conf spark.executor.instances=1 --jars jars/hail-all-spark.jar --conf spark.driver.extraClassPath=jars/hail-all-spark.jar --conf spark.executor.extraClassPath=jars/hail-all-spark.jar --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator --conf spark.kubernetes.container.image=123.dkr.ecr.cn-northwest-1.amazonaws.com.cn/spark/spark-py --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.driver.host={socket.gethostbyname(socket.gethostname())} pyspark-shell'
hl.init(app_name="testhail")
mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
mt.show(2)
from pyspark.sql import SparkSession
spark=SparkSession(hl.spark_context())
df = spark.createDataFrame([
    (1, 144.5, 5.9, 33, 'M'),
    (2, 167.2, 5.4, 45, 'M'),
    (3, 124.1, 5.2, 23, 'F'),
    (4, 144.5, 5.9, 33, 'M'),
    (5, 133.2, 5.7, 54, 'F'),
    (3, 124.1, 5.2, 23, 'F')])
df.show()
Note two points:
1. The jars directory under the current working directory must contain the hail-all-spark.jar package.
2. The image passed as spark.kubernetes.container.image=123.dkr.ecr.cn-northwest-1.amazonaws.com.cn/spark/spark-py is the Python-enabled Spark executor image built earlier; the Python inside it is 3.6 and must match the Python version used by our program.
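Because the PYSPARK_SUBMIT_ARGS line above is long and easy to mistype, the same string can also be assembled from a list of options before hl.init() is called. The sketch below is only a more readable rewrite of the snippet above, with the same assumed image and jar paths:
import os
import socket

# Same submit options as in test.py above, assembled from a list for readability.
k8s_master = f'k8s://https://{os.getenv("KUBERNETES_SERVICE_HOST")}:{os.getenv("KUBERNETES_SERVICE_PORT")}'
driver_host = socket.gethostbyname(socket.gethostname())
submit_opts = [
    f'--master {k8s_master}',
    '--deploy-mode client',
    '--conf spark.executor.instances=1',
    '--jars jars/hail-all-spark.jar',
    '--conf spark.driver.extraClassPath=jars/hail-all-spark.jar',
    '--conf spark.executor.extraClassPath=jars/hail-all-spark.jar',
    '--conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator',
    '--conf spark.kubernetes.container.image=123.dkr.ecr.cn-northwest-1.amazonaws.com.cn/spark/spark-py',
    '--conf spark.kubernetes.submission.waitAppCompletion=false',
    f'--conf spark.driver.host={driver_host}',
]
os.environ['PYSPARK_SUBMIT_ARGS'] = ' '.join(submit_opts) + ' pyspark-shell'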
Run the Python script:
python3.6 test.py
On success the output looks like this:
[root@spark-client-test-zzq spark-2.4.4-bin-hadoop2.7]# python3.6 test.py
/usr/local/lib/python3.6/site-packages/pandas/compat/__init__.py:85: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
20/02/22 05:09:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 2.4.1
SparkUI available at http://10.30.24.164:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.32-a5876a0a2853
LOGGING: writing to /spark/spark-2.4.4-bin-hadoop2.7/hail-20200222-0509-0.2.32-a5876a0a2853.log
2020-02-22 05:09:22 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 50 samples, and 100 variants...
[Stage 0:============================================> (6 + 1) / 8]2020-02-22 05:09:24 Hail: INFO: Coerced sorted dataset
+---------------+------------+------+------+------+------+------+------+------+------+
| locus | alleles | 0.GT | 1.GT | 2.GT | 3.GT | 4.GT | 5.GT | 6.GT | 7.GT |
+---------------+------------+------+------+------+------+------+------+------+------+
| locus<GRCh37> | array<str> | call | call | call | call | call | call | call | call |
+---------------+------------+------+------+------+------+------+------+------+------+
| 1:1 | ["A","C"] | 0/1 | 1/1 | 0/1 | 0/1 | 0/0 | 0/0 | 1/1 | 1/1 |
| 1:2 | ["A","C"] | 1/1 | 1/1 | 0/1 | 1/1 | 0/1 | 1/1 | 0/1 | 0/1 |
+---------------+------------+------+------+------+------+------+------+------+------+
showing top 2 rows
showing the first 8 of 50 columns
[Stage 3:> (0 + 1) / 1]+---+-----+---+---+---+
| _1| _2| _3| _4| _5|
+---+-----+---+---+---+
| 1|144.5|5.9| 33| M|
| 2|167.2|5.4| 45| M|
| 3|124.1|5.2| 23| F|
| 4|144.5|5.9| 33| M|
| 5|133.2|5.7| 54| F|
| 3|124.1|5.2| 23| F|
+---+-----+---+---+---+
[root@spark-client-test-zzq spark-2.4.4-bin-hadoop2.7]#
Possible problem -- ModuleNotFoundError: No module named 'hail'
The hail module cannot be found. Check that hail was actually installed, and that it was installed for the same Python version your program is running under. Install hail with:
pip3 install hail
Note that if the python command already defaults to python3, you can instead use
pip install hail
If you are not sure which Python version hail was installed for, locate the Python installations first:
whereis python
then go into each installation's lib directory and look inside site-packages:
ls site-packages | grep hail
Once you know which Python version hail was installed for, you can pin that version in the Python program itself, for example:
#!/usr/bin/env python3.6
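A quick way to see which interpreter and site-packages directory a script is actually using, which helps when the error above appears, is a small check like this (a sketch, not from the original article):
# Print the interpreter and site-packages locations this script runs with,
# so they can be compared against where pip installed hail.
import site
import sys
print(sys.executable)
print(sys.version)
print(site.getsitepackages())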
Possible problem -- pkg_resources.DistributionNotFound: The 'wheel>=0.25.0' distribution was not found and is required by pypandoc
wheel could not be resolved during installation. Try installing wheel separately first, then install hail:
pip3 install wheel
pip3 install hail
Possible improvement -- build a Python 3.7 image
Note that the current hail version, 0.2.32, targets Python 3.7, which means both the client Python and the spark-py image used for Spark on k8s should preferably be Python 3.7.
So we need to rebuild the images. The image-building steps follow the earlier articles.
Client image
Hadoop components -- Spark in practice: Spark on k8s, installing Spark 2.4.4 client mode natively on Kubernetes, submitting Python programs and running pyspark
Download jre-8u181-linux-x64.tar.gz
and place it in the same directory as the Dockerfile and spark-2.4.4-bin-hadoop2.7, like this:
zhangxiofansmbp:spark joe$ ls
Dockerfile jre-8u181-linux-x64.tar.gz spark-2.4.4-bin-hadoop2.7
The hail-all-spark.jar package must also be downloaded in advance and placed in the jars directory of the spark-2.4.4-bin-hadoop2.7 installation directory.
Here we use rackspacedot/python37 as the base image.
Search for python37 base images with:
zhangxiofansmbp:python joe$ docker search python37
NAME DESCRIPTION STARS OFFICIAL AUTOMATED
rackspacedot/python37 3
rackspacedot/python37-ansible23 1
netsynoteam/python37-requests 0
Some python37 images do not ship a bash command, which makes it impossible to get a shell inside the pod.
So before using an image, verify that bash is available:
docker run -it rackspacedot/python37 /bin/bash
If we get a shell inside the image successfully, we can also verify the Python version:
root@9e0c8a3044bf:/# python --version
Python 3.7.0
root@9e0c8a3044bf:/# python3 --version
Python 3.7.0
root@9e0c8a3044bf:/# python3.7 --version
Python 3.7.0
root@9e0c8a3044bf:/#
In this image the default python command already points directly to python3.7.
Otherwise, the Dockerfile would need to set the two environment variables that the client needs later:
ENV PYSPARK_PYTHON python3.7
ENV PYSPARK_DRIVER_PYTHON python3.7
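Either way, once a container from the finished client image is running, a quick check like the following (a sketch, run with the image's default python) shows whether the interpreter and the two variables line up:
# Print the Python-related settings Spark will pick up inside the client image.
import os
import sys
print("PYSPARK_PYTHON        =", os.environ.get("PYSPARK_PYTHON"))
print("PYSPARK_DRIVER_PYTHON =", os.environ.get("PYSPARK_DRIVER_PYTHON"))
print("interpreter           =", sys.version.split()[0])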
Now create the Dockerfile:
zhangxiofansmbp:spark joe$ ls
spark-2.4.4-bin-hadoop2.7 spark-2.4.4-bin-hadoop2.7.tgz
zhangxiofansmbp:spark joe$ vi Dockerfile
with the following content:
FROM rackspacedot/python37
WORKDIR /spark
COPY spark-2.4.4-bin-hadoop2.7 /spark/spark-2.4.4-bin-hadoop2.7
RUN env
# Copy the JRE tarball that sits in the same directory as the Dockerfile
COPY jre-8u181-linux-x64.tar.gz /spark/jre-8u181-linux-x64.tar.gz
# Create the directory that will hold the JRE
RUN mkdir -p /spark/java
# Extract it
RUN tar xvf /spark/jre-8u181-linux-x64.tar.gz -C /spark/java
RUN rm -rf /spark/jre-8u181-linux-x64.tar.gz
RUN env
RUN pip install hail
# Set environment variables
ENV JAVA_HOME /spark/java/jre1.8.0_181
ENV JRE_HOME /spark/java/jre1.8.0_181
ENV CLASSPATH $JAVA_HOME/lib/:$JRE_HOME/lib/
ENV PATH $PATH:$JAVA_HOME/bin
ENV PYSPARK_PYTHON python3.7
ENV PYSPARK_DRIVER_PYTHON python3.7
ENV SPARK_HOME /spark/spark-2.4.4-bin-hadoop2.7
ENV PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
RUN env
ENTRYPOINT [ "" ]
Note again: hail-all-spark.jar must already be in the jars directory of the spark-2.4.4-bin-hadoop2.7 installation before the image is built.
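Once the image has been built, a one-off check like the following can be run inside a container started from it to confirm the jar made it in (a sketch; the path matches the Dockerfile above):
# Verify that hail-all-spark.jar was baked into the Spark jars directory of the image.
import os
jar = "/spark/spark-2.4.4-bin-hadoop2.7/jars/hail-all-spark.jar"
print(jar, "exists:", os.path.exists(jar))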
Build, tag and push the image with:
docker build -t spark-client-py37-java:2.4.4 .
docker tag spark-client-py37-java:2.4.4 <repo>/spark-client-py37-java:2.4.4
docker push <repo>/spark-client-py37-java:2.4.4
Once the push succeeds, we have a client image address such as 123.dkr.ecr.cn-northwest-1.amazonaws.com.cn/spark-client-py37-java:2.4.4
Create the client pod. Before running pyspark we still need to create a client pod first.
Pay particular attention: this pod needs a headless Service (see the short check after the YAML below).
The YAML used is as follows:
[zzq@localhost sparktest]$ cat spark-client.yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-client-test
  labels:
    name: spark-client-test
spec:
  hostname: spark-client-test
  containers:
  - name: spark-client-test
    image: 123.dkr.ecr.cn-northwest-1.amazonaws.com.cn/spark-client-py37-java:2.4.4
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 100m
        memory: 512Mi
    command: ["/bin/sh","-c"]
    args: ['ls /spark;date;sleep 10m;date']
---
apiVersion: v1
kind: Service
metadata:
  name: spark-client-test
spec:
  clusterIP: None
  selector:
    name: spark-client-test
  ports:
  - protocol: TCP
    port: 8888
    targetPort: 8888
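The headless Service (clusterIP: None) is stressed here so that executor pods can connect back to the driver running in this pod. The small sketch below, run inside the pod, simply shows the hostname and pod IP that test.py ends up advertising as spark.driver.host:
# Show the hostname and pod IP that test.py passes to Spark as spark.driver.host.
import socket
hostname = socket.gethostname()
print("hostname:", hostname)                        # e.g. spark-client-test
print("pod IP  :", socket.gethostbyname(hostname))  # the value used for spark.driver.host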
Create the pod with:
kubectl create -f spark-client.yaml
Check the pod status:
[zzq@localhost sparktest]$ kubectl get pods|grep spark
spark-client-test 1/1 Running 0 68s
Once the pod status is Running, we can go into the pod and run Spark commands.
Enter the pod with:
kubectl exec -ti spark-client-test -- bash
After entering the pod, check that the files are in place and that the Java and Python environments are OK:
[zzq@localhost sparktest]$ kubectl exec -ti spark-client-test -- bash
root@spark-client-test-c9fbbc45d-46j9t:/spark#
root@spark-client-test-c9fbbc45d-46j9t:/spark#
root@spark-client-test-c9fbbc45d-46j9t:/spark# ls
spark-2.4.4-bin-hadoop2.7
root@spark-client-test-c9fbbc45d-46j9t:/spark# cd spark-2.4.4-bin-hadoop2.7/
root@spark-client-test-c9fbbc45d-46j9t:/spark/spark-2.4.4-bin-hadoop2.7# ls
LICENSE NOTICE R README.md RELEASE bin conf data examples jars kubernetes licenses python sbin yarn
root@spark-client-test-c9fbbc45d-46j9t:/spark/spark-2.4.4-bin-hadoop2.7#
[root@spark-client-test spark]# java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
[root@spark-client-test spark]# python --version
Python 3.7.0
[root@spark-client-test spark]# python3 --version
Python 3.7.0
[root@spark-client-test spark]# python3.7 --version
Python 3.7.0
Executor (server-side) image
Hadoop components -- Spark in practice: Spark on k8s, installing Spark 2.4.4 cluster mode natively on Kubernetes
The original image file /spark-2.4.4-bin-hadoop2.7/kubernetes/dockerfiles/spark/bindings/python/Dockerfile contains:
ARG base_img
FROM $base_img
WORKDIR /
RUN mkdir ${SPARK_HOME}/python
# TODO: Investigate running both pip and pip3 via virtualenvs
RUN apk add --no-cache python && \
apk add --no-cache python3 && \
python -m ensurepip && \
python3 -m ensurepip && \
# We remove ensurepip since it adds no functionality since pip is
# installed on the image and it just takes up 1.6MB on the image
rm -r /usr/lib/python*/ensurepip && \
pip install --upgrade pip setuptools && \
# You may install with python3 packages by using pip3.6
# Removed the .cache to save space
rm -r /root/.cache
COPY python/lib ${SPARK_HOME}/python/lib
ENV PYTHONPATH ${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-*.zip
WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]
Change it to:
ARG base_img
FROM $base_img
WORKDIR /
RUN mkdir ${SPARK_HOME}/python
# TODO: Investigate running both pip and pip3 via virtualenvs
RUN apk add --no-cache python && \
apk add --no-cache python=3.7 && \
python -m ensurepip && \
python3.7 -m ensurepip && \
rm -r /usr/lib/python*/ensurepip && \
pip install --upgrade pip setuptools && \
rm -r /root/.cache
COPY python/lib ${SPARK_HOME}/python/lib
ENV PYTHONPATH ${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-*.zip
WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]
Then build and push with:
cd ./spark-2.4.4-bin-hadoop2.7
./bin/docker-image-tool.sh -r <repo> -t my-tag build
./bin/docker-image-tool.sh -r <repo> -t my-tag push
If you are not comfortable with apk add, you can also use the Spark 2.4.5 release, whose Python Dockerfile installs packages with pip instead.
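Once both Python 3.7 images are available, the earlier test.py only needs its Python version and the executor image name updated. A minimal sketch of the relevant changes (the image name below is the placeholder produced by the docker-image-tool.sh commands above; substitute your own registry path):
import os

# Adjustments to the earlier test.py after switching to the Python 3.7 images.
os.environ['PYSPARK_PYTHON'] = 'python3.7'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'python3.7'
# Executor image built and pushed by docker-image-tool.sh above.
executor_image = '<repo>/spark-py:my-tag'
# ...and reference it in PYSPARK_SUBMIT_ARGS via --conf spark.kubernetes.container.image=<executor_image>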