
Traditional Hadoop/Spark Deployment and a Containerization Reference


hadoop-spark

  • The setup drew on the following documentation and troubleshooting articles:
https://www.cnblogs.com/luo630/p/13271637.html
https://www.cnblogs.com/dintalk/p/12234718.html
https://blog.csdn.net/qq_34319644/article/details/115555522
https://blog.csdn.net/LZB_XM/article/details/125306125
https://blog.csdn.net/wild46cat/article/details/53731703
https://blog.csdn.net/qq_38712932/article/details/84197154

Hadoop deployment

  • Stable-release download from the Tsinghua mirror:
https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/stable/
  • Hadoop deployment steps
#Create the directory the software will be unpacked into and set its ownership
[root@hadoop ~]# ll
total 856152
-rw-r--r-- 1 root   root   695457782 Jul 30 03:11 hadoop-3.3.4.tar.gz
-rw-r--r-- 1 root   root   181238643 Jun  9 23:20 jdk-8u60-linux-x64.tar.gz
drwxr-xr-x 2 hadoop hadoop         6 Sep 22 18:35 soft

#Create the unpack directory
mkdir soft
#Unpack the tarballs
tar xf hadoop-3.3.4.tar.gz -C soft/
tar xf jdk-8u60-linux-x64.tar.gz -C soft/
#Create symlinks so the commands are on PATH
ln -s /root/soft/hadoop-3.3.4/bin/hadoop /usr/bin/
ln -s /root/soft/jdk1.8.0_60/bin/java /usr/bin/
  • Unpack the JDK and set up a symlink
#The Oracle JDK is used here; the binary tarball works as-is. OpenJDK is not recommended for this setup; for production, stick with the Oracle JDK.
[root@hadoop soft]# pwd
/root/soft

#Unpack
tar xf jdk-8u60-linux-x64.tar.gz -C soft/
#Symlink
ln -s jdk1.8.0_60 jdk
  • Set the Java environment variables
[root@hadoop soft]# vim /etc/profile
[root@hadoop soft]# tail -3 /etc/profile
JAVA_HOME=/root/soft/jdk/
PATH=$PATH:$JAVA_HOME/bin
[root@hadoop soft]# source /etc/profile
[root@hadoop ~]# java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
  • Unpack Hadoop and set up a symlink
[root@hadoop ~]# tar xf hadoop-3.3.4.tar.gz -C soft/
[root@hadoop soft]# ln -s /root/soft/hadoop-3.3.4/ /root/soft/hadoop
[root@hadoop soft]# ll
total 0
lrwxrwxrwx  1 root root  24 Sep 22 18:46 hadoop -> /root/soft/hadoop-3.3.4/
drwxr-xr-x 10 1024 1024 215 Jul 29 22:44 hadoop-3.3.4
lrwxrwxrwx  1 root root  11 Sep 22 18:39 jdk -> jdk1.8.0_60
drwxr-xr-x  8   10  143 266 Sep 22 18:39 jdk1.8.0_60
  • Set the Hadoop environment variables
#Hadoop's own environment script (hadoop-env.sh) also needs JAVA_HOME set; see below

[root@hadoop soft]# vim /etc/profile
[root@hadoop soft]# tail -7 /etc/profile
JAVA_HOME=/root/soft/jdk/
PATH=$PATH:$JAVA_HOME/bin
HADOOP_HOME=/root/soft/hadoop/
PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
[root@hadoop soft]# source /etc/profile
#Edit hadoop-env.sh and fill in this variable
[root@hadoop ~]# grep JAVA_HOME /root/soft/hadoop/etc/hadoop/hadoop-env.sh | grep -v ^#
export JAVA_HOME=/root/soft/jdk/
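
A quick sanity check that the PATH and JAVA_HOME wiring works (the exact build strings will vary, but the versions should match the tarballs above):

#Both commands should resolve and report 3.3.4 / 1.8.0_60
hadoop version
java -version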
  • Add the following variables to the top of both start-dfs.sh and stop-dfs.sh
#If the daemons should run as a different user, change these accordingly; for a first setup root is fine
[root@hadoop ~]# cat /root/soft/hadoop/sbin/start-dfs.sh
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
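
Instead of pasting these lines by hand, the same edit can be scripted; this is a sketch of the sed approach the Dockerfile at the end of this post uses (GNU sed, paths as above):

#Insert the user variables at the top of start-dfs.sh and stop-dfs.sh
for f in /root/soft/hadoop/sbin/start-dfs.sh /root/soft/hadoop/sbin/stop-dfs.sh; do
    sed -i '2iexport HDFS_NAMENODE_USER=root\nexport HDFS_DATANODE_USER=root\nexport HDFS_SECONDARYNAMENODE_USER=root\nexport YARN_RESOURCEMANAGER_USER=root\nexport YARN_NODEMANAGER_USER=root' "$f"
done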
  • Add the same environment variables to start-yarn.sh and stop-yarn.sh
[root@hadoop sbin]# cat /root/soft/hadoop/sbin/start-yarn.sh
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
export HADOOP_SECURE_DN_USER=yarn
  • In /etc/ssh/ssh_config, uncomment StrictHostKeyChecking and change it to no
[root@hadoop sbin]# vim /etc/ssh/ssh_config
StrictHostKeyChecking no
  • Set up passwordless SSH for Hadoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

#Test: it is correct if no password is asked for. ssh localhost must work without a password, because the Hadoop daemons communicate with each other over SSH.
ssh localhost
  • Finally, the logs directory turned out to be missing as well
[root@hadoop sbin]# mkdir -p /root/soft/hadoop-3.3.4/logs
  • Configure the NameNode and DataNode
#This configures the Hadoop web UI; the important port is 50070
#9000 is the internal HDFS RPC port

[root@hadoop ~]# cat soft/hadoop/etc/hadoop/core-site.xml 
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
</property>
<property>
      <name>hadoop.tmp.dir</name>
      <value>/root/soft/hadoop/data/tmp/</value>
</property>

</configuration>

[root@hadoop ~]# cat soft/hadoop/etc/hadoop/hdfs-site.xml 
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
      <name>dfs.namenode.http-address</name>
      <value>0.0.0.0:50070</value>
      <description>A base for other temporary directories.</description>
</property>

</configuration>
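
Before formatting, a quick way to confirm these values are actually being picked up is hdfs getconf, which prints the effective configuration:

#Should echo back the values from the two XML files above
hdfs getconf -confKey fs.default.name
hdfs getconf -confKey dfs.namenode.http-address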
  • Verify the installation (note: the current user needs write permission on the target directory; with a purely local deployment no daemons have to be running since it just uses the Linux filesystem, and if a regular user writes straight to the filesystem root it will certainly throw an exception)
[root@hadoop ~]# hdfs dfs -put hadoop-3.3.4.tar.gz /home/hadoop/test
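
If the put fails because the target directory does not exist or is not writable, a sketch like this (same example paths as above) creates it first and then checks the upload:

#Create the target directory, upload, and list it
hdfs dfs -mkdir -p /home/hadoop/test
hdfs dfs -put hadoop-3.3.4.tar.gz /home/hadoop/test
hdfs dfs -ls /home/hadoop/test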
  • Format HDFS (only needed on the very first install or use)
[root@hadoop ~]# ./soft/hadoop/bin/hdfs namenode -format
  • Start Hadoop
[root@hadoop ~]# ./soft/hadoop/sbin/start-all.sh
  • Run a Hadoop example (one of the bundled MapReduce examples)
[root@hadoop ~]# hadoop jar ~/soft/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 10 10
  • Hadoop cluster info page (screenshot)

  • Hadoop web UI (screenshot)

Scala deployment

  • Deployment steps: download, unpack, symlink, and set the environment variables
[root@hadoop soft]# wget https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz
[root@hadoop ~]# tar xf scala-2.11.12.tgz -C soft/
[root@hadoop ~]# ln -s /root/soft/scala-2.11.12 /root/soft/scala

[root@hadoop soft]# cat /etc/profile
export SCALA_HOME=/root/soft/scala
export PATH=$SCALA_HOME/bin:$PATH
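
After reloading the profile, a one-line check confirms Scala is on the PATH:

#Should print: Scala code runner version 2.11.12 ...
source /etc/profile && scala -version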

Spark deployment

  • Deployment steps
  • Download, unpack, and rename
[root@hadoop ~]# wget --no-check-certificate https://dlcdn.apache.org/spark/spark-3.2.2/spark-3.2.2-bin-hadoop3.2.tgz

[root@hadoop ~]# tar xf spark-3.2.2-bin-hadoop3.2.tgz -C soft/
[root@hadoop soft]# mv spark-3.2.2-bin-hadoop3.2 spark
  • Configure the environment variables
[root@hadoop soft]# cat /etc/profile
export SPARK_HOME=/root/soft/spark 
export PATH=$PATH:$SPARK_HOME/bin
  • Add environment variables to conf/spark-env.sh (created by copying spark-env.sh.template)
[root@hadoop conf]# cat spark-env.sh
export JAVA_HOME=/root/soft/jdk/
PATH=$PATH:$JAVA_HOME/bin

HADOOP_HOME=/root/soft/hadoop/
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

export SPARK_HOME=/root/soft/spark
export PATH=$PATH:$SPARK_HOME/bin
export SPARK_MASTER_PORT=7077

export SCALA_HOME=/root/soft/scala
export PATH=$SCALA_HOME/bin:$PATH

  • Change the web UI port in the master start script (optional; only needed if 8080 is already taken on the host)
[root@hadoop ~]# cat ./soft/spark/sbin/start-master.sh 
#Originally 8080; changed to 8081
if [ "$SPARK_MASTER_WEBUI_PORT" = "" ]; then
  SPARK_MASTER_WEBUI_PORT=8081
fi
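
The same change can be made non-interactively with a single sed, which is also how the Dockerfile later in this post handles it (sketch):

#Switch the default master web UI port from 8080 to 8081
sed -i 's#SPARK_MASTER_WEBUI_PORT=8080#SPARK_MASTER_WEBUI_PORT=8081#g' /root/soft/spark/sbin/start-master.sh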
  • Configure the Spark history server port (this was not fully worked out; it can be skipped)
[root@hadoop conf]# pwd
/root/soft/spark/conf
[root@hadoop conf]# cp spark-defaults.conf.template spark-defaults.conf
[root@hadoop conf]# cat spark-defaults.conf
# Example:
spark.master                     spark://master:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8021/directory
spark.history.ui.port            18080
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
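
If the history server is wanted after all, the hostnames in the example above (master, namenode:8021) would need to be replaced with this setup's localhost:9000, the event log directory has to exist in HDFS, and the history server started; a minimal sketch under those assumptions:

#Assumes spark.eventLog.dir was changed to hdfs://localhost:9000/directory for this single-node setup
hdfs dfs -mkdir -p /directory
/root/soft/spark/sbin/start-history-server.sh
#The history UI then listens on spark.history.ui.port (18080)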
  • Start Spark
[root@hadoop ~]# ./soft/spark/sbin/start-all.sh
  • Run one of the bundled Spark examples (run-example is a wrapper around spark-submit)
./soft/spark/bin/run-example --master spark://10.0.0.100:7077 sql.SparkSQLExample
  • Start spark-shell (an interactive shell)
[root@hadoop ~]# ./soft/spark/bin/spark-shell
  • Test
#A quick test: load a file from HDFS and one from the local filesystem into RDDs; counting the records works the same way (see the sketch after this block)

[root@hadoop ~]# ./soft/spark/bin/spark-shell
scala> val line = sc.textFile("/wordcountinput/123")
line: org.apache.spark.rdd.RDD[String] = /wordcountinput/123 MapPartitionsRDD[1] at textFile at <console>:23

scala> val line= sc.textFile("file:///home/spark/conf/spark-env.sh")
line: org.apache.spark.rdd.RDD[String] = file:///home/spark/conf/spark-env.sh MapPartitionsRDD[3] at textFile at <console>:23

scala> :quit
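
To actually get the count mentioned in the comment above, calling count() on the RDD is enough; a non-interactive sketch that pipes a one-liner into spark-shell (using this setup's local spark-env.sh path):

#Count the lines of a local file without opening the interactive shell
echo 'println(sc.textFile("file:///root/soft/spark/conf/spark-env.sh").count())' | /root/soft/spark/bin/spark-shell --master local[*]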

Full startup and test run

If everything above has been installed, run through the steps below in this order.

  • Format HDFS (only needed the very first time; not afterwards)
./soft/hadoop/bin/hdfs namenode -format
  • Start Hadoop
./soft/hadoop/sbin/start-all.sh
  • Start Spark
./soft/spark/sbin/start-all.sh
  • Start spark-shell
./soft/spark/bin/spark-shell
  • spark-shell test
val textFile = spark.read.textFile("hdfs://localhost:9000/input/README.txt")
  • Spark test with spark-submit (a jps check follows this step)
spark-submit --class org.apache.spark.examples.SparkPi --master local[*] ./soft/spark/examples/jars/spark-examples_2.12-3.2.2.jar 100
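
After running through the steps above, jps (shipped with the JDK) is a quick way to confirm every daemon came up; roughly these processes should be listed:

#Expected daemons after both start-all.sh scripts
jps
#NameNode, DataNode, SecondaryNameNode   -> HDFS
#ResourceManager, NodeManager            -> YARN
#Master, Worker                          -> Spark standalone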
  • Problem: no NameNode or DataNode processes show up after formatting
  • This problem has been solved; the fix is already reflected in the configuration above
#Fix
Delete all the data under the /tmp/ directory,
then add the property below to core-site.xml.
#Reference
[root@hadoop hadoop]# cat core-site.xml 
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
      <name>hadoop.tmp.dir</name>
      <value>/root/soft/hadoop/hadoop_tmp</value>
      <description>A base for other temporary directories.</description>
</property>

</configuration>
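
Applied concretely, the fix above amounts to something like this (a sketch; the exact /tmp paths depend on which user the daemons ran as, here assumed to be root):

#Remove the stale HDFS data under /tmp, create the new hadoop.tmp.dir, then re-format and restart
rm -rf /tmp/hadoop-root*
mkdir -p /root/soft/hadoop/hadoop_tmp
/root/soft/hadoop/bin/hdfs namenode -format
/root/soft/hadoop/sbin/start-all.sh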

Final screenshots of all the web UIs

  • hadoop:8088

  • hdfs:50070

  • spark-shell:

  • spark-shell:4040

  • spark:8081

Containerizing Hadoop and Spark

  • Documentation reference
https://github.com/bambrow/docker-hadoop-workbench
  • Command reference
docker build -t hadoop-spark .

docker run -it  --privileged=true hadoop-spark

docker run -d --name hadoop-spark  -p 8088:8088 -p 8080:8080 -p 50070:50070 -p 4040:4040  --restart=always   hadoop-spark:latest
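
Once the container is up, a quick check that it stays healthy and the published web UIs respond (container name and ports as in the commands above):

#Check container status and probe the published UIs
docker ps --filter name=hadoop-spark
curl -sSf http://localhost:50070 >/dev/null && echo "HDFS UI reachable"
curl -sSf http://localhost:8088  >/dev/null && echo "YARN UI reachable"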

Dockerfile

If your Docker version is old, comment out the health check.

Docker 17.05 or later works best; older versions do not support the --start-period parameter used in the HEALTHCHECK below.

[root@hadoop-docker Container]# cat Dockerfile 
FROM centos:7.9.2009
LABEL author=QuYi hadoop=3.3.4 jdk=1.8 scala=2.11.12 spark=3.2.2

#Default working directory
WORKDIR /root/soft/

#Copy in the yum repo configuration
COPY CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo
COPY epel.repo /etc/yum.repos.d/epel.repo


#Copy in the tarballs; ADD unpacks them automatically
ADD hadoop-3.3.4.tar.gz           /root/soft/
ADD jdk-8u60-linux-x64.tar.gz     /root/soft/
ADD scala-2.11.12.tgz             /root/soft/
ADD spark-3.2.2-bin-hadoop3.2.tgz /root/soft/

#Rename the directories and create symlinks for convenience
RUN     mv /root/soft/hadoop-3.3.4 /root/soft/hadoop \
    &&  mv /root/soft/jdk1.8.0_60 /root/soft/jdk \
    &&  mv /root/soft/scala-2.11.12 /root/soft/scala \
    &&  mv /root/soft/spark-3.2.2-bin-hadoop3.2 /root/soft/spark \  
    &&  ln -s /root/soft/hadoop/bin/hadoop /usr/bin/ \
    &&  ln -s /root/soft/jdk/bin/java  /usr/bin/ \
    &&  ln -s /root/soft/scala/bin/scala /usr/bin/ \
    &&  ln -s /root/soft/spark/bin/spark-shell /usr/bin/
 
#Environment variables for Hadoop and the JDK
ENV JAVA_HOME="/root/soft/jdk/"
ENV PATH="$PATH:$JAVA_HOME/bin"
ENV HADOOP_HOME="/root/soft/hadoop/"
ENV PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin" 
ENV JAVA_LIBRARY_PATH="/root/soft/hadoop/lib/native"
#Spark environment variables
ENV SPARK_HOME="/root/soft/spark"
ENV PATH="$PATH:$SPARK_HOME/bin"
#Scala environment variables
ENV SCALA_HOME="/root/soft/scala"
ENV PATH="$SCALA_HOME/bin:$PATH" 

#Insert the HDFS/YARN user variables into start-dfs.sh
RUN    sed -i '2iexport HDFS_NAMENODE_USER=root' /root/soft/hadoop/sbin/start-dfs.sh \
    && sed -i '3iexport HDFS_DATANODE_USER=root' /root/soft/hadoop/sbin/start-dfs.sh \
    && sed -i '4iexport HDFS_SECONDARYNAMENODE_USER=root' /root/soft/hadoop/sbin/start-dfs.sh \
    && sed -i '5iexport YARN_RESOURCEMANAGER_USER=root' /root/soft/hadoop/sbin/start-dfs.sh \
    && sed -i '6iexport YARN_NODEMANAGER_USER=root' /root/soft/hadoop/sbin/start-dfs.sh
#Insert the same variables into stop-dfs.sh

RUN    sed -i '2iexport HDFS_NAMENODE_USER=root' /root/soft/hadoop/sbin/stop-dfs.sh \
    && sed -i '3iexport HDFS_DATANODE_USER=root' /root/soft/hadoop/sbin/stop-dfs.sh \
    && sed -i '4iexport HDFS_SECONDARYNAMENODE_USER=root' /root/soft/hadoop/sbin/stop-dfs.sh \
    && sed -i '5iexport YARN_RESOURCEMANAGER_USER=root' /root/soft/hadoop/sbin/stop-dfs.sh \
    && sed -i '6iexport YARN_NODEMANAGER_USER=root' /root/soft/hadoop/sbin/stop-dfs.sh

#Add JAVA_HOME to Hadoop's hadoop-env.sh
RUN  sed -i '2iexport JAVA_HOME=/root/soft/jdk/' /root/soft/hadoop/etc/hadoop/hadoop-env.sh 

#Add the user variables to start-yarn.sh
RUN    sed -i '2iexport HDFS_NAMENODE_USER=root' /root/soft/hadoop/sbin/start-yarn.sh \
    && sed -i '3iexport HDFS_DATANODE_USER=root' /root/soft/hadoop/sbin/start-yarn.sh \
    && sed -i '4iexport HDFS_SECONDARYNAMENODE_USER=root' /root/soft/hadoop/sbin/start-yarn.sh \
    && sed -i '5iexport YARN_RESOURCEMANAGER_USER=root' /root/soft/hadoop/sbin/start-yarn.sh \
    && sed -i '6iexport YARN_NODEMANAGER_USER=root' /root/soft/hadoop/sbin/start-yarn.sh \
    && sed -i '7iexport HADOOP_SECURE_DN_USER=yarn' /root/soft/hadoop/sbin/start-yarn.sh

#Add the user variables to stop-yarn.sh

RUN    sed -i '2iexport HDFS_NAMENODE_USER=root' /root/soft/hadoop/sbin/stop-yarn.sh \
    && sed -i '3iexport HDFS_DATANODE_USER=root' /root/soft/hadoop/sbin/stop-yarn.sh  \
    && sed -i '4iexport HDFS_SECONDARYNAMENODE_USER=root' /root/soft/hadoop/sbin/stop-yarn.sh \
    && sed -i '5iexport YARN_RESOURCEMANAGER_USER=root' /root/soft/hadoop/sbin/stop-yarn.sh \
    && sed -i '6iexport YARN_NODEMANAGER_USER=root' /root/soft/hadoop/sbin/stop-yarn.sh \
    && sed -i '7iexport HADOOP_SECURE_DN_USER=yarn' /root/soft/hadoop/sbin/stop-yarn.sh

#Create spark-env.sh from its template, add environment variables to it, and change the master web UI port
#(editing only the .template would have no effect, since Spark sources conf/spark-env.sh)
RUN    cp /root/soft/spark/conf/spark-env.sh.template /root/soft/spark/conf/spark-env.sh \
    && sed -i '2iexport JAVA_HOME=/root/soft/jdk/' /root/soft/spark/conf/spark-env.sh \
    && sed -i '3iPATH=$PATH:$JAVA_HOME/bin' /root/soft/spark/conf/spark-env.sh  \
    && sed -i '4iHADOOP_HOME=/root/soft/hadoop/' /root/soft/spark/conf/spark-env.sh \
    && sed -i '5iexport HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' /root/soft/spark/conf/spark-env.sh \
    && sed -i '6iPATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' /root/soft/spark/conf/spark-env.sh \
    && sed -i '7iexport SPARK_HOME=/root/soft/spark' /root/soft/spark/conf/spark-env.sh \
    && sed -i '8iexport PATH=$PATH:$SPARK_HOME/bin' /root/soft/spark/conf/spark-env.sh \
    && sed -i '9iexport SPARK_MASTER_PORT=7077' /root/soft/spark/conf/spark-env.sh \
    && sed -i '10iexport SCALA_HOME=/root/soft/scala' /root/soft/spark/conf/spark-env.sh \
    && sed -i '11iexport PATH=$SCALA_HOME/bin:$PATH' /root/soft/spark/conf/spark-env.sh \
    && sed -i 's#SPARK_MASTER_WEBUI_PORT=8080#SPARK_MASTER_WEBUI_PORT=8081#g' /root/soft/spark/sbin/start-master.sh




#Set up passwordless SSH for Hadoop and create the logs directory
RUN    yum install -y  openssh openssh-clients openssh-server iproute initscripts nc  \   
    && /usr/sbin/sshd-keygen -A \ 
    && /usr/sbin/sshd  \
    && /usr/bin/ssh-keygen -t rsa -P '' -f /root/.ssh/id_rsa   \
    && cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys \
    && ssh-copy-id -i /root/.ssh/id_rsa.pub -o StrictHostKeyChecking=no root@localhost \
    && chmod 700 /root/.ssh \
    && chmod 600 /root/.ssh/authorized_keys \
    && chmod og-wx /root/.ssh/authorized_keys \
    && sed -i '4iStrictHostKeyChecking no' /etc/ssh/ssh_config \
    && sed -i '3iPort 22' /etc/ssh/sshd_config  \
    && /usr/sbin/sshd  \
    && mkdir -p  /root/soft/hadoop/logs \
    && echo `hostname -I` 'localhost'  >> /etc/hosts


#Copy in the Hadoop configuration files
COPY   core-site.xml /root/soft/hadoop/etc/hadoop/ 
COPY   hdfs-site.xml /root/soft/hadoop/etc/hadoop/

#Copy in the entrypoint script
COPY entrypoint.sh /

#Make the entrypoint script executable
RUN    chmod 777 /entrypoint.sh 

#Exposed ports: YARN 8088, HDFS 50070, Spark master web UI 8081, spark-shell UI 4040; the rest are internal communication ports

EXPOSE 8040 9864 9000 8042 9866 9867 9868 33389 50070 8088 8030 36638 8031 8032 8033 7077 41904 8081 8082 4040

HEALTHCHECK  --interval=5s \
             --timeout=3s \
             --start-period=30s \
             --retries=3 \
             CMD  echo test | nc localhost 8088 || exit 1

CMD ["/entrypoint.sh"]
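
The post never shows entrypoint.sh itself. Purely as an illustration (not the author's script), something along these lines would satisfy the CMD and the HEALTHCHECK above: start sshd, format HDFS only on the first run, start Hadoop and Spark, and keep a foreground process alive:

#!/bin/bash
#Hypothetical entrypoint sketch; the real entrypoint.sh is not shown in the post
/usr/sbin/sshd                                      #start-all.sh needs to ssh to localhost
if [ ! -d /root/soft/hadoop/data/tmp/dfs ]; then    #assumption: hadoop.tmp.dir from core-site.xml only exists after the first format
    /root/soft/hadoop/bin/hdfs namenode -format -force -nonInteractive
fi
/root/soft/hadoop/sbin/start-all.sh
/root/soft/spark/sbin/start-all.sh
tail -f /dev/null                                   #keep the container in the foreground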

Directory layout

[root@oss-server Container-docker]# ll
total 1178676
-rw-r--r-- 1 root root      2523 Aug  4 15:04 CentOS-Base.repo
-rw-r--r-- 1 root root       982 Sep 25 16:01 core-site.xml
-rw-r--r-- 1 root root      6285 Sep 26 16:29 Dockerfile
-rw-r--r-- 1 root root       945 Sep 26 16:55 docker-hadoopspark.sh
-rwxr-xr-x 1 root root       648 Sep 26 15:44 entrypoint.sh
-rw-r--r-- 1 root root       664 Aug  4 15:04 epel.repo
-rw-r--r-- 1 root root 695457782 Sep 26 10:54 hadoop-3.3.4.tar.gz
-rw-r--r-- 1 root root       952 Sep 25 15:23 hdfs-site.xml
-rw-r--r-- 1 root root 181238643 Sep 26 10:55 jdk-8u60-linux-x64.tar.gz
-rw-r--r-- 1 root root  29114457 Sep 26 10:55 scala-2.11.12.tgz
-rw-r--r-- 1 root root 301112604 Sep 26 10:57 spark-3.2.2-bin-hadoop3.2.tgz

Command script kept as a record

[root@oss-server Container-docker]# cat docker-hadoopspark.sh 
#!/bin/bash

#==================================Spark==================================#

#Start the container (no data persistence configured)
docker run -d --name HadoopSpark  -p 8088:8088 -p 8080:8080 -p 50070:50070 -p 4040:4040  --restart=always   hadoop-spark:latest

#Interactive data processing with spark-shell

#Step 1: attach to the container
docker exec -it HadoopSpark bash

#Start spark-shell
/root/soft/spark/bin/spark-shell

#==================================Hadoop==================================#

#Step 1: attach to the container
docker exec -it HadoopSpark bash

#If this is the first time using Hadoop, HDFS can be formatted here (optional; kept as a record)
/root/soft/hadoop/bin/hdfs namenode -format

#Start Hadoop
/root/soft/hadoop/sbin/start-all.sh

#==================================Scala==================================#

#Start Scala (not needed if you use Spark above; kept as a record)
