1. Hadoop Overview
1.1 Hadoop 3 Core Components
HDFS: distributed file system; solves massive-scale data storage
YARN: cluster resource management and job scheduling framework; solves resource and task scheduling
MapReduce: distributed computing framework; solves massive-scale data computation
1.2 Hadoop Cluster Basics
A Hadoop cluster is really two clusters: HDFS and YARN.
The two clusters are logically separate (they neither affect nor depend on each other) but physically co-located (some of their daemons run on the same servers).
Both default to a master/slave architecture.
1.2.1 HDFS
Master role: NameNode (NN)
Slave role: DataNode (DN)
Master auxiliary role: SecondaryNameNode (SNN)
1.2.2 YARN
Master role: ResourceManager (RM)
Slave role: NodeManager (NM)
2. Environment and Preparation
2.1 Machine and Role Planning
192.168.1.131  hdp01.dialev.com  NameNode, DataNode, ResourceManager, NodeManager
192.168.1.132  hdp02.dialev.com  SecondaryNameNode, DataNode, NodeManager
192.168.1.133  hdp03.dialev.com  DataNode, NodeManager
2.2 Add hosts Entries on Every Node (/etc/hosts)
192.168.1.131 hdp01.dialev.com
192.168.1.132 hdp02.dialev.com
192.168.1.133 hdp03.dialev.com
2.3 Disable the Firewall
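The firewall must be stopped on all three nodes. A minimal sketch for CentOS 7 (assumed here, since yum is used below):
systemctl stop firewalld      # stop the running firewall
systemctl disable firewalld   # keep it from starting at boot
firewall-cmd --state          # should report "not running"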
2.4 Passwordless SSH from hdp01 to All Three Machines
echo "StrictHostKeyChecking no" > ~/.ssh/config
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa    # generate a key pair first if one does not exist yet
for ip in 192.168.1.13{1..3}; do ssh-copy-id root@$ip; done
2.5 Time Synchronization
yum -y install ntpdate
ntpdate ntp.aliyun.com
echo '*/5 * * * * ntpdate ntp.aliyun.com >/dev/null 2>&1' >> /var/spool/cron/root
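To confirm that the clocks actually agree across the cluster, a quick check (using the passwordless SSH from hdp01 set up in 2.4) is:
for h in hdp0{1..3}.dialev.com; do ssh $h date; done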
2.6 Raise the User File Descriptor Limit
vim /etc/security/limits.conf
* soft nofile 65535
* hard nofile 65535
# The change takes effect only for new login sessions (log in again, or reboot)
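After logging in again you can confirm the new limit:
ulimit -n   # should print 65535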
2.7 Install the Java Environment
tar xf jdk-8u65-linux-x64.tar.gz -C /usr/local/
cd /usr/local/
ln -sv jdk1.8.0_65/ java
vim /etc/profile.d/java.sh
export JAVA_HOME=/usr/local/java
export CLASSPATH=$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
source /etc/profile.d/java.sh
java -version
3. Installing Hadoop
3.1 Unpack the Installation Package
The software packages for this document and the related Hadoop documents are collected in this Baidu Netdisk share:
Link: https://pan.baidu.com/s/11F4THdIfgrULMn2gNcObRA?pwd=cjll
# You can also download the version you actually deploy from https://archive.apache.org/dist/hadoop/common/
tar xf hadoop-3.1.4.tar.gz -C /usr/local/
cd /usr/local/
ln -sv hadoop-3.1.4 hadoop
Directory layout:
├── bin      Basic Hadoop administration and client scripts.
├── etc      Hadoop configuration files.
├── include  C/C++ header files for programs that access HDFS or write MapReduce code.
├── lib      Dynamic and static native libraries that Hadoop exposes to applications.
├── libexec  Shell configuration files used by the individual services (log output, startup parameters and other basics).
├── sbin     Hadoop administration scripts, mainly the start/stop scripts for the HDFS and YARN services.
└── share    Compiled jars for each Hadoop module, including the official examples.
3.2 Set the Hadoop Environment Variables (hadoop-env.sh)
https://hadoop.apache.org/docs/r3.1.4/   # see the Configuration section at the bottom of the left-hand menu
cd /usr/local/hadoop/etc/hadoop
cp hadoop-env.sh hadoop-env.sh-bak
vim hadoop-env.sh
export JAVA_HOME=/usr/local/java
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
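Note that hadoop-env.sh is only sourced by the Hadoop scripts themselves, so the PATH export above does not reach an interactive shell. If you also want the hadoop/hdfs/yarn commands used later in this document on your login shell's PATH, one option (a sketch mirroring the java.sh approach in 2.7; the file name is arbitrary) is:
vim /etc/profile.d/hadoop.sh
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile.d/hadoop.sh
hadoop version   # quick sanity check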
3.3 Cluster Defaults (core-site.xml)
cp core-site.xml core-site.xml-bak
vim core-site.xml
<configuration>
<!-- RPC address of the HDFS master (NameNode) -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hdp01.dialev.com:8020</value>
</property>
<!-- Base directory for files Hadoop generates at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/Hadoop/tmp</value>
</property>
<!-- Static user for accessing the HDFS web UI -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>root</value>
</property>
</configuration>
3.4 HDFS / SecondaryNameNode Configuration (hdfs-site.xml)
cp hdfs-site.xml hdfs-site.xml-bak
vim hdfs-site.xml
<configuration>
<!-- HTTP (web UI) address of the NameNode -->
<property>
<name>dfs.namenode.http-address</name>
<value>hdp01.dialev.com:50070</value>
</property>
<!-- HTTP address of the SecondaryNameNode -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hdp02.dialev.com:50090</value>
</property>
<!-- Directory where the NameNode stores its metadata -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/Hadoop/name</value>
</property>
<!-- HDFS replication factor -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<!-- Directory where DataNodes store blocks; if unset, a path under hadoop.tmp.dir is used -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/Hadoop/data</value>
</property>
</configuration>
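The /Hadoop/tmp, /Hadoop/name and /Hadoop/data directories referenced above are normally created by the format step and by the daemons on first start, but if you prefer to pre-create them on every node (for example, to control ownership), a sketch using the SSH access from 2.4:
for ip in 192.168.1.13{1..3}; do ssh $ip "mkdir -p /Hadoop/{tmp,name,data}"; done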
3.5 MapReduce Configuration (mapred-site.xml)
cp mapred-site.xml mapred-site.xml-bak
vim mapred-site.xml
<configuration>
<!-- Tell the MapReduce framework to run on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- MR ApplicationMaster environment -->
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<!-- MR map task environment -->
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<!-- MR reduce task environment -->
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
3.6 YARN Configuration (yarn-site.xml)
cp yarn-site.xml yarn-site.xml-bak
vim yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hdp01.dialev.com</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Whether to enforce physical memory limits on containers (disabled here) -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Whether to enforce virtual memory limits on containers (disabled here) -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- Enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Keep aggregated logs for 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
</configuration>
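Before distributing the configuration it is worth making sure none of the edited XML files has a syntax error. A quick check with xmllint (from the libxml2 package, which may need to be installed first):
cd /usr/local/hadoop/etc/hadoop
for f in core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml; do xmllint --noout $f && echo "$f OK"; done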
3.7 List the Worker (Slave) Nodes
vim workers
hdp01.dialev.com
hdp02.dialev.com
hdp03.dialev.com
3.8 Sync the Configuration to the Other Nodes
cd /usr/local/
scp -r -q hadoop-3.1.4 192.168.1.132:/usr/local/
scp -r -q hadoop-3.1.4 192.168.1.133:/usr/local/
# Create the symlink on hdp02 and hdp03 as well
cd /usr/local/ && ln -sv hadoop-3.1.4 hadoop
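A quick check that the package and symlink landed on the other two nodes:
for ip in 192.168.1.13{2,3}; do ssh $ip "ls -ld /usr/local/hadoop /usr/local/hadoop-3.1.4"; done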
4. Starting Hadoop
4.1 Format the NameNode
Run this on hdp01.dialev.com, and only once. If it is run by mistake, delete the initialized directories and format again.
hdfs namenode -format
......
2022-12-26 16:40:03,355 INFO util.GSet: 0.029999999329447746% max memory 940.5 MB = 288.9 KB
2022-12-26 16:40:03,355 INFO util.GSet: capacity = 2^15 = 32768 entries
2022-12-26 16:40:03,406 INFO namenode.FSImage: Allocated new BlockPoolId: BP-631728325-192.168.1.131-1672044003397
2022-12-26 16:40:03,437 INFO common.Storage: Storage directory /Hadoop/name has been successfully formatted. # /Hadoop/name is the initialized directory; this line confirms the storage was formatted successfully.
2022-12-26 16:40:03,498 INFO namenode.FSImageFormatProtobuf: Saving image file /Hadoop/name/current/fsimage.ckpt_0000000000000000000 using no compression
2022-12-26 16:40:03,781 INFO namenode.FSImageFormatProtobuf: Image file /Hadoop/name/current/fsimage.ckpt_0000000000000000000 of size 391 bytes saved in 0 seconds .
2022-12-26 16:40:03,802 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2022-12-26 16:40:03,820 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid = 0 when meet shutdown.
2022-12-26 16:40:03,821 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hdp01.dialev.com/192.168.1.131
************************************************************/
4.2 Start the Services
1. Using the bundled shell scripts
cd /usr/local/hadoop/sbin/
HDFS cluster:
start-dfs.sh
stop-dfs.sh
YARN cluster:
start-yarn.sh
stop-yarn.sh
Whole Hadoop cluster (HDFS + YARN):
start-all.sh
stop-all.sh
jps   # check that the processes on each node match the cluster plan; the expected daemons per node are sketched at the end of this section
/usr/local/hadoop/logs is the log directory (the logs directory under the installation directory by default)
2. Manual per-daemon control (for reference):
HDFS cluster
hdfs --daemon start namenode | datanode | secondarynamenode
hdfs --daemon stop namenode | datanode | secondarynamenode
YARN cluster
yarn --daemon start resourcemanager | nodemanager
yarn --daemon stop resourcemanager | nodemanager
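Based on the role planning in 2.1 and the configuration above, jps should report roughly the following daemons on each node (PIDs omitted; this is the expected layout, not captured output):
hdp01: NameNode, DataNode, ResourceManager, NodeManager
hdp02: SecondaryNameNode, DataNode, NodeManager
hdp03: DataNode, NodeManager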
5. Verification
5.1 Cluster Status and Web UIs
1. Print the cluster status
hdfs dfsadmin -report
2. Open the YARN management UI
http://192.168.1.131:8088/cluster/nodes   # port 8088 on the host running the ResourceManager
3. Open the NameNode management UI
http://192.168.1.131:50070/dfshealth.html#tab-overview   # the host running the NameNode; the port is the value of dfs.namenode.http-address in hdfs-site.xml. The default is 50070 in Hadoop 2.x and 9870 in 3.x; 50070 works here only because the configuration above sets the 2.x-style port explicitly.
5.2 Test Directory Creation and File Upload
hadoop fs -mkdir /bowen               # create a bowen directory under the HDFS root
hadoop fs -put yarn-env.sh /bowen     # upload a file into /bowen (run from etc/hadoop so yarn-env.sh resolves, or use any local file)
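To confirm the upload, list the directory (or browse the file system from the NameNode web UI):
hadoop fs -ls /bowen   # should show yarn-env.sh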
5.3 Test MapReduce Execution
cd /usr/local/hadoop/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-3.1.4.jar pi 2 4
Number of Maps = 2
Samples per Map = 4
Wrote input for Map #0
Wrote input for Map #1
Starting Job
2022-12-27 09:20:12,868 INFO client.RMProxy: Connecting to ResourceManager at hdp01.dialev.com/192.168.1.131:8032 # the job first connects to the RM to request resources
2022-12-27 09:20:14,091 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1672045511416_0001
2022-12-27 09:20:14,503 INFO input.FileInputFormat: Total input files to process : 2
2022-12-27 09:20:14,707 INFO mapreduce.JobSubmitter: number of splits:2
2022-12-27 09:20:15,349 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1672045511416_0001
2022-12-27 09:20:15,351 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-12-27 09:20:16,072 INFO conf.Configuration: resource-types.xml not found
2022-12-27 09:20:16,073 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-12-27 09:20:16,974 INFO impl.YarnClientImpl: Submitted application application_1672045511416_0001
2022-12-27 09:20:17,204 INFO mapreduce.Job: The url to track the job: http://hdp01.dialev.com:8088/proxy/application_1672045511416_0001/
2022-12-27 09:20:17,206 INFO mapreduce.Job: Running job: job_1672045511416_0001
2022-12-27 09:20:33,618 INFO mapreduce.Job: Job job_1672045511416_0001 running in uber mode : false
2022-12-27 09:20:33,621 INFO mapreduce.Job: map 0% reduce 0% # a MapReduce job has two phases: map and reduce
2022-12-27 09:20:47,862 INFO mapreduce.Job: map 100% reduce 0%
2022-12-27 09:20:53,944 INFO mapreduce.Job: map 100% reduce 100%
2022-12-27 09:20:53,968 INFO mapreduce.Job: Job job_1672045511416_0001 completed successfully
......
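Because log aggregation was enabled in yarn-site.xml (3.6), the container logs of a finished job can be pulled back with the yarn CLI once the job completes; for the run above that would be roughly:
yarn logs -applicationId application_1672045511416_0001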
5.4 Cluster Benchmark (TestDFSIO)
1. Write test: write 5 files of 10 MB each
hadoop jar hadoop-mapreduce-client-jobclient-3.1.4-tests.jar TestDFSIO -write -nrFiles 5 -fileSize 10MB
......
2022-12-27 09:51:12,775 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
2022-12-27 09:51:12,775 INFO fs.TestDFSIO: Date & time: Tue Dec 27 09:51:12 CST 2022
2022-12-27 09:51:12,775 INFO fs.TestDFSIO: Number of files: 5 # number of files
2022-12-27 09:51:12,775 INFO fs.TestDFSIO: Total MBytes processed: 50 # total size
2022-12-27 09:51:12,775 INFO fs.TestDFSIO: Throughput mb/sec: 12.49 # throughput
2022-12-27 09:51:12,775 INFO fs.TestDFSIO: Average IO rate mb/sec: 15.46 # average IO rate
2022-12-27 09:51:12,775 INFO fs.TestDFSIO: IO rate std deviation: 7.51 # IO rate standard deviation
2022-12-27 09:51:12,776 INFO fs.TestDFSIO: Test exec time sec: 32.95 # execution time
2. Read test
hadoop jar hadoop-mapreduce-client-jobclient-3.1.4-tests.jar TestDFSIO -read -nrFiles 5 -fileSize 10MB
......
2022-12-27 09:54:23,826 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
2022-12-27 09:54:23,826 INFO fs.TestDFSIO: Date & time: Tue Dec 27 09:54:23 CST 2022
2022-12-27 09:54:23,826 INFO fs.TestDFSIO: Number of files: 5
2022-12-27 09:54:23,826 INFO fs.TestDFSIO: Total MBytes processed: 50
2022-12-27 09:54:23,826 INFO fs.TestDFSIO: Throughput mb/sec: 94.34
2022-12-27 09:54:23,827 INFO fs.TestDFSIO: Average IO rate mb/sec: 101.26
2022-12-27 09:54:23,827 INFO fs.TestDFSIO: IO rate std deviation: 30.07
2022-12-27 09:54:23,827 INFO fs.TestDFSIO: Test exec time sec: 34.39
3. Clean up the test data
hadoop jar hadoop-mapreduce-client-jobclient-3.1.4-tests.jar TestDFSIO -clean
"一劳永逸" 的话,有是有的,而 "一劳永逸" 的事却极少