# I. Environment Setup

Download links:

- Apache Hadoop: https://hadoop.apache.org/releases.html
- Cloudera Hadoop:
- Hortonworks Hadoop: https://hortonworks.com/downloads

1. Extract the archive: `tar -zxf /opt/software/hadoop-3.1.3.tar.gz -C /opt/module/`
2. Configure the environment variables and apply them:
   1. `vim /etc/profile.d/my_env.sh`

      ```Bash
      #HADOOP_HOME
      export HADOOP_HOME=/opt/module/hadoop
      export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
      ```

   2. `source /etc/profile.d/my_env.sh`
3. Verify the installation: `hadoop version`
# II. Environment Configuration

## 1. Hadoop Local (Standalone) Mode
Run the wordcount example:

1. Prepare a local input directory and input files on the server; the output directory must not exist yet.
2. Run the example: `hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount <input dir> <output dir>`

> hadoop: the command under the bin directory
>
> jar: tells hadoop to run a jar file
>
> share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar: the jar containing the official examples
>
> wordcount: the name of the example inside that jar
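A minimal end-to-end run in local mode might look like the following sketch; the input directory, file name, and sample words are only an illustration.

```Bash
# Prepare some local input (local mode reads from the local file system)
mkdir -p ~/wcinput
echo "hadoop yarn hadoop mapreduce" > ~/wcinput/word.txt

# Run the official wordcount example and inspect the result
cd $HADOOP_HOME
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount ~/wcinput ~/wcoutput
cat ~/wcoutput/part-r-00000
```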
## 2. Hadoop Cluster Mode

This section covers the fully distributed mode.
### 1. Deployment Plan and Environment Configuration

|      | hadoop102 | hadoop103 | hadoop104 |
| ---- | --------- | --------- | --------- |
| HDFS | **NameNode**<br>DataNode | DataNode | **SecondaryNameNode**<br>DataNode |
| YARN | NodeManager | **ResourceManager**<br>NodeManager | NodeManager |
### 1.2 Configuration Files

Every node needs these configuration files; a distribution script (see 4.6) makes it easy to sync them, for example as sketched below.
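A quick way to push the whole configuration directory from hadoop102 to the other nodes (assuming `HADOOP_HOME` is set as in the environment setup and passwordless SSH is configured):

```Bash
# Sync the Hadoop configuration directory to the other cluster nodes
for host in hadoop103 hadoop104; do
    rsync -av $HADOOP_HOME/etc/hadoop/ $host:$HADOOP_HOME/etc/hadoop/
done
```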
#### 1.2.1 workers
- Configure the group-start file `workers` (no extra spaces or blank lines allowed): `vim ./etc/hadoop/workers`

  ```Bash
  hadoop102
  hadoop103
  hadoop104
  ```
#### 1.2.2 core-site.xml

```XML
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://hadoop102:8020</value></property>
  <property><name>hadoop.tmp.dir</name><value>/opt/module/hadoop/data</value></property>
  <property><name>hadoop.http.staticuser.user</name><value>wuqiuxu</value></property>
  <property><name>hadoop.proxyuser.wuqiuxu.hosts</name><value>*</value></property>
  <property><name>hadoop.proxyuser.wuqiuxu.groups</name><value>*</value></property>
  <property><name>hadoop.proxyuser.wuqiuxu.users</name><value>*</value></property>
  <property><name>fs.trash.interval</name><value>1</value></property>
</configuration>
```
#### 1.2.3 hdfs-site.xml

```XML
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>dfs.namenode.http-address</name><value>hadoop102:9870</value></property>
  <property><name>dfs.namenode.secondary.http-address</name><value>hadoop104:9868</value></property>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.namenode.checkpoint.period</name><value>3600</value></property>
  <property><name>dfs.namenode.checkpoint.txns</name><value>1000000</value><description>Number of transactions between checkpoints</description></property>
  <property><name>dfs.namenode.checkpoint.check.period</name><value>60</value><description>Check the transaction count once a minute</description></property>
  <property><name>dfs.blockreport.intervalMsec</name><value>21600000</value><description>Determines block reporting interval in milliseconds.</description></property>
  <property><name>dfs.datanode.directoryscan.interval</name><value>21600</value><description>Interval in seconds for Datanode to scan data directories and reconcile the difference between blocks in memory and on the disk. Support multiple time unit suffix(case insensitive), as described in dfs.heartbeat.interval.</description></property>
  <property><name>dfs.namenode.heartbeat.recheck-interval</name><value>300000</value></property>
  <property><name>dfs.heartbeat.interval</name><value>3</value></property>
  <property><name>dfs.datanode.data.dir</name><value>file://${hadoop.tmp.dir}/dfs/data1,file://${hadoop.tmp.dir}/dfs/data2</value></property>
  <property><name>dfs.hosts</name><value>/opt/module/hadoop-3.1.3/etc/hadoop/whitelist</value></property>
  <property><name>dfs.hosts.exclude</name><value>/opt/module/hadoop-3.1.3/etc/hadoop/blacklist</value></property>
</configuration>
```
#### 1.2.4 yarn-site.xml

```XML
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
  <property><name>yarn.resourcemanager.hostname</name><value>hadoop103</value></property>
  <property><name>yarn.nodemanager.env-whitelist</name><value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value></property>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>4096</value></property>
  <property><name>yarn.scheduler.minimum-allocation-mb</name><value>1024</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>2048</value></property>
  <property><name>yarn.nodemanager.pmem-check-enabled</name><value>true</value></property>
  <property><name>yarn.nodemanager.vmem-check-enabled</name><value>false</value></property>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>4</value></property>
  <property><name>yarn.scheduler.minimum-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.scheduler.maximum-allocation-vcores</name><value>2</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
  <property><name>yarn.log.server.url</name><value>http://hadoop102:19888/jobhistory/logs</value></property>
  <property><name>yarn.log-aggregation.retain-seconds</name><value>604800</value></property>
  <property><name>yarn.resourcemanager.scheduler.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value></property>
  <property><name>yarn.resourcemanager.scheduler.client.thread-count</name><value>8</value></property>
  <property><name>yarn.nodemanager.resource.count-logical-processors-as-cores</name><value>false</value></property>
  <property><name>yarn.nodemanager.resource.detect-hardware-capabilities</name><value>false</value></property>
  <property><name>yarn.nodemanager.resource.pcores-vcores-multiplier</name><value>1.0</value></property>
</configuration>
```
#### 1.2.5 mapred-site.xml

```XML
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
  <property><name>mapreduce.jobhistory.address</name><value>hadoop102:10020</value></property>
  <property><name>mapreduce.jobhistory.webapp.address</name><value>hadoop102:19888</value></property>
  <property><name>mapreduce.job.ubertask.enable</name><value>true</value></property>
  <property><name>mapreduce.job.ubertask.maxmaps</name><value>9</value></property>
  <property><name>mapreduce.job.ubertask.maxreduces</name><value>1</value></property>
  <property><name>mapreduce.job.ubertask.maxbytes</name><value></value></property>
</configuration>
```
#### 1.2.6 capacity-scheduler.xml

```XML
<property><name>yarn.scheduler.capacity.root.queues</name><value>default,hive</value><description>The queues at the this level (root is the root queue).</description></property>
<property><name>yarn.scheduler.capacity.root.default.capacity</name><value>40</value></property>
<property><name>yarn.scheduler.capacity.root.default.maximum-capacity</name><value>60</value></property>
<property><name>yarn.scheduler.capacity.root.hive.capacity</name><value>60</value></property>
<property><name>yarn.scheduler.capacity.root.hive.user-limit-factor</name><value>1</value></property>
<property><name>yarn.scheduler.capacity.root.hive.maximum-capacity</name><value>80</value></property>
<property><name>yarn.scheduler.capacity.root.hive.state</name><value>RUNNING</value></property>
<property><name>yarn.scheduler.capacity.root.hive.acl_submit_applications</name><value>*</value></property>
<property><name>yarn.scheduler.capacity.root.hive.acl_administer_queue</name><value>*</value></property>
<property><name>yarn.scheduler.capacity.root.hive.acl_application_max_priority</name><value>*</value></property>
<property><name>yarn.scheduler.capacity.root.hive.maximum-application-lifetime</name><value>-1</value></property>
<property><name>yarn.scheduler.capacity.root.hive.default-application-lifetime</name><value>-1</value></property>
```
#### 1.2.7 Whitelist and Blacklist

- Add a whitelist: `vim ./etc/hadoop/whitelist`

  ```Bash
  hadoop102
  hadoop103
  hadoop104
  hadoop105
  ```

- Add a blacklist: `vim ./etc/hadoop/blacklist`
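After editing either list, the NameNode has to be told to re-read `dfs.hosts` / `dfs.hosts.exclude`. A typical sequence might look like the following; `xsync` here stands for the custom distribution script described in 4.6 and is only an assumed name.

```Bash
# Distribute the updated lists to every node, then ask the NameNode to re-read them
xsync /opt/module/hadoop-3.1.3/etc/hadoop/whitelist /opt/module/hadoop-3.1.3/etc/hadoop/blacklist
hdfs dfsadmin -refreshNodes
```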
### 1.3 Cluster Time Synchronization

Required in an intranet-only environment (needs root). Pick one node as the time server and have the other nodes synchronize against it.

Configure the time server:

1. Stop the ntpd service
   - Stop the service: `systemctl stop ntpd`
   - Disable start on boot: `systemctl disable ntpd`
2. Edit the configuration file ntp.conf: `vim /etc/ntp.conf`

   ```Bash
   # Change: authorize the subnet that is allowed to synchronize
   restrict 192.168.10.0 mask 255.255.255.0 nomodify notrap
   # Change: comment out the Internet time servers
   #server 0.centos.pool.ntp.org iburst
   #server 1.centos.pool.ntp.org iburst
   #server 2.centos.pool.ntp.org iburst
   #server 3.centos.pool.ntp.org iburst
   # Add: fall back to the local clock when network time is unreachable
   server 127.127.1.0
   fudge 127.127.1.0 stratum 10
   ```

3. Edit the ntpd file: `vim /etc/sysconfig/ntpd`

   ```Bash
   # Synchronize the hardware clock as well
   SYNC_HWCLOCK=yes
   ```

4. Start the ntpd service
   - Start the service: `systemctl start ntpd`
   - Enable start on boot: `systemctl enable ntpd`

Configure the other servers to synchronize:

- Stop the ntpd service
  - Stop the service: `systemctl stop ntpd`
  - Disable start on boot: `systemctl disable ntpd`
- Add a cron job that synchronizes the time: `crontab -e`

  ```Bash
  # Example: synchronize every minute
  */1 * * * * /usr/sbin/ntpdate hadoop102
  ```
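To confirm that a node really follows hadoop102, one option (assuming ntpdate is installed) is to force a one-off synchronization and compare the clocks afterwards:

```Bash
# Force an immediate sync against the time server, then check the local clock
sudo /usr/sbin/ntpdate hadoop102
date
```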
### 2. Deployment Plan and Environment Configuration (HA)

|             | hadoop102 | hadoop103 | hadoop104 |
| ----------- | --------- | --------- | --------- |
| HDFS        | **NameNode**<br>DataNode | **NameNode**<br>DataNode | **NameNode**<br>DataNode |
| YARN        | ResourceManager<br>NodeManager | ResourceManager<br>NodeManager | ResourceManager<br>NodeManager |
| Zookeeper   | Zookeeper | Zookeeper | Zookeeper |
| JournalNode | JournalNode | JournalNode | JournalNode |
| ZKFC        | ZKFC | ZKFC | ZKFC |
Starting from the environment configuration of Section 2 (Hadoop cluster mode), rewrite the configuration files below.

- HDFS HA official documentation: https://hadoop.apache.org/docs/r3.1.3/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
- YARN HA official documentation: https://hadoop.apache.org/docs/r3.1.3/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html
### 2.1 HA Configuration Files
#### 2.1.1 core-site.xml
```XML
<configuration>
  <!-- Group the NameNode addresses into one cluster called mycluster -->
  <property><name>fs.defaultFS</name><value>hdfs://mycluster</value></property>
  <!-- Storage directory for files Hadoop generates at runtime -->
  <property><name>hadoop.tmp.dir</name><value>/opt/ha/hadoop-3.1.3/data</value></property>
  <!-- zkServer addresses that zkfc connects to -->
  <property><name>ha.zookeeper.quorum</name><value>hadoop102:2181,hadoop103:2181,hadoop104:2181</value></property>
  <!-- Number of retries for the NN when connecting to the JNs -->
  <property><name>ipc.client.connect.max.retries</name><value>10</value></property>
  <!-- Retry interval -->
  <property><name>ipc.client.connect.retry.interval</name><value>1000</value></property>
</configuration>
```
#### 2.1.2 hdfs-site.xml

```XML
<configuration>
  <property><name>dfs.namenode.name.dir</name><value>file://${hadoop.tmp.dir}/name</value></property>
  <property><name>dfs.datanode.data.dir</name><value>file://${hadoop.tmp.dir}/data</value></property>
  <property><name>dfs.journalnode.edits.dir</name><value>${hadoop.tmp.dir}/jn</value></property>
  <property><name>dfs.nameservices</name><value>mycluster</value></property>
  <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2,nn3</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>hadoop102:8020</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>hadoop103:8020</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn3</name><value>hadoop104:8020</value></property>
  <property><name>dfs.namenode.http-address.mycluster.nn1</name><value>hadoop102:9870</value></property>
  <property><name>dfs.namenode.http-address.mycluster.nn2</name><value>hadoop103:9870</value></property>
  <property><name>dfs.namenode.http-address.mycluster.nn3</name><value>hadoop104:9870</value></property>
  <property><name>dfs.namenode.shared.edits.dir</name><value>qjournal://hadoop102:8485;hadoop103:8485;hadoop104:8485/mycluster</value></property>
  <property><name>dfs.client.failover.proxy.provider.mycluster</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
  <property><name>dfs.ha.fencing.methods</name><value>sshfence</value></property>
  <property><name>dfs.ha.fencing.ssh.private-key-files</name><value>/home/atguigu/.ssh/id_rsa</value></property>
  <property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>
</configuration>
```
#### 2.1.3 yarn-site.xml

```XML
<configuration>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.cluster-id</name><value>cluster-yarn1</value></property>
  <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2,rm3</value></property>
  <property><name>yarn.resourcemanager.hostname.rm1</name><value>hadoop102</value></property>
  <property><name>yarn.resourcemanager.webapp.address.rm1</name><value>hadoop102:8088</value></property>
  <property><name>yarn.resourcemanager.address.rm1</name><value>hadoop102:8032</value></property>
  <property><name>yarn.resourcemanager.scheduler.address.rm1</name><value>hadoop102:8030</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address.rm1</name><value>hadoop102:8031</value></property>
  <property><name>yarn.resourcemanager.hostname.rm2</name><value>hadoop103</value></property>
  <property><name>yarn.resourcemanager.webapp.address.rm2</name><value>hadoop103:8088</value></property>
  <property><name>yarn.resourcemanager.address.rm2</name><value>hadoop103:8032</value></property>
  <property><name>yarn.resourcemanager.scheduler.address.rm2</name><value>hadoop103:8030</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address.rm2</name><value>hadoop103:8031</value></property>
  <property><name>yarn.resourcemanager.hostname.rm3</name><value>hadoop104</value></property>
  <property><name>yarn.resourcemanager.webapp.address.rm3</name><value>hadoop104:8088</value></property>
  <property><name>yarn.resourcemanager.address.rm3</name><value>hadoop104:8032</value></property>
  <property><name>yarn.resourcemanager.scheduler.address.rm3</name><value>hadoop104:8030</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address.rm3</name><value>hadoop104:8031</value></property>
  <property><name>yarn.resourcemanager.zk-address</name><value>hadoop102:2181,hadoop103:2181,hadoop104:2181</value></property>
  <property><name>yarn.resourcemanager.recovery.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.store.class</name><value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value></property>
  <property><name>yarn.nodemanager.env-whitelist</name><value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value></property>
</configuration>
```
# III. Hadoop Operations
## Start/Stop the Cluster

First start:

- Format the NameNode (run on the NameNode node): `hdfs namenode -format`
- Synchronize the NameNode metadata (run on the other NameNode nodes, in the HA setup): `hdfs namenode -bootstrapStandby`

> Formatting generates a new cluster ID, so never format twice. To re-format, first stop the NameNode and DataNode processes and delete the data and logs directories on every node.
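For the HA layout from Section 2, the first start has to happen in a particular order. The following is only a sketch that combines the commands listed in this section; hostnames follow the HA plan, and ZooKeeper is assumed to be running already.

```Bash
# 1. Start a JournalNode on every JournalNode host (hadoop102, hadoop103, hadoop104)
hdfs --daemon start journalnode

# 2. Format and start the first NameNode (hadoop102 only)
hdfs namenode -format
hdfs --daemon start namenode

# 3. Copy its metadata to the other NameNodes (run on hadoop103 and hadoop104)
hdfs namenode -bootstrapStandby

# 4. Initialize the HA state in ZooKeeper (once)
hdfs zkfc -formatZK

# 5. Start the remaining daemons (NameNodes, DataNodes, ZKFCs)
start-dfs.sh
```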
Start/stop the cluster:

- Start/stop HDFS:
  - All of HDFS (run on the NameNode node): `start-dfs.sh` / `stop-dfs.sh`
  - NameNode (run on the NameNode node): `hdfs --daemon start/stop namenode`
  - DataNode (run on a DataNode node): `hdfs --daemon start/stop datanode`
  - SecondaryNameNode (run on the SecondaryNameNode node): `hdfs --daemon start/stop secondarynamenode`
  - JournalNode (run on a JournalNode node): `hdfs --daemon start/stop journalnode`
- Start/stop YARN:
  - All of YARN (run on the ResourceManager node): `start-yarn.sh` / `stop-yarn.sh`
  - ResourceManager (run on the ResourceManager node): `yarn --daemon start/stop resourcemanager`
  - NodeManager (run on a NodeManager node): `yarn --daemon start/stop nodemanager`
- Start/stop the job history server (run on the NameNode node): `mapred --daemon start/stop historyserver`
- Start/stop the JournalNode service (run on a JournalNode node): `hdfs --daemon start/stop journalnode`
Check the cluster status: `jps`
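To check every node in one go, a small loop over the hosts from the deployment plan works (assuming passwordless SSH is set up):

```Bash
# Show the running Java daemons on each cluster node
for host in hadoop102 hadoop103 hadoop104; do
    echo "=============== $host ==============="
    ssh "$host" jps
done
```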
## Operating the Cluster
### 4.1 Hadoop Commands

- View command help: `hadoop`
- View subcommand help: `hadoop <subcommand>`
- View sub-option help: `hadoop <subcommand> -help <option>`

| Common subcommand | Purpose | Example |
| ----------------- | ------- | ------- |
| jar | Run a jar file | `hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount -D mapreduce.job.queuename=hive /input /output` |
| archive | Create a Hadoop archive | `hadoop archive -archiveName input.har -p /user/atguigu/input /user/atguigu/output` |
| fs | Operate on the HDFS file system | `hadoop fs -ls har:///user/atguigu/output/input.har`<br>`hadoop fs -cat 2020-06-24/* \| zcat` |
### 4.2 HDFS Commands

- View command help: `hdfs`
- View subcommand help: `hdfs <subcommand>`
- View sub-option help: `hdfs <subcommand> -help <option>`

| Common option | Purpose | Example |
| ------------- | ------- | ------- |
| --daemon | Manage HDFS daemons | `hdfs --daemon start/stop journalnode` |

| Common subcommand | Purpose | Example |
| ----------------- | ------- | ------- |
| namenode | Manage the NameNode | `hdfs namenode -format` |
| dfs | Operate on the HDFS file system | `hdfs dfs -put ./wuguo.txt /sanguo` |
| oiv | Convert an fsimage file | `hdfs oiv -p XML -i fsimage_0000000000000000025 -o /opt/module/hadoop-3.1.3/fsimage.xml` |
| oev | Convert an edits file | `hdfs oev -p XML -i edits_0000000000000000012-0000000000000000013 -o /opt/module/hadoop-3.1.3/edits.xml` |
| dfsadmin | Run the DFS admin client | `hdfs dfsadmin -safemode leave` |
| diskbalancer | Balance data across disks | `hdfs diskbalancer -plan hadoop102` |
| haadmin | Administer HA | `hdfs haadmin -transitionToActive nn1` |
| zkfc | Manage ZooKeeper-based failover | `hdfs zkfc -formatZK` |
### 4.3 YARN Commands

- View command help: `yarn`
- View subcommand help: `yarn <subcommand>`
- View sub-option help: `yarn <subcommand> -help <option>`

| Common option | Purpose | Example |
| ------------- | ------- | ------- |
| --daemon | Manage YARN daemons | `yarn --daemon start/stop resourcemanager` |

| Common subcommand | Purpose | Example |
| ----------------- | ------- | ------- |
| rmadmin | Administer the ResourceManager | `yarn rmadmin -getServiceState rm1` |
### 4.4 MapReduce Commands

- View command help: `mapred`
- View subcommand help: `mapred <subcommand>`
- View sub-option help: `mapred <subcommand> -help <option>`

| Common option | Purpose | Example |
| ------------- | ------- | ------- |
| --daemon | Manage MapReduce daemons | `mapred --daemon start/stop historyserver` |
### 4.5 Built-in Cluster Scripts

① Start/stop the cluster

- All of HDFS (run on the NameNode node): `start-dfs.sh` / `stop-dfs.sh`
- All of YARN (run on the ResourceManager node): `start-yarn.sh` / `stop-yarn.sh`

② Start/stop cluster data balancing

Start the balancer on a relatively idle node (see the bandwidth note below).

- Start cluster data balancing: `start-balancer.sh -threshold 10` (10 means disk usage across the nodes may differ by at most 10%)
- Stop cluster data balancing: `stop-balancer.sh`
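The balancer can consume a lot of network bandwidth, so it is common to cap it before starting; the 10 MB/s value below is only an example.

```Bash
# Limit the bandwidth (bytes/second) the balancer may use per DataNode, then start it
hdfs dfsadmin -setBalancerBandwidth 10485760
start-balancer.sh -threshold 10
```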
### 4.6 Custom Cluster Scripts

- Script location: `~/bin` (create the bin directory if it does not exist; it is on the user's PATH by default)
- Grant execute permission: `chmod u+x <script path>`
- Distribute the scripts so every node can run them

① Distribution script

- Purpose: synchronize local files across the cluster (only changed files are copied); watch out for root/permission issues. A minimal sketch is given below.
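A minimal sketch of such a distribution script, here assumed to be called `xsync` and saved in `~/bin`; the target hosts follow the deployment plan.

```Bash
#!/bin/bash
# xsync: copy the given files/directories to the other cluster nodes.
# rsync -av only transfers files that have changed.
if [ $# -lt 1 ]; then
    echo "Usage: xsync <file or directory> ..."
    exit 1
fi
for host in hadoop103 hadoop104; do
    echo "==================== $host ===================="
    for path in "$@"; do
        if [ -e "$path" ]; then
            dir=$(cd -P "$(dirname "$path")" && pwd)   # absolute parent directory
            name=$(basename "$path")
            ssh "$host" "mkdir -p $dir"
            rsync -av "$dir/$name" "$host:$dir"
        else
            echo "$path does not exist!"
        fi
    done
done
```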
② Group-start script

③ Format/cleanup script

④ JPS script
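As an illustration of ②, a group start/stop script for the non-HA plan might look like the sketch below (the name `myhadoop.sh` is an assumption; HDFS and the history server live on hadoop102 and YARN on hadoop103, as planned above). A JPS script (④) can simply reuse the ssh/jps loop shown earlier in this section.

```Bash
#!/bin/bash
# myhadoop.sh: start or stop HDFS, YARN and the history server on the planned nodes.
# Assumes HADOOP_HOME is set locally and points to the same path on every node.
case $1 in
"start")
    echo "---------- starting HDFS (hadoop102) ----------"
    ssh hadoop102 "$HADOOP_HOME/sbin/start-dfs.sh"
    echo "---------- starting YARN (hadoop103) ----------"
    ssh hadoop103 "$HADOOP_HOME/sbin/start-yarn.sh"
    echo "---------- starting history server (hadoop102) ----------"
    ssh hadoop102 "$HADOOP_HOME/bin/mapred --daemon start historyserver"
    ;;
"stop")
    echo "---------- stopping history server (hadoop102) ----------"
    ssh hadoop102 "$HADOOP_HOME/bin/mapred --daemon stop historyserver"
    echo "---------- stopping YARN (hadoop103) ----------"
    ssh hadoop103 "$HADOOP_HOME/sbin/stop-yarn.sh"
    echo "---------- stopping HDFS (hadoop102) ----------"
    ssh hadoop102 "$HADOOP_HOME/sbin/stop-dfs.sh"
    ;;
*)
    echo "Usage: myhadoop.sh start|stop"
    ;;
esac
```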
# IV. Hadoop API

API documentation: https://hadoop.apache.org/docs/r3.1.3/api/index.html
- Import the dependency coordinates and build plugins:

  ```XML
  <dependencies>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <version>3.1.3</version>
      </dependency>
      <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>4.12</version>
      </dependency>
      <dependency>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
          <version>1.7.30</version>
      </dependency>
  </dependencies>

  <build>
      <plugins>
          <plugin>
              <artifactId>maven-compiler-plugin</artifactId>
              <version>3.6.1</version>
              <configuration>
                  <source>1.8</source>
                  <target>1.8</target>
              </configuration>
          </plugin>
          <plugin>
              <artifactId>maven-assembly-plugin</artifactId>
              <configuration>
                  <descriptorRefs>
                      <descriptorRef>jar-with-dependencies</descriptorRef>
                  </descriptorRefs>
              </configuration>
              <executions>
                  <execution>
                      <id>make-assembly</id>
                      <phase>package</phase>
                      <goals>
                          <goal>single</goal>
                      </goals>
                  </execution>
              </executions>
          </plugin>
      </plugins>
  </build>
  ```
- Create and edit the configuration file log4j.properties:

  ```Plain
  log4j.rootLogger=INFO, stdout
  log4j.appender.stdout=org.apache.log4j.ConsoleAppender
  log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
  log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
  log4j.appender.logfile=org.apache.log4j.FileAppender
  log4j.appender.logfile.File=target/spring.log
  log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
  log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
  ```
# V. Running Hadoop Clients

Local run: after configuring the environment, run the program directly.
Build a jar locally and run it on the cluster:

- Write the program and build the jar
- Copy the jar (without bundled dependencies) onto a cluster node
- Run it on the cluster: `hadoop jar <jar name> <fully qualified driver class> <input dir> <output dir>`
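A concrete invocation might look like the following; the jar name, driver class, and HDFS paths are placeholders.

```Bash
# Submit the job to YARN from a cluster node
hadoop jar wc.jar com.example.mr.WordCountDriver /input /output
```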
Submit from the local machine to the cluster:

1. Add the necessary configuration to the Driver class:

   ```Java
   Configuration conf = new Configuration();
   // Parameters for running on the cluster: the HDFS/NameNode address
   conf.set("fs.defaultFS", "hdfs://hadoop102:8020");
   // Run MapReduce on YARN
   conf.set("mapreduce.framework.name", "yarn");
   // Allow cross-platform submission to a remote cluster
   conf.set("mapreduce.app-submission.cross-platform", "true");
   // Location of the YARN ResourceManager
   conf.set("yarn.resourcemanager.hostname", "hadoop103");
   ```

2. Edit the Program arguments in the IDE run configuration:

   ```Plain
   hdfs://hadoop102:8020/<input dir> hdfs://hadoop102:8020/<output dir>
   ```

3. When the program is finished, build the jar.
4. Set the jar on the Driver's job, passing the absolute path of the jar:

   ```Java
   job.setJar("C:\\Users\\skiin\\IdeaProjects\\mapreduce1021\\target\\mapreduce1021-1.0-SNAPSHOT.jar");
   ```

5. Run the program.
# Appendix: Hadoop Directory Structure

- bin: Hadoop command scripts
- sbin: scripts for starting and stopping Hadoop
- etc: Hadoop configuration files
- lib: Hadoop native libraries
- share: dependency jars, documentation, and the official examples
- include
- libexec
- README.txt
- NOTICE.txt
- LICENSE.txt