The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. – From Apache Hadoop

Base environment: Vagrant + CentOS 7

Installing Vagrant

Q: Why use Vagrant?

A: Vagrant is the command line utility for managing the lifecycle of virtual machines.

# Vagrant download page: https://www.vagrantup.com/downloads.html
# VirtualBox download page: https://www.virtualbox.org/wiki/Downloads

Installing CentOS 7 with Vagrant

# Download the box: link: https://pan.baidu.com/s/1wTgDeyf8y65mxV6z1TQDmw  password: z33b
vagrant box add centos/7 <local path to the box>
# Initialize the project
vagrant init centos/7
# Boot the VM
vagrant up
# SSH into the VM
vagrant ssh
# Configure a private network with a fixed IP
vi Vagrantfile
# Uncomment and edit the following lines
# Create a private network, which allows host-only access to the machine
# using a specific IP.
config.vm.network "private_network", ip: "192.168.33.11"
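
After editing the Vagrantfile, reload the VM so the network change takes effect. A quick check (the private NIC is typically eth1 on the centos/7 box):

# Restart the VM and re-apply the Vagrantfile
vagrant reload
# Verify the private IP from inside the VM
vagrant ssh -c "ip addr show eth1"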

# Set a password for the default user vagrant, e.g. 123456
sudo passwd vagrant

Switching the yum mirror

yum install -y wget
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum update -y
# yum makecache

Prerequisites

# Install the Java environment
sudo yum install -y java-1.8.0-openjdk-devel.x86_64

# pdsh can manage ssh across many hosts (optional if plain ssh is enough;
# on CentOS the ssh packages are openssh-clients/openssh-server, and pdsh comes from EPEL)
# sudo yum install -y openssh-clients openssh-server pdsh
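
To verify the install and locate the JDK root for JAVA_HOME later (a quick sketch; the exact openjdk directory name varies by build):

# Confirm Java is on the PATH
java -version
# Resolve the symlink chain to find the actual installation directory
readlink -f /usr/bin/java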

Pseudo-Distributed Operation

Download Hadoop

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar -xvf hadoop-3.2.1.tar.gz
rm -f hadoop-3.2.1.tar.gz
mv hadoop-3.2.1 hadoop

# Change into the hadoop directory
cd hadoop

Add environment variables

# Edit /etc/profile
sudo vi /etc/profile
# Add the following environment variables
export HADOOP_HOME=/home/vagrant/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Apply the changes
source /etc/profile
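
A quick sanity check that the variables took effect:

echo $HADOOP_HOME
# The hadoop binary should now resolve from the PATH
which hadoop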

Edit etc/hadoop/hadoop-env.sh

# set to the root of your Java installation
export JAVA_HOME=/usr/lib/jvm/java

# Test the configuration: this will display the usage documentation for the hadoop script
hadoop

Hadoop configuration

# Edit etc/hadoop/core-site.xml
vi etc/hadoop/core-site.xml
# Add the following
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

# Edit etc/hadoop/hdfs-site.xml
vi etc/hadoop/hdfs-site.xml
# Add the following
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Set up passwordless ssh

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
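
Now check that you can ssh to localhost without a passphrase:

ssh localhost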

Start Hadoop

# Format the filesystem
hdfs namenode -format
# Start the NameNode and DataNode daemons
start-dfs.sh

# Browse the NameNode web UI (from the host, use the VM's private IP: http://192.168.33.11:9870/)
http://localhost:9870/
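
Before stopping, you can optionally run the grep example bundled with Hadoop as a smoke test (run from the hadoop directory; the HDFS home directory is assumed to be /user/vagrant):

# Create the HDFS home directory for the vagrant user
hdfs dfs -mkdir -p /user/vagrant
# Copy the configuration files in as sample input
hdfs dfs -mkdir input
hdfs dfs -put etc/hadoop/*.xml input
# Run the bundled grep example
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
# View the output
hdfs dfs -cat output/*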

# Stop the daemons
stop-dfs.sh

YARN

# Edit etc/hadoop/mapred-site.xml
vi etc/hadoop/mapred-site.xml
# Add the following
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

# Edit etc/hadoop/yarn-site.xml
vi etc/hadoop/yarn-site.xml
# Add the following
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
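
The env-whitelist passes these variables from the NodeManager through to container environments; this is also why mapreduce.application.classpath above is written in terms of $HADOOP_MAPRED_HOME.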

# Start the ResourceManager and NodeManager daemons
start-yarn.sh

# Browse the ResourceManager web UI (from the host: http://192.168.33.11:8088/)
http://localhost:8088/
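
To confirm the NodeManager registered with the ResourceManager (a quick check):

# Should list one running node
yarn node -list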

# Stop the daemons
stop-yarn.sh

Hadoop Cluster Setup

Bring up two more machines (following the same steps as in "Installing CentOS 7 with Vagrant"). The Java environment and the Hadoop environment variables must be installed and configured on every machine.

Set the hostname

sudo vi /etc/hostname
# Set the three machines to master, worker1, and worker2 respectively

# Reboot so the change takes effect permanently
sudo reboot

# Map hostnames to IP addresses
sudo vi /etc/hosts
# Add the following
192.168.33.11 master
192.168.33.12 worker1
192.168.33.13 worker2
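
Verify that name resolution works from each machine:

ping -c 1 worker1
ping -c 1 worker2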

Configure passwordless ssh between the machines (key generation is the same as in the pseudo-distributed section)

# Generate a passwordless id_rsa.pub (public key) on each of the three machines
# Append all three public keys to ~/.ssh/authorized_keys on master
# Then distribute the combined file to the other two machines
rsync -av --progress ~/.ssh/authorized_keys vagrant@worker1:/home/vagrant/.ssh/
rsync -av --progress ~/.ssh/authorized_keys vagrant@worker2:/home/vagrant/.ssh/

# From master, test the passwordless connection to worker1
[vagrant@master ~]$ ssh worker1
Last login: Thu Oct 31 12:58:35 2019 from 10.0.2.2
[vagrant@worker1 ~]$

Configure Hadoop on master only

# Edit etc/hadoop/hadoop-env.sh
vi etc/hadoop/hadoop-env.sh
# set to the root of your Java installation
export JAVA_HOME=/usr/lib/jvm/java

# Edit etc/hadoop/core-site.xml
vi etc/hadoop/core-site.xml
# Add the following
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
</property>

# Edit etc/hadoop/hdfs-site.xml
vi etc/hadoop/hdfs-site.xml
# Add the following
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

# Edit etc/hadoop/yarn-site.xml
vi etc/hadoop/yarn-site.xml
# Add the following
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
</property>
<property>
    <name>yarn.application.classpath</name>
    <value>/home/vagrant/hadoop/etc/hadoop:/home/vagrant/hadoop/share/hadoop/common/lib/*:/home/vagrant/hadoop/share/hadoop/common/*:/home/vagrant/hadoop/share/hadoop/hdfs:/home/vagrant/hadoop/share/hadoop/hdfs/lib/*:/home/vagrant/hadoop/share/hadoop/hdfs/*:/home/vagrant/hadoop/share/hadoop/mapreduce/lib/*:/home/vagrant/hadoop/share/hadoop/mapreduce/*:/home/vagrant/hadoop/share/hadoop/yarn:/home/vagrant/hadoop/share/hadoop/yarn/lib/*:/home/vagrant/hadoop/share/hadoop/yarn/*</value>
</property>

# Edit etc/hadoop/mapred-site.xml
vi etc/hadoop/mapred-site.xml
# Add the following
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
<property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>

# Create the etc/hadoop/workers file
vi etc/hadoop/workers
# Add the following
worker1
worker2
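
The workers file tells the start-dfs.sh and start-yarn.sh helper scripts which hosts to start the DataNode and NodeManager daemons on.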

Distribute the Hadoop configuration to the other two machines

# worker1
rsync -av --progress etc/* vagrant@192.168.33.12:/home/vagrant/hadoop/etc/

# worker2
rsync -av --progress etc/* vagrant@192.168.33.13:/home/vagrant/hadoop/etc/

Start Hadoop (run on master only)

# Format the distributed filesystem ("master" here is the cluster name argument)
hdfs namenode -format master
# Start HDFS (NameNode on master, DataNodes on the workers)
start-dfs.sh

# Start YARN
start-yarn.sh
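
To confirm both workers joined the cluster (a quick check once the daemons are up):

# Should report two live DataNodes
hdfs dfsadmin -report
# Should list worker1 and worker2 as running NodeManagers
yarn node -list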

# The following command is optional
# Start the MapReduce JobHistory Server with the following command, run on the designated server as mapred:
# mapred --daemon start historyserver

Stop Hadoop (run on master only)

stop-dfs.sh
stop-yarn.sh
mapred --daemon stop historyserver

References