This article is a guide to setting up a Hadoop cluster. The cluster runs on local CentOS virtual machines using VirtualBox. I use this setup as a local environment for development and testing.
I followed many of the steps Austin Ouyang laid out in the blog post here. Hopefully, I can next document moving these virtual machines to a cloud provider.
Prerequisites
This guide assumes you are using the following software versions:
- MacOS 10.11.3
- Vagrant 1.8.5
- Java 1.8.0 (the JRE is fine for running Hadoop, but install the JDK if you want to run the MapReduce examples later)
- Hadoop 2.7.3
Here are the steps I used:
- First, create a workspace.
mkdir -p ~/vagrant_boxes/hadoop
cd ~/vagrant_boxes/hadoop
- Next, create a new Vagrant box. I'm using a minimal CentOS Vagrant box.
vagrant box add "CentOS 6.5 x86_64" https://github.com/2creatives/vagrant-centos/releases/download/v6.5.3/centos65-x86_64-20140116.box
- We are going to create a Vagrant box with the packages we need. So, first we initialize the Vagrant box.
vagrant init -m "CentOS 6.5 x86_64" hadoop_base
- Next, change the Vagrantfile to the following:
Vagrant.configure(2) do |config|
  config.vm.box = "CentOS 6.5 x86_64"
  config.vm.box_url = "hadoop_base"
  config.ssh.insert_key = false
end
- Now, install Hadoop and its dependencies.
vagrant up
vagrant ssh
sudo yum install java-1.8.0-openjdk-devel
sudo yum install wget
wget -P ~ http://apache.claz.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
gunzip -c *gz | tar xvf -
- Open up your ~/.bash_profile and append the following lines.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk.x86_64
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=~/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
- Source the profile.
source ~/.bash_profile
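At this point both Java and Hadoop should be on your PATH. As a quick sanity check (not part of the original steps, but cheap insurance), both of these should print version information:
java -version
hadoop version
If either command fails, re-check the paths in ~/.bash_profile.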
- In /etc/hosts, add the following lines:
192.168.50.21 namenode.example.com
192.168.50.22 datanode1.example.com
192.168.50.23 datanode2.example.com
192.168.50.24 datanode3.example.com
192.168.50.25 datanode4.example.com
- In $HADOOP_CONF_DIR/hadoop-env.sh, replace the ${JAVA_HOME} reference with an explicit path:
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk.x86_64
- Edit the $HADOOP_CONF_DIR/core-site.xml file to have the following XML:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>
- Edit the $HADOOP_CONF_DIR/yarn-site.xml file to have the following XML:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>namenode.example.com</value>
  </property>
</configuration>
- Now, copy the mapred-site.xml file from a template.
cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml
- Edit the $HADOOP_CONF_DIR/mapred-site.xml to have the following XML:
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>namenode.example.com:54311</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
- Edit the $HADOOP_CONF_DIR/hdfs-site.xml file to have the following XML:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/hadoop/hdfs/datanode</value>
  </property>
</configuration>
- Make the data directories.
sudo mkdir -p /data/hadoop/hdfs/namenode
sudo mkdir -p /data/hadoop/hdfs/datanode
sudo chown -R vagrant:vagrant /data/hadoop
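If you want to double-check the ownership change, ls should show both directories owned by vagrant:
ls -ld /data/hadoop/hdfs/namenode /data/hadoop/hdfs/datanode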
- In $HADOOP_CONF_DIR/masters, add the following line:
namenode.example.com
- In $HADOOP_CONF_DIR/slaves, add the following lines:
datanode1.example.com
datanode2.example.com
datanode3.example.com
datanode4.example.com
- Create a ~/.ssh/config file to disable strict host key checking for SSH. Since these are dev servers, this is acceptable. The indentation before StrictHostKeyChecking is only for readability.
Host *
    StrictHostKeyChecking no
- Now run these commands to finish setting up passwordless authentication.
chmod 600 ~/.ssh/config
sudo hostname namenode.example.com
ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
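The HDFS and YARN start scripts depend on this passwordless SSH setup, so it's worth verifying now. The following should print the hostname without a password prompt (BatchMode makes SSH fail instead of prompting if the keys are wrong):
ssh -o BatchMode=yes localhost hostname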
- Exit the SSH session and copy the VM for the other Hadoop nodes.
exit
vagrant halt
vagrant package
vagrant box add hadoop ~/vagrant_boxes/hadoop/package.box
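You can confirm the packaged box was registered before moving on:
vagrant box list
Both the original "CentOS 6.5 x86_64" box and the new hadoop box should be listed.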
- Edit the Vagrantfile to look like the following. This will create five Hadoop nodes for us using the new Hadoop box: one namenode and four datanodes.
Vagrant.configure("2") do |config| config.vm.define "hadoop-namenode" do |node| node.vm.box = "hadoop" node.vm.box_url = "namenode.example.com" node.vm.hostname = "namenode.example.com" node.vm.network :private_network, ip: "192.168.50.21" node.ssh.insert_key = false # Start Hadoop node.vm.provision "shell", inline: "hdfs namenode -format -force", privileged: false node.vm.provision "shell", inline: "~/hadoop-2.7.3/sbin/start-dfs.sh", privileged: false node.vm.provision "shell", inline: "~/hadoop-2.7.3/sbin/start-yarn.sh", privileged: false node.vm.provision "shell", inline: "~/hadoop-2.7.3/sbin/mr-jobhistory-daemon.sh start historyserver", privileged: false end (1..4).each do |i| config.vm.define "hadoop-datanode#{i}" do |node| node.vm.box = "hadoop" node.vm.box_url = "datanode#{i}.example.com" node.vm.hostname = "datanode#{i}.example.com" node.vm.network :private_network, ip: "192.168.50.2#{i+1}" node.ssh.insert_key = false end end end
- Bring the new Vagrant VMs up.
vagrant up --no-provision
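Before provisioning, you can verify that all five VMs booted:
vagrant status
hadoop-namenode and hadoop-datanode1 through hadoop-datanode4 should all show as running.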
- Start Hadoop on the namenode.
vagrant provision
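To confirm the daemons actually started, you can run jps on the namenode (jps ships with the OpenJDK devel package installed earlier):
vagrant ssh hadoop-namenode -c "jps"
You should see processes like NameNode, SecondaryNameNode, ResourceManager, and JobHistoryServer.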
To verify that Hadoop is working, you can do the following.
First, from your local machine, you should be able to access the NameNode web UI (http://192.168.50.21:50070/). You should see 4 live nodes, one per datanode. Follow the MapReduce Tutorial to test your cluster further.
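For an end-to-end smoke test, you can also SSH into the namenode and run the pi estimator that ships with the Hadoop 2.7.3 distribution; hdfs dfsadmin -report lists the live datanodes from the command line as well:
vagrant ssh hadoop-namenode
hdfs dfsadmin -report
hadoop jar ~/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 4 100
If the job completes on YARN and prints an estimate of pi, the cluster is working end to end.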