Hadoop Cluster Setup
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware.

This document describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes.

Required Software
Required software for Linux and Windows includes:
1. Java 1.6.x, preferably from Sun, must be installed.
2. ssh must be installed and sshd must be running, so that the Hadoop scripts that manage remote Hadoop daemons can be used.

Installation
Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster. Typically, one machine in the cluster is designated as the NameNode and another as the JobTracker, exclusively. These are the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves.
The root of the distribution is referred to as HADOOP_HOME. All machines in the cluster usually have the same HADOOP_HOME path.

Steps for Installation
1. Install Java 1.6
Check the Java version:
$ java -version

2. Add a dedicated user and group
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
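To verify the account, a quick check (standard coreutils; the exact output format varies by distribution):
$ id hduser    # should list hadoop among the user's groups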

3. Install ssh
$ su - hduser
Generate a keypair:
$ ssh-keygen -t rsa -P ""
Enable access to the local machine:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ ssh localhost

4. Disable IPv6
$ sudo gedit /etc/sysctl.conf
Add the following lines to the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
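To apply the change without rebooting and confirm it took effect (a standard sysctl check, not part of the original steps):
$ sudo sysctl -p
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6    # should print 1 once IPv6 is disabled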

5. Download Hadoop and place it in /usr/local/hadoop
$ cd /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
$ sudo mv hadoop-1.0.3 hadoop
$ sudo chown -R hduser:hadoop hadoop

6. Update .bashrc
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
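To pick up the new settings in the current shell and confirm them (assuming the tarball was unpacked as in step 5):
$ source ~/.bashrc
$ echo $HADOOP_HOME    # should print /usr/local/hadoop
$ hadoop version       # confirms the bin/ directory is on the PATH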

For the master machine only: copy the ssh public key to the slave systems
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@<slave-ip>
$ ssh <slave-ip>
# repeat the same for all the slave IPs
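With many slaves, a hypothetical helper loop saves typing; it assumes the slave IPs are listed one per line in a file named slaves.txt, which is not part of the original steps:
$ for ip in $(cat slaves.txt); do ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@$ip; done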

Configuration:
1. Setting JAVA_HOME
Open /usr/local/hadoop/conf/hadoop-env.sh and set JAVA_HOME:
export JAVA_HOME=/usr/lib/jvm/java-6-sun

2. Create the tmp directory
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 755 /app/hadoop/tmp

3. Modify conf/masters (only on the master node)

<master-ip>    # use the full IP, don't use localhost or 127.0.0.1

4. Modify conf/slaves

<slave-1-ip>    # use the full IP, don't use localhost or 127.0.0.1
<slave-2-ip>    # use the full IP, don't use localhost or 127.0.0.1
# repeat the same for all slaves, one IP per line

5. Modify conf/core-site.xml (add between the <configuration> tags):
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://<master-ip>:54310</value>
</property>

6. Modify conf/mapred-site.xml (add between the <configuration> tags):
<property>
  <name>mapred.job.tracker</name>
  <value><master-ip>:54311</value>
</property>

7. Modify conf/hdfs-site.xml (add between the <configuration> tags):
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
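The conf/ directory must be identical on every node. One way to push the files out to the slaves, assuming rsync is installed on all machines (a convenience, not part of the original steps):
$ rsync -av /usr/local/hadoop/conf/ hduser@<slave-ip>:/usr/local/hadoop/conf/
# repeat for each slave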

Formatting the HDFS filesystem
HDFS must be formatted before the cluster is started for the first time. The filesystem directory is specified by dfs.name.dir.
$ /usr/local/hadoop/bin/hadoop namenode -format    # only on the master system
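To confirm the format succeeded, you can inspect the name directory. This sketch assumes dfs.name.dir was left at its Hadoop 1.x default of ${hadoop.tmp.dir}/dfs/name:
$ ls /app/hadoop/tmp/dfs/name/current
# a freshly formatted NameNode directory contains fsimage, edits and VERSION files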

Starting the multi-node cluster
Starting the cluster is performed in two steps.
1. We begin with starting the HDFS daemons: the NameNode daemon is started on the master, and DataNode daemons are started on all slaves (here: master and slaves).
2. Then we start the MapReduce daemons: the JobTracker is started on the master, and TaskTracker daemons are started on all slaves (here: master and slaves).

Start the HDFS layer:
hduser@master:/usr/local/hadoop$ bin/start-dfs.sh

On a slave, you can examine the success or failure of this command by inspecting the log file logs/hadoop-hduser-datanode-slave.log.
Java processes on master after starting the HDFS daemons:
hduser@master:/usr/local/hadoop$ jps
14799 NameNode
15314 Jps
14880 DataNode
14977 SecondaryNameNode

Java processes on slave after starting the HDFS daemons:
hduser@slave:/usr/local/hadoop$ jps
15183 DataNode
15616 Jps

Start the MapReduce daemons
## Perform the following only on the master node
hduser@master:/usr/local/hadoop$ bin/start-mapred.sh

Java processes on master after starting the MapReduce daemons:
hduser@master:/usr/local/hadoop$ jps
16017 Jps
14799 NameNode
15686 TaskTracker
14880 DataNode
15596 JobTracker
14977 SecondaryNameNode

Java processes on slave after starting the MapReduce daemons:
hduser@slave:/usr/local/hadoop$ jps
15183 DataNode
15897 TaskTracker
16284 Jps
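A quick sketch for verifying that every expected daemon is up on the master; it assumes jps from the JDK is on the PATH (drop NameNode, SecondaryNameNode and JobTracker from the list when checking a slave):
$ for d in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do jps | grep -qw "$d" && echo "$d running" || echo "$d NOT running"; done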

Stopping the Cluster:
Like starting the cluster, stopping it is done in two steps; the workflow, however, is the reverse of starting.
1. We begin with stopping the MapReduce daemons: the JobTracker is stopped on the master, and TaskTracker daemons are stopped on all slaves (here: master and slave).
2. Then we stop the HDFS daemons: the NameNode daemon is stopped on the master, and DataNode daemons are stopped on all slaves (here: master and slave).

Stopping the MapReduce daemons:
Run the command bin/stop-mapred.sh on the JobTracker machine. This shuts down the MapReduce cluster by stopping the JobTracker daemon on the machine you ran the command on, along with the TaskTrackers on the machines listed in the conf/slaves file.

Stopping the HDFS layer:
hduser@master:/usr/local/hadoop$ bin/stop-dfs.sh

Java processes on master after stopping the HDFS daemons:
hduser@master:/usr/local/hadoop$ jps
18670 Jps

Java processes on slave after stopping the HDFS daemons:
hduser@slave:/usr/local/hadoop$ jps
18894 Jps

Running a MapReduce Job (WordCount)
# Perform all operations only on the master node
1. Start the Hadoop cluster.

2. Copy data to HDFS
Download eBooks as plain-text files in UTF-8 encoding and store them in a local temporary directory of choice, for example /tmp/gutenberg.
Links:
http://www.gutenberg.org/ebooks/19699
http://www.gutenberg.org/ebooks/132
http://www.gutenberg.org/ebooks/1661

hduser@master:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
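For reference, a command-line sketch of the download and a check that the upload succeeded. The .txt.utf-8 suffix is an assumption about gutenberg.org's plain-text download links; fall back to the pages above if it differs:
$ mkdir -p /tmp/gutenberg && cd /tmp/gutenberg
$ for id in 19699 132 1661; do wget "http://www.gutenberg.org/ebooks/$id.txt.utf-8" -O pg$id.txt; done
hduser@master:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg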

3. Run the MapReduce job
hduser@master:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

(or)

hduser@master:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

4. Check that the result was successfully stored in the HDFS directory /user/hduser/gutenberg-output:
hduser@master:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser

5. Copy the result to the local disk
To inspect the output, you can copy it from HDFS to the local file system:
hduser@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
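WordCount writes one word<TAB>count pair per line, so the most frequent words can be pulled out with standard tools (a convenience sketch, not part of the original steps):
$ sort -k2,2nr /tmp/gutenberg-output/gutenberg-output | head -20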

Hadoop Web Interfaces

• http://localhost:50070/ – web UI of the NameNode daemon
• http://localhost:50030/ – web UI of the JobTracker daemon
• http://localhost:50060/ – web UI of the TaskTracker daemon
