How to Install Apache Hadoop on Linux CentOS 6 (Single Node)
HDFS, the Hadoop Distributed File System, is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
HDFS was inspired by the Google File System.
HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence theoretically does not require RAID storage on hosts (but to increase I/O performance some RAID configurations are still useful). With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to re-balance data, to move copies around, and to keep the replication of data high.
In this tutorial we will see how to install HDFS on a single-node cluster, covering only the configuration needed to start working with HDFS.
These are the steps you need to take to install and configure HDFS on a Linux CentOS box.
1 - Install Java
Java is a requirement for running Hadoop on any system, so make sure you have Java installed on your system using the following command.
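A minimal sketch using OpenJDK from the CentOS repositories (the package name and version are assumptions; use whatever JDK your repositories provide):

# Check whether Java is already installed
java -version

# If it is not, install OpenJDK; the -devel package also provides jps, which we use later to check the Hadoop daemons
sudo yum install -y java-1.7.0-openjdk java-1.7.0-openjdk-devel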
2 - Create a user called hadoop
We will create the user hadoop and a group hadoopgrp, then add the hadoop user to the hadoopgrp group.
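For example (the group name hadoopgrp is just the convention used in this tutorial):

# Create the group, create the hadoop user as a member of it, and set its password
sudo groupadd hadoopgrp
sudo useradd -g hadoopgrp -m hadoop
sudo passwd hadoop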
3 - Enable passwordless SSH to the host for the hadoop user
You will need this access for the Hadoop installation; simply follow the steps below.
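A sketch of the usual key-based setup, run as the hadoop user:

# Switch to the hadoop user and generate an RSA key pair with an empty passphrase
su - hadoop
ssh-keygen -t rsa -P ""

# Authorize the public key for logins to this host
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys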
Validate SSH access:
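# Connecting to localhost should now work without a password prompt
ssh localhost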
Accept the host key fingerprint and continue.
4 - Download the latest stable version of Hadoop
Download, untar, and move the Hadoop distribution.
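A sketch using the Apache archive (the version number and download URL are assumptions; pick the current stable release and a mirror close to you):

# As the hadoop user, download, extract, and rename the distribution into the home directory
cd /home/hadoop
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar -xzf hadoop-2.7.3.tar.gz
mv hadoop-2.7.3 hadoop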
5 - Set up the Hadoop environment variables
Append the following to the /home/hadoop/.bashrc file. Make sure the Hadoop home path matches the location where you extracted Hadoop; otherwise, change it to point to your Hadoop home path.
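A sketch of the usual set of variables, assuming Hadoop was extracted to /home/hadoop/hadoop as in the previous step:

# Hadoop environment variables
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin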
Now apply the changes in the current running environment:
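# Reload .bashrc so the new variables take effect in this shell
source ~/.bashrc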
6 - Set the Java home in the Hadoop environment
You need to look for the hadoop-env.sh file (under $HADOOP_HOME/etc/hadoop) and edit the JAVA_HOME parameter value.
You also need to find the location of the Java installation from step 1.
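One way to find it (the resolved path depends on the JDK package installed, so treat the output as an example):

# Resolve the real path of the java binary; strip the trailing /jre/bin/java (or /bin/java) to get JAVA_HOME
readlink -f $(which java)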
Edit hadoop-env.sh, then save and close it.
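A sketch of the line to change in $HADOOP_HOME/etc/hadoop/hadoop-env.sh (the JDK path below is an assumption; use the directory you found with the command above):

# Point Hadoop at the installed JDK
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64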
7 - Edit the core-site.xml file, located in the same directory as hadoop-env.sh
core-site.xml is the configuration file where you keep the core HDFS-related settings.
Example: the NameNode host and port, the local directory where NameNode-related data can be saved, etc.
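A minimal sketch for a single-node setup (hdfs://localhost:9000 is an assumed NameNode address; the heredoc simply writes the whole file in one step):

# Write a minimal core-site.xml pointing the default file system at the local NameNode
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF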
8 - Edit the hdfs-site.xml file.
Before we go ahead and edit this file, we need to create the local directories that will contain the NameNode and DataNode data for this Hadoop installation.
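For example (the paths are assumptions; any location owned by the hadoop user works):

# Create the local storage directories for the NameNode and the DataNode
mkdir -p /home/hadoop/hadoopdata/hdfs/namenode
mkdir -p /home/hadoop/hadoopdata/hdfs/datanode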
You need to point to the locations of the NameNode and DataNode directories, as well as set the replication value.
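A minimal sketch using the directories created above and a replication factor of 1, which is appropriate for a single node:

# Write a minimal hdfs-site.xml
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
EOF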
9 - Edit the mapred-site.xml file.
The mapred-site.xml file is used to specify which framework is being used for MapReduce.
First, find the mapred-site.xml.template and create the mapred-site.xml file from it.
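# Copy the template shipped with Hadoop into place
cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml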
Edit the mapred-site.xml
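A minimal sketch that tells MapReduce to run on top of YARN:

# Write a minimal mapred-site.xml
cat > $HADOOP_HOME/etc/hadoop/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF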
10 - Format the file system that we allocated to HDFS
This command should be executed once, before we start using Hadoop. If it is executed again after Hadoop has been used, it will destroy all the data on the Hadoop file system.
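# As the hadoop user, format the NameNode (run this only once)
hdfs namenode -format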
11 - Start Hadoop
To start Hadoop you need to use the start-all.sh script; this script is located in the sbin directory of your Hadoop installation.
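# Start the HDFS and YARN daemons (sbin is already on the PATH from step 5)
start-all.sh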
To check on the services run the following command:
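# List the running Java processes; you should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager
jps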
12 - Stop the Hadoop services
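To stop everything started by start-all.sh:

# Stop all the Hadoop daemons
stop-all.sh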
13 - Using the Hadoop Web interface.
For this we need the Hadoop services running, and then we can access port 50070, which serves the NameNode web UI.
Open a browser and enter the IP address of the Hadoop host followed by port 50070, for example http://<hadoop-host-ip>:50070.
With that, we have installed our Hadoop instance and have it up and running.
Nowadays we have Hortonworks, Cloudera, and MapR, which provide nice GUI tools to help us with Hadoop ecosystem installations, but I believe every Hadoop admin (or wannabe Hadoop admin) should know how to do all of these tasks and understand each component involved.
In upcoming tutorials we will see how to run a MapReduce job in Hadoop.