Install Apache Hadoop on Linux CentOS (Single Instance)
In this tutorial I will go over the required steps for setting up a single-node Hadoop Cluster backed by the Hadoop Distributed File System, running on CentOS Linux.
Our Linux box will run locally on top of VirtualBox, using a 64-bit CentOS image.
The goal is to get a simple Hadoop installation up and running so we can play around with it and start learning Hadoop stuff.
First we need to create and configure this box.
If you haven't done this yet, follow this tutorial to see how to install and configure VirtualBox and CentOS on your local machine.
After you have your Linux box up and running, you need to install all the prerequisites: Java, an SSH configuration and a dedicated Hadoop user.
Install Java in CentOS
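On CentOS the easiest route is OpenJDK from the default repositories; the package names below assume OpenJDK 7 on CentOS 6, so adjust them to the JDK you prefer (the -devel package also gives you tools like jps). Run as root:

    yum install java-1.7.0-openjdk java-1.7.0-openjdk-devel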
Check that Java was installed correctly:
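A quick sanity check; the exact version string will depend on the JDK you installed:

    java -version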
Create the Hadoop user and group.
We will use a dedicated Hadoop user account for running Hadoop. This is recommended because it helps separate the Hadoop installation from other software applications and user accounts running on the same machine.
We will create a group and the user.
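The group and user names below (hadoop and hdpusr) are simply the ones used throughout this tutorial; run these as root:

    groupadd hadoop
    useradd -g hadoop hdpusr
    passwd hdpusr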
Install and Configure SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine.
For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hdpusr user we created in the previous section.
Install SSH (if you already have it, skip this step):
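On CentOS the server and the client come as separate packages; as root:

    yum install openssh-server openssh-clients
    service sshd start
    chkconfig sshd on   # start sshd automatically at boot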
Set up SSH for the hdpusr user.
Run the ssh-keygen command.
Create the authorized_keys file as a copy of your id_rsa.pub key.
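A minimal passwordless-key setup, run as hdpusr; the empty -P "" passphrase is what lets Hadoop log in without prompting:

    su - hdpusr
    ssh-keygen -t rsa -P ""            # accept the default file location
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 700 ~/.ssh
    chmod 600 ~/.ssh/authorized_keys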
The final step is to test the SSH setup by connecting to your local machine as the hdpusr user.
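Still as hdpusr; the first connection will ask you to confirm the host fingerprint, after that it should log you in without a password:

    ssh localhost
    exit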
SSH access to localhost for our hdpusr user is now done.
Disable IPv6
You need to edit the /etc/sysctl.conf file and add the lines that disable IPv6.
Make them permanent and apply them:
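The settings commonly used for this on CentOS are shown below; because they live in /etc/sysctl.conf they survive reboots, and sysctl -p applies them right away (run as root):

    # appended to /etc/sysctl.conf
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1

    sysctl -p   # reload the file so the changes take effect now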
OK, now that we are done with the prerequisites, let's go ahead and install Hadoop.
You need to download the .tar.gz file from the Apache.org website (I got the latest).
Once it is on your system, extract its contents.
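For example, assuming the 1.2.1 release (substitute whatever version you actually downloaded):

    wget http://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
    tar -xzf hadoop-1.2.1.tar.gz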
Create a folder called hadoop in your /usr/local location and place the extracted Hadoop files into it:
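As root, again assuming the extracted folder is named hadoop-1.2.1:

    mkdir /usr/local/hadoop
    mv hadoop-1.2.1/* /usr/local/hadoop/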
Change the user and group ownership of the hadoop folder:
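Using the user and group we created earlier, as root:

    chown -R hdpusr:hadoop /usr/local/hadoop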
Edit the .bashrc file in hdpusr's home directory.
To find the JAVA_HOME path in CentOS:
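One way is to resolve the java binary and strip the trailing /jre/bin/java (or /bin/java) part; alternatives --display java shows the same information:

    readlink -f $(which java)
    alternatives --display java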
You also need to add some Hadoop environment variables.
Content of your .bashrc file:
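A minimal set of entries, assuming the paths used so far in this tutorial; adjust JAVA_HOME to the path you found above:

    # Java
    export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
    # Hadoop
    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin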
After you edit the .bashrc file, reload it and make sure that JAVA_HOME is set:
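Run this as hdpusr:

    source ~/.bashrc
    echo $JAVA_HOME
    echo $HADOOP_HOME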
All good.
Next we need to add our JAVA_HOME path to the hadoop-env.sh file.
If you have used the same folder names as I did, the file should be located here:
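With the Hadoop 1.x layout used in this tutorial (newer releases keep it under etc/hadoop instead):

    /usr/local/hadoop/conf/hadoop-env.sh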
If it is not there, try looking for it using the command below:
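Searching from the root of the file system and discarding permission errors:

    find / -name hadoop-env.sh 2>/dev/null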
So if you have JAVA_HOME set as we did in the last step, you won't have to edit the file.
If JAVA_HOME is not set, you need to edit the file:
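Open conf/hadoop-env.sh and uncomment/adjust the JAVA_HOME line so it points at your JDK, for example:

    # in /usr/local/hadoop/conf/hadoop-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64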
Create a temporary area for the Hadoop file system.
Since this is a small single-node Hadoop install, we will do it inside hdpusr's home directory.
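As hdpusr; the location is arbitrary, it just has to match what we put in core-site.xml below:

    mkdir -p /home/hdpusr/tmp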
Configure and edit the core-site.xml file
Add the temporary location to your core-site.xml file.
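A minimal single-node core-site.xml (found under /usr/local/hadoop/conf with this layout); hadoop.tmp.dir points at the directory we just created, and fs.default.name is the usual companion setting that tells Hadoop to use HDFS on localhost (the port 9000 is just a common choice):

    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hdpusr/tmp</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>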
Edit the mapred-site.xml file.
Add the mapred.job.tracker property, which tells Hadoop where the JobTracker runs:
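Again, the localhost port is just a conventional choice for a single-node setup:

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>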
Edit the hdfs-site.xml file.
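The typical single-node setting here is the block replication factor, which we set to 1 because there is only one DataNode:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>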
If you need more info on the configuration file contents, see the Hadoop Guide.
Next we need to format the HDFS filesystem using the NameNode.
At this point we are ready to put HDFS on top of the local file system, and for this we need to format it.
The script will use the dfs.name.dir variable to point to the newly formatted HDFS.
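Run this once, as hdpusr (formatting again later would wipe the existing HDFS data):

    /usr/local/hadoop/bin/hadoop namenode -format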
The output will be something like:
Start HDFS on our single-node Hadoop cluster:
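With the Hadoop 1.x scripts, start-all.sh brings up both the HDFS daemons (NameNode, SecondaryNameNode, DataNode) and the MapReduce daemons (JobTracker, TaskTracker); run it as hdpusr:

    /usr/local/hadoop/bin/start-all.sh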
Check if the services are up and running with the jps command, or use the -l option to also see the full package names.
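jps comes with the JDK; on a healthy single-node install you should see the NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker processes listed:

    jps
    jps -l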
Nice, it is up and running.
Also, if you want to debug using the logs, you can find them at:
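By default they live under the Hadoop installation directory, which in this setup is:

    /usr/local/hadoop/logs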
Stop all Hadoop components:
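The counterpart of the start script, again run as hdpusr:

    /usr/local/hadoop/bin/stop-all.sh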
Check that the Hadoop services are stopped by running jps again; the Hadoop processes should no longer be listed.
That is it, boys: we have Hadoop installed and ready to be used as our learning lab.
In the next tutorial we will see how we can run a simple MapReduce job.
Hope this was useful...