How to install Apache Spark on CentOS (Standalone)

What is Apache Spark? Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. Spark runs on Hadoop, Mesos, standalone, or in the cloud, and it can access diverse data sources including HDFS, Cassandra, HBase, and S3. In this short tutorial we will go through the steps to install Apache Spark on a CentOS Linux box as a standalone installation.

First we need to make sure we have Java installed:

Install Java 
[root@aodba ~]# yum install java-1.8.0-openjdk* -y

..
[root@aodba ~]# java -version
openjdk version "1.8.0_101"
OpenJDK Runtime Environment (build 1.8.0_101-b13)
OpenJDK 64-Bit Server VM (build 25.101-b13, mixed mode)
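
Optionally, you can also point JAVA_HOME at this JDK so that Spark's launch scripts pick it up explicitly. This is a minimal sketch; the path below is an assumption based on the default OpenJDK 1.8.0 layout on CentOS, so verify it on your box first:
# confirm where the java binary actually lives (follows the alternatives symlinks)
readlink -f $(which java)
# assumed OpenJDK 1.8.0 JRE location on CentOS -- adjust to match the output above
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export PATH=$PATH:$JAVA_HOME/bin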

Next, we need to install Scala. (The Spark 2.0.0 binaries we install below bundle their own Scala 2.11 runtime for spark-shell, as you will see in the shell banner later, so a system-wide Scala is mainly handy for building your own Spark applications.)

Install Scala
wget http://www.scala-lang.org/files/archive/scala-2.10.1.tgz
tar xvf scala-2.10.1.tgz
sudo mv scala-2.10.1 /usr/lib
sudo ln -s /usr/lib/scala-2.10.1 /usr/lib/scala
export PATH=$PATH:/usr/lib/scala/bin
...
...
[root@aodba~]# scala -version
Scala code runner version 2.10.1 -- Copyright 2002-2013, LAMP/EPFL
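
The export above only lasts for the current shell session. To keep Scala on the PATH in new sessions, you can append the same line to ~/.bash_profile (a minimal sketch, assuming the default bash login shell on CentOS):
# make the Scala PATH change permanent for future login shells
echo 'export PATH=$PATH:/usr/lib/scala/bin' >> ~/.bash_profile
source ~/.bash_profile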

Install Apache Spark

Download Spark
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.7.tgz
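If the CloudFront link above is no longer reachable, the same release can usually be fetched from the Apache release archive (the exact URL is an assumption based on how Apache lays out archived Spark releases):
# fallback download location for the Spark 2.0.0 / Hadoop 2.7 binary package
wget https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz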
Extract the archive, create a new directory called spark under /usr/local, and copy the extracted content into it:
tar xf spark-2.0.0-bin-hadoop2.7.tgz
mkdir /usr/local/spark
cp -r spark-2.0.0-bin-hadoop2.7/* /usr/local/spark
Set up a few environment variables before you start spark-shell. Add the following lines to your ~/.bash_profile so they persist across sessions, then reload the profile:
export SPARK_EXAMPLES_JAR=/usr/local/spark/examples/jars/spark-examples_2.11-2.0.0.jar
export PATH=$PATH:$HOME/bin:/usr/local/spark/bin
source ~/.bash_profile
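You may also want to export SPARK_HOME, since many Spark scripts and third-party tools look for it. To sanity-check the installation before opening the shell, you can run one of the examples bundled with the distribution; the sketch below assumes the PATH change above has been reloaded (SparkPi's argument is simply the number of partitions to use):
# commonly expected by Spark tooling; points at the install directory we created
export SPARK_HOME=/usr/local/spark
# run a bundled example locally to confirm the install works (prints an estimate of Pi)
run-example SparkPi 10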
Start the Spark Scala shell and run your first Spark RDD operation.
  • We will read a file from root's home directory called anaconda-ks.cfg and then run a line count on it.
[root@vnode ~]# spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/07/25 17:58:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/25 17:58:09 WARN Utils: Your hostname, vnode resolves to a loopback address: 127.0.0.1; using 192.168.15.205 instead (on interface eth1)
16/07/25 17:58:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/07/25 17:58:11 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://192.168.15.205:4040
Spark context available as 'sc' (master = local[*], app id = local-1469433490620).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val file = sc.textFile("/root/anaconda-ks.cfg");
file: org.apache.spark.rdd.RDD[String] = /root/anaconda-ks.cfg MapPartitionsRDD[1] at textFile at <console>:24

scala> file.count();
res0: Long = 25

scala>
Or let's get the first line/item in the RDD:
scala> file.first();
res1: String = # Kickstart file automatically generated by anaconda.
Or find the number of words in the longest line of the RDD:
scala> file.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b));
res2: Int = 11
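
As a quick cross-check outside of Spark, you can compare the count above with plain shell tools (assuming the same file; wc counts newline-terminated lines, so the two numbers should normally match):
# count lines the old-fashioned way and compare with file.count()
wc -l /root/anaconda-ks.cfg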
These operations might seem simple, but of course we are not going to use Spark just to count lines in a small file; Spark is an amazing piece of tech and it kicked MapReduce's ass :). I will start a series on Apache Spark from A to Z soon. Hope this was helpful!