What is Apache Spark? Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. Spark runs on Hadoop, Mesos, standalone, or in the cloud, and it can access diverse data sources including HDFS, Cassandra, HBase, and S3. In this short tutorial we will walk through the steps to install Apache Spark on a Linux CentOS box as a standalone installation.

First, install the OpenJDK package (Spark requires a working Java runtime) and verify it:
[root@aodba ~]# yum install java-1.8.0-openjdk* -y
..
[root@aodba ~]# java -version
openjdk version "1.8.0_101"
OpenJDK Runtime Environment (build 1.8.0_101-b13)
OpenJDK 64-Bit Server VM (build 25.101-b13, mixed mode)
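Spark only needs the java binary on the PATH, but if other tools on the box expect JAVA_HOME you can set it as well. A minimal sketch; the readlink chain simply resolves the real location of the JDK that yum just installed, which will differ between systems:

# resolve /usr/bin/java through the alternatives symlinks to the installed JDK's jre directory
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
echo $JAVA_HOME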
Next, download and install Scala. Note that the pre-built Spark 2.0.0 package bundles its own Scala (2.11), so this system-wide Scala is only needed if you want to run Scala outside of Spark:
wget http://www.scala-lang.org/files/archive/scala-2.10.1.tgz
tar xvf scala-2.10.1.tgz
sudo mv scala-2.10.1 /usr/lib
sudo ln -s /usr/lib/scala-2.10.1 /usr/lib/scala
export PATH=$PATH:/usr/lib/scala/bin
...
Verify the Scala installation:
[root@aodba ~]# scala -version
Scala code runner version 2.10.1 -- Copyright 2002-2013, LAMP/EPFL
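As an optional sanity check, the code runner can also evaluate an expression directly:

scala -e 'println("Scala is working")'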
Now download a pre-built Spark package (Spark 2.0.0 built for Hadoop 2.7) and install it under /usr/local/spark:
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.7.tgz
tar xf spark-2.0.0-bin-hadoop2.7.tgz
mkdir /usr/local/spark
cp -r spark-2.0.0-bin-hadoop2.7/* /usr/local/spark
Add the Spark environment variables to your ~/.bash_profile so they persist across logins, then reload the profile:
export SPARK_EXAMPLES_JAR=/usr/local/spark/examples/jars/spark-examples_2.11-2.0.0.jar
export PATH=$PATH:$HOME/bin:/usr/local/spark/bin
source ~/.bash_profile
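With the environment loaded you can already sanity-check the installation by submitting one of the example jobs bundled with Spark. A minimal sketch using the examples jar exported above; the final argument (10) is the number of partitions SparkPi samples across:

spark-submit --class org.apache.spark.examples.SparkPi --master local[*] $SPARK_EXAMPLES_JAR 10
# a successful run prints a line like: Pi is roughly 3.14...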
Start the Spark shell to verify the installation:
[root@vnode ~]# spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/07/25 17:58:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/25 17:58:09 WARN Utils: Your hostname, vnode resolves to a loopback address: 127.0.0.1; using 192.168.15.205 instead (on interface eth1)
16/07/25 17:58:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/07/25 17:58:11 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://192.168.15.205:4040
Spark context available as 'sc' (master = local[*], app id = local-1469433490620).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.
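The startup output notes that you can adjust the logging level; for example, to silence the WARN messages shown above:

scala> sc.setLogLevel("ERROR")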
With the shell running, test Spark with a few simple RDD operations on a local file:
scala> val file = sc.textFile("/root/anaconda-ks.cfg")
file: org.apache.spark.rdd.RDD[String] = /root/anaconda-ks.cfg MapPartitionsRDD[1] at textFile at <console>:24

scala> file.count()
res0: Long = 25

scala> file.first()
res1: String = # Kickstart file automatically generated by anaconda.

scala> file.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res2: Int = 11
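Continuing in the same session, a short word-count sketch over the same file (illustrative only; the pairs returned depend on the file's contents):

scala> val words = file.flatMap(line => line.split(" "))
scala> val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.take(5)   // returns the first five (word, count) pairs

Type :quit to leave the shell.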