What is Apache Spark?
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
In this short tutorial we will walk through the steps to install Apache Spark on a Linux CentOS box as a standalone Spark installation.
First we need to make sure we have Java installed:
Install Java
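On CentOS, one straightforward way to install a JDK is through yum. This is a sketch, assuming the OpenJDK 8 package; the exact package name may differ by CentOS release:

```shell
# Install OpenJDK 8 (package name may vary by CentOS release)
sudo yum install -y java-1.8.0-openjdk-devel

# Verify the installation
java -version
```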
Next, we need to install Scala:
Install Scala
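A minimal sketch of a Scala install from the official binaries; the version and download URL here are examples, so adjust them to the Scala version that matches your Spark build:

```shell
# Download and unpack the Scala binaries (version is an example)
wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
tar xzf scala-2.11.8.tgz
sudo mv scala-2.11.8 /usr/local/scala

# Make scala available on the PATH and verify
export PATH=$PATH:/usr/local/scala/bin
scala -version
```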
Install Apache Spark
Download Spark
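Spark ships prebuilt packages you can fetch directly. The version and Hadoop build below are examples only; pick the release you want from spark.apache.org/downloads:

```shell
# Download a prebuilt Spark package (version/build are examples)
wget https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
```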
Extract the archive, create a new directory under /usr/local called spark, and copy the extracted content into it.
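The extract-and-copy step above can be sketched as follows (the archive name assumes the example package downloaded earlier):

```shell
# Unpack the downloaded archive (name matches the example download)
tar xzf spark-2.4.8-bin-hadoop2.7.tgz

# Create /usr/local/spark and copy the extracted content into it
sudo mkdir -p /usr/local/spark
sudo cp -r spark-2.4.8-bin-hadoop2.7/* /usr/local/spark/
```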
Set up some environment variables before you start spark-shell:
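A minimal set of variables, assuming Spark was copied to /usr/local/spark as above. Add them to ~/.bashrc so they persist across sessions:

```shell
# Point SPARK_HOME at the install directory and expose its binaries
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
```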
Start your Scala shell and run your first Spark RDD.
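With the environment variables in place, the interactive shell can be launched directly:

```shell
# Launch the interactive Scala shell for Spark
$SPARK_HOME/bin/spark-shell
```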
We will read a file from the root path called anaconda-ks.cfg and then apply a line count to it.
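Inside spark-shell a SparkContext is already available as `sc`. A sketch of the line count, assuming the kickstart file lives at /root/anaconda-ks.cfg:

```scala
// Read the file into an RDD of lines
val lines = sc.textFile("/root/anaconda-ks.cfg")

// Count the number of lines in the RDD
lines.count()
```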
Or let's get the first line/item in the RDD:
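`first()` returns the first element of the RDD; the file path is the same assumption as above:

```scala
// Return the first line of the file
sc.textFile("/root/anaconda-ks.cfg").first()
```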
Or find the longest item/line in the RDD:
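One way to sketch this is with `reduce`, which compares lines pairwise and keeps whichever is longer (same assumed file path as before):

```scala
// reduce keeps the longer of each pair of lines across the RDD
sc.textFile("/root/anaconda-ks.cfg")
  .reduce((a, b) => if (a.length > b.length) a else b)
```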
This operation might seem simple, but we are not going to use Spark just to count words in a small file; Spark is an amazing piece of tech and it kicked MapReduce's ass :).
I will start a series on Apache Spark from A to Z soon.
Hope this was helpful!