A layman’s guide to installing and setting up Apache Spark

https://databricks.com/spark/about

Apache Spark™ is a unified analytics engine for large-scale data processing, and setting it up is one of the first steps towards Big Data Analytics. So let’s get started.

Most current cluster programming models are based on acyclic data flow, that is, data moves from stable storage to stable storage. For instance, Hadoop MapReduce reads data from persistent storage (HDFS) in the map step and writes the results back to persistent storage in the reduce step. This makes it convenient to dynamically decide which machines run which tasks and to handle failures.

That architecture is fine until we have to iterate over the data. Iterative workloads are very expensive in most cluster computing models, because every pass has to go back through stable storage. Here’s where Apache Spark comes to the rescue.

Spark can be up to 100x faster than Hadoop for large-scale data processing by exploiting in-memory computing and other optimizations. Spark is also fast when data is stored on disk, and currently holds the world record for large-scale on-disk sorting!
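To make that concrete: once Spark is installed (we’ll do that below), the Spark shell gives you a ready-made SparkSession called spark, and an iterative job can keep its data cached in memory between passes. Here’s a small, hypothetical Scala sketch of that pattern; the dataset size and the five passes are just an illustration.

// Hypothetical sketch, typed at the spark-shell prompt:
// cache the data once, then run several passes over it in memory.
val data = spark.range(1, 1000001).cache() // one million rows, kept in memory after the first use
data.count()                               // the first action materialises the cache
for (i <- 1 to 5) {
  // each pass reads the cached copy instead of going back to disk
  println(s"pass $i, sum = " + data.selectExpr("sum(id)").first().getLong(0))
}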

Now let’s get our hands dirty, shall we? So the prerequisites for this installation guide are:

  • An Ubuntu system. (We developers love open source more than anything)

Apache Spark has a few dependencies, and we need to get them before we get started.

  1. JDK
  2. Scala
  3. Git

Perks of using Linux? We can easily install all of these dependencies at once.

sudo apt install default-jdk scala git -y

The installation can be verified by the following command:

java -version; javac -version; scala -version; git --version

Now it’s time to download the MVP, Apache Spark. To do this, we will use the wget command with a direct link to the package. (Please check https://downloads.apache.org/spark/ for the latest available version.)

wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz

Let’s extract the package and move the folder to /opt/spark

tar xvf spark-*
sudo mv spark-3.0.1-bin-hadoop2.7 /opt/spark

Configuring the environment is one of the trickiest parts, so be careful about typos and directories. We need to add a few lines to the user profile, which is generally found at ~/.profile. One way to add the lines is by running echo as shown below (note the single quotes: they stop the shell from expanding the variables while the lines are being written, so they are only expanded when the profile is loaded):

echo "export SPARK_HOME=/opt/spark" >> ~/.profile
echo "export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin" >> ~/.profile
echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile

Now reload the user profile so the changes take effect:

source ~/.profile

Let’s test the Spark shell, shall we?

Run spark-shell in your terminal. If everything is in place, you should see the Spark ASCII banner with the version number, followed by a scala> prompt.
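While you’re at the scala> prompt, a quick sanity check never hurts. The lines below are just an illustrative sketch: they use the SparkSession named spark that spark-shell pre-defines for you, and the expected outputs are 100 and 5050.

// A quick sanity check at the scala> prompt, using the pre-defined SparkSession (spark).
val nums = spark.range(1, 101)                          // a Dataset with the numbers 1..100
println(nums.count())                                   // should print 100
println(nums.selectExpr("sum(id)").first().getLong(0))  // should print 5050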

The shell shown is specific to Scala; if you want, you can always run the Spark shell in Python as well. To do so, quit the Scala shell by typing :q and then enter the Python Spark shell using the command pyspark.

Congratulations! Spark is up and running! Now, before we leave, let’s part with a few basic commands.

To start a master server instance on the current machine:

start-master.sh

In this single-server, standalone setup, to start one worker (slave) process alongside the master, run the following command in this format (remember to type in your hostname and port in place of master:port; the default master port is 7077):

start-slave.sh spark://master:port
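To check that this little cluster actually accepts work, here is a minimal sketch of a standalone Scala application that connects to it. The master URL spark://localhost:7077 is only an assumption for illustration; use your own master URL, and build the app against the Spark libraries (e.g. with sbt) before running it.

import org.apache.spark.sql.SparkSession

object StandaloneSmokeTest {
  def main(args: Array[String]): Unit = {
    // spark://localhost:7077 is a placeholder; replace it with your master's URL
    val spark = SparkSession.builder()
      .appName("StandaloneSmokeTest")
      .master("spark://localhost:7077")
      .getOrCreate()

    // a trivial job: count one million generated rows on the cluster
    val rows = spark.range(1, 1000001).count()
    println(s"Counted $rows rows on the standalone cluster")

    spark.stop()
  }
}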

To stop the master instance we can run:

stop-master.sh

To stop a running worker process:

stop-slave.sh

There’s a lot to learn and a lot to experiment with, but until next time!

