How to set up PySpark for your Jupyter notebook

PySpark allows Python programmers to interface with the Spark framework to manipulate data at scale and work with objects over a distributed filesystem.

Apache Spark is one of the hottest frameworks in data science. It realizes the potential of bringing together big data and machine learning. This is because:

It offers robust, distributed, fault-tolerant data objects (called RDDs).
It is fast (up to 100x faster than traditional Hadoop MapReduce) due to in-memory operation.
It integrates beautifully with the world of machine learning and graph analytics through supplementary packages like MLlib and GraphX.

Spark is commonly run on top of Hadoop/HDFS and is written mostly in Scala, a functional programming language that runs on the Java virtual machine (JVM).

However, Scala is not a great first language to learn when venturing into the world of data science. Fortunately, Spark provides a wonderful Python API called PySpark. PySpark allows Python programmers to interface with the Spark framework—letting them manipulate data at scale and work with objects over a distributed filesystem.

Why use Jupyter Notebook?

The promise of a big data framework like Spark is realized only when it runs on a cluster with a large number of nodes. Unfortunately, to learn and practice that, you have to spend money. Some options are:

Amazon Elastic MapReduce (EMR) cluster with S3 storage
Databricks cluster (paid version; the free community version is rather limited in storage and clustering options)

These options cost money, even to start learning (for example, Amazon EMR is not included in the one-year Free Tier program, unlike EC2 instances or S3 storage).

However, if you are proficient in Python/Jupyter and machine learning tasks, it makes perfect sense to start by spinning up a single-node cluster on your local machine. You could also run one on Amazon EC2 if you want more storage and memory.

Remember, Spark is not a new programming language you have to learn; it is a framework working on top of HDFS. This presents new concepts like nodes, lazy evaluation, and the transformation-action (or "map and reduce") paradigm of programming.
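To make lazy evaluation and the transformation-action idea a bit more concrete, here is a toy sketch in plain Python. It does not use Spark at all; the LazyDataset class and its methods are hypothetical names I made up for illustration. They only loosely mirror the way Spark records transformations and runs them when an action is called.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable, such as a list or range
        self._steps = tuple(steps)   # recorded transformations, not yet executed

    # Transformations: return a new dataset and do no work on the data.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _iterate(self):
        # Replay the recorded steps over each item, one item at a time.
        for item in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    item = func(item)
                elif kind == "filter" and not func(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: these are the only methods that touch the data.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


calls = []  # records every time the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


pipeline = LazyDataset(range(6)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has run yet
result = pipeline.collect()       # the action triggers the recorded steps
print(calls_before_action, result)
```

Building the pipeline only records what should happen; the square function is not called until collect() asks for results.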

Spark is also versatile enough to work with filesystems other than Hadoop, such as Amazon S3 or Databricks (DBFS).

But the idea is always the same. You distribute (and replicate) your large dataset in small, fixed chunks over many nodes, then bring the compute engine close to them to make the whole operation parallelized, fault-tolerant, and scalable.
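The chunk-and-parallelize idea can be sketched on a single machine with nothing but the Python standard library. This is only an analogy under simplifying assumptions: a thread pool stands in for the cluster nodes, there is no replication or fault tolerance, and the helper names (split_into_chunks, chunk_sum_of_squares, parallel_sum_of_squares) are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Cut a list into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_sum_of_squares(chunk):
    """Work done independently on one chunk (the 'map' side)."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, chunk_size=250, workers=4):
    chunks = split_into_chunks(data, chunk_size)
    # Each worker processes whole chunks; the workers play the role of nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(chunk_sum_of_squares, chunks))
    # Combine the per-chunk results (the 'reduce' side).
    return sum(partial_sums)


numbers = list(range(1000))
total = parallel_sum_of_squares(numbers)
print(total)
```

Spark does the same kind of split, compute, and combine, except that the chunks live on different machines and lost chunks can be recomputed.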

By working with PySpark and Jupyter Notebook, you can learn all these concepts without spending anything. You can also easily interface with SparkSQL and MLlib for database manipulation and machine learning.

It will be much easier to start working with real-life large clusters if you have internalized these concepts beforehand.

However, starting with PySpark is not as straightforward as the pip install and import workflow that most users with a Python background take for granted. The PySpark+Jupyter combo needs a little more love than other popular Python packages.

In this brief tutorial, I'll go over, step-by-step, how to set up PySpark and all its dependencies on your system and integrate it with Jupyter Notebook.

This tutorial assumes you are using a Linux OS. That's because in real life you will almost always run and use Spark on a cluster using a cloud service like AWS or Azure. Those cluster nodes probably run Linux.

It is wise to get comfortable with a Linux command-line-based setup process for running and learning Spark. If you're using Windows, you can set up an Ubuntu distro on a Windows machine using Oracle VirtualBox.

Installation and setup

Python 3.4+ is required for the latest version of PySpark, so make sure you have it installed before continuing. (Earlier Python versions will not work.)

python3 --version
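If you prefer to check the version from inside Python (for example, in a notebook cell later on), a small sketch like the following works. The 3.4 minimum is taken from the requirement stated above; the function name python_is_supported is just an illustrative choice.

```python
import sys

MINIMUM_PYTHON = (3, 4)  # minimum version stated in this tutorial


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


# Show the running interpreter's version and whether it qualifies.
print(sys.version.split()[0], python_is_supported())
```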

Install the pip3 tool.

sudo apt install python3-pip

Install Jupyter for Python 3.

pip3 install jupyter

Augment the PATH variable to launch Jupyter Notebook easily from anywhere.

export PATH=$PATH:~/.local/bin
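To confirm that the PATH change took effect, you can run a quick check from Python. This is a minimal sketch; directory_on_path is a hypothetical helper name, and it assumes pip3 placed the jupyter launcher in ~/.local/bin, as it does for per-user installs on Ubuntu.

```python
import os
import shutil


def directory_on_path(directory, path_value):
    """Return True if directory is one of the entries in a PATH-style string."""
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return target in entries


local_bin = "~/.local/bin"
print("on PATH:", directory_on_path(local_bin, os.environ.get("PATH", "")))
# shutil.which returns the full path of the launcher, or None if it is not found.
print("jupyter found at:", shutil.which("jupyter"))
```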

Choose a Java version. This is important; there are more variants of Java than there are cereal brands in a modern American store. Java 8 works with Ubuntu 18.04 LTS and spark-2.3.1-bin-hadoop2.7, so we will go with that version.

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default

Check the installation.

java -version

Set some Java-related environment variables.

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JRE_HOME=/usr/lib/jvm/java-8-oracle/jre
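A quick way to double-check the Java setup is to inspect the environment from Python. The sketch below is only a sanity check with a made-up helper name (check_java); it reports what it finds and does not verify the Java version itself.

```python
import os
import shutil


def check_java(env=None):
    """Report whether JAVA_HOME points to a directory and java is on PATH."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    return {
        "java_home": java_home,
        "java_home_exists": bool(java_home) and os.path.isdir(java_home),
        # Full path of the java launcher, or None if it cannot be found.
        "java_on_path": shutil.which("java", path=env.get("PATH")),
    }


print(check_java())
```

If java_home_exists is False, the path in the export lines probably does not match where the installer put Java on your machine.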

Install Scala.

sudo apt-get install scala

Check the Scala installation.

scala -version

Install py4j for the Python-Java integration.

pip3 install py4j

Install Apache Spark. Go to the Spark download page and choose the latest (default) version. I am using Spark 2.3.1 with Hadoop 2.7. After downloading, unpack it in the location where you want to use it.

sudo tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
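If you would rather unpack the archive from Python, the standard tarfile module can do the same job. This is a sketch, not a replacement for the command above: unpack_spark is a hypothetical helper name, and you should only extract archives you downloaded from the official Spark download page.

```python
import tarfile


def unpack_spark(archive_path, destination):
    """Extract a .tgz archive and return its top-level entry names."""
    with tarfile.open(archive_path, "r:gz") as archive:
        top_level = sorted({member.name.split("/")[0] for member in archive.getmembers()})
        # Newer Python versions support extraction filters; use the safer
        # "data" filter when it is available and fall back otherwise.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(destination, **kwargs)
    return top_level


# Example call (paths are placeholders for your own locations):
# unpack_spark("spark-2.3.1-bin-hadoop2.7.tgz", "/path/to/your/spark/directory")
```

The returned list should contain a single entry, the spark-2.3.1-bin-hadoop2.7 directory that the rest of this tutorial refers to.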

Now, add a long set of commands to your .bashrc shell script. These will set environment variables to launch PySpark with Python 3 and enable it to be called from Jupyter Notebook. Take a backup of .bashrc before proceeding.

Open .bashrc using any editor you like, such as gedit .bashrc. Add the following lines at the end:

export SPARK_HOME='/{YOUR_SPARK_DIRECTORY}/spark-2.3.1-bin-hadoop2.7'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin

Remember to replace {YOUR_SPARK_DIRECTORY} with the directory where you unpacked Spark above.
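After opening a new terminal (or running source ~/.bashrc), you can sanity-check these variables from Python. The sketch below uses a made-up helper, find_setup_problems, and checks only the variables listed above; it assumes the Spark download contains a python/ subdirectory, which is what the PYTHONPATH line relies on.

```python
import os

EXPECTED_VARIABLES = (
    "SPARK_HOME",
    "PYTHONPATH",
    "PYSPARK_DRIVER_PYTHON",
    "PYSPARK_DRIVER_PYTHON_OPTS",
    "PYSPARK_PYTHON",
)


def find_setup_problems(env=None):
    """Return a list of human-readable problems found in the environment."""
    env = os.environ if env is None else env
    problems = [name + " is not set" for name in EXPECTED_VARIABLES if not env.get(name)]
    spark_home = env.get("SPARK_HOME", "")
    if "{YOUR_SPARK_DIRECTORY}" in spark_home:
        # The placeholder from the .bashrc lines was never replaced.
        problems.append("SPARK_HOME still contains the {YOUR_SPARK_DIRECTORY} placeholder")
    elif spark_home and not os.path.isdir(os.path.join(spark_home, "python")):
        problems.append("SPARK_HOME has no python/ subdirectory: " + spark_home)
    return problems


for problem in find_setup_problems():
    print("WARNING:", problem)
```

No output means the variables are set and SPARK_HOME looks like an unpacked Spark directory.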

You can check your Spark setup by going to the bin directory inside your Spark installation ($SPARK_HOME) and running the ./spark-shell --version command. Here you can see which version of Spark you have and which versions of Java and Scala it is using.

That's it! Now you should be able to spin up a Jupyter Notebook and start using PySpark from anywhere.

For example, if I create a directory ~/Spark/PySpark_work and work from there, I can launch Jupyter by running jupyter notebook in that directory.

But wait… where did I call something like pip install pyspark?

I didn't. PySpark is bundled with the Spark download package and works by setting environment variables and bindings properly. So you are all set to go now!
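To see this for yourself, you can run a small smoke test in a notebook cell. This is a minimal sketch under a few assumptions: the helper names (pyspark_location, run_smoke_test) and the app name are made up, and local[*] simply asks Spark to run on all cores of the local machine. The first function shows where Python finds pyspark, which should be inside your Spark directory if the PYTHONPATH line is doing its job.

```python
import importlib.util
import os


def pyspark_location():
    """Return the directory pyspark would be imported from, or None."""
    spec = importlib.util.find_spec("pyspark")
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


def run_smoke_test():
    """Start a local Spark session, count the even numbers below 1000, and stop."""
    from pyspark.sql import SparkSession  # imported here so the check above works without Spark

    spark = (
        SparkSession.builder
        .master("local[*]")          # run locally on all available cores
        .appName("pyspark-smoke-test")
        .getOrCreate()
    )
    try:
        numbers = spark.sparkContext.parallelize(range(1000))
        return numbers.filter(lambda x: x % 2 == 0).count()
    finally:
        spark.stop()  # note: this also stops a session that already existed


print("pyspark found in:", pyspark_location())
# In a notebook with Spark set up as described above, run:
# print(run_smoke_test())
```

If pyspark_location() prints None, revisit the SPARK_HOME and PYTHONPATH lines in .bashrc before trying the smoke test.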

Next on this topic

I am working on a detailed introductory guide to PySpark DataFrame operations. If you have any questions or ideas to share, please contact me at tirthajyoti[AT]gmail.com. If you are, like me, passionate about machine learning and data science, please add me on LinkedIn or follow me on Twitter. Also, check my GitHub repo for other fun code snippets in Python, R, or MATLAB and some other machine learning resources.

Originally published on FreeCodeCamp. Licensed under CC BY-SA 4.0.