An introduction to Spark Streaming

A guide to Apache Spark Streaming


Apache Spark is an open source cluster computing framework. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s in-memory primitives provide performance up to 100 times faster for certain applications.

Spark Streaming was launched as a part of Spark 0.7, came out of alpha in Spark 0.9, and has been fairly stable from the beginning. It can be used in many near-real-time use cases, such as monitoring the flow of users on a website or detecting fraudulent transactions in real time.

Why is Spark Streaming needed?

A stateful stream processing system is a system that needs to update its state as data streams in. Latency should be low for such a system, and even if a node fails, the state should not be lost (for example, computing the distance covered by a vehicle based on a stream of its GPS locations, or counting the occurrences of the word "spark" in a stream of data).
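To make the idea concrete, here is a minimal plain-Scala sketch (not the Spark API) of state carried across batches: each incoming batch updates a running count of the word "spark", the way a stateful streaming job updates its state batch by batch.

```scala
// Plain-Scala sketch (not the Spark API) of stateful stream processing:
// each incoming batch updates a running count of the word "spark".
def updateState(runningCount: Long, batch: Seq[String]): Long =
  runningCount + batch.count(_ == "spark")

// Fold the state-update function over a sequence of batches.
def countSpark(batches: Seq[Seq[String]]): Long =
  batches.foldLeft(0L)(updateState)
```

In a real streaming system, the point is that this running count must survive node failures; Spark Streaming achieves this by recomputing lost state from lineage, as described later in this article.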

Batch processing systems like Hadoop have high latency and are not suitable for near-real-time processing requirements. Storm guarantees that each record is processed at least once, but this can lead to inconsistency, as a record may be processed more than once. If a node running Storm goes down, the state is lost. In most environments, Hadoop and Storm (or another stream processing system) are used together for batch processing and stream processing, respectively. Using two different programming models increases code size, the number of bugs to fix, and development effort, and introduces a learning curve, among other issues. Spark Streaming fixes these issues and provides a scalable, efficient, resilient system that integrates with batch processing.

In Spark Streaming, the incoming data is divided into batches of Resilient Distributed Datasets (RDDs), which are processed by the Spark engine to produce a processed stream of batches. The processed stream can be written to a file system. The batch size can be as low as 0.5 seconds, giving an end-to-end latency of less than 1 second.
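This batch-of-RDDs model can be sketched with a minimal streaming word count (a rough sketch, not runnable standalone: it assumes a Spark deployment, and the hostname and port are placeholders):

```scala
// Hedged sketch of a minimal Spark Streaming job. Data arriving on a
// TCP socket is cut into 1-second batches, each batch is processed by
// the Spark engine, and the results are printed.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("NetworkWordCount")
  .setMaster("local[2]") // local mode; 2 threads so the receiver has one

val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batches

val lines = ssc.socketTextStream("localhost", 9999) // raw TCP source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()

ssc.start()            // start receiving and processing batches
ssc.awaitTermination() // block until the job is stopped
```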

Spark Streaming provides an API in Scala, Java, and Python. The Python API, introduced in Spark 1.2, still lacks many features. Spark Streaming allows stateful computations, that is, maintaining state based on data arriving in a stream. It also allows window operations: the developer specifies a time frame and performs operations on the data flowing within that window. The window has a sliding interval, which determines how often the window is updated. If I define a time window of 10 seconds with a sliding interval of 2 seconds, I perform my computation on the data that arrived in the stream during the past 10 seconds, and the window updates every 2 seconds.
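The window semantics above can be simulated in plain Scala (not the Spark API): with a 10-second window, a 2-second slide, and a 2-second batch interval, each computation sees the last 5 batches, and the window advances one batch at a time.

```scala
// Plain-Scala sketch (not the Spark API) of windowing semantics.
// Each element of batchCounts stands for one 2-second batch; a
// 10-second window covers 5 such batches and slides by one batch.
def windowedSums(batchCounts: Seq[Int], windowBatches: Int): Seq[Int] =
  batchCounts.sliding(windowBatches, 1).map(_.sum).toSeq
```

In Spark Streaming itself, this role is played by window operations such as `reduceByKeyAndWindow`, with the window length and sliding interval passed as durations.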

Apart from these features, the strength of Spark Streaming lies in its ability to combine with batch processing. It's possible to create an RDD using normal Spark programming and join it with a Spark stream. Moreover, the code base is similar, which allows easy migration if required, and there is little to no learning curve for Spark users.

In the example below, wordCounts is a DStream; using the DStream's transform operation, it can be joined with another RDD, spamInfoRDD, which is generated as part of a normal Spark job.

val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(…) // RDD containing spam information

val cleanedDStream = wordCounts.transform { rdd =>
  rdd.join(spamInfoRDD).filter(…) // join data stream with spam information to do data cleaning
}

Spark Streaming can ingest data from Kafka, Flume, HDFS, or a raw TCP stream, and it lets users create a stream out of RDDs: you provide your own RDDs, and Spark treats them as a stream of RDDs. It even allows you to write your own receiver. The Scala and Java APIs can read data from all of these sources, whereas the Python API can read only from a TCP network source.
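The RDD-as-stream path can be sketched with queueStream (a rough sketch, not runnable standalone; it assumes an existing StreamingContext named ssc):

```scala
// Hedged sketch (assumes an existing StreamingContext `ssc`): feed
// your own RDDs into Spark Streaming as a queue, and Spark treats
// each queued RDD as one batch of the stream.
import scala.collection.mutable
import org.apache.spark.rdd.RDD

val rddQueue = new mutable.Queue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)

// Each enqueued RDD becomes one micro-batch of the stream.
rddQueue += ssc.sparkContext.makeRDD(1 to 100)

inputStream.map(_ * 2).print()
```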

Fault tolerance is the capability of a system to overcome failure. Fault tolerance in Spark Streaming is similar to fault tolerance in Spark: like RDD partitions, DStream data is recomputed in case of a failure. The raw input is replicated in memory across the cluster, so if a node fails, the data can be reproduced using the lineage. The system can recover from a failure in less than one second.

Spark Streaming is able to process 100,000 to 500,000 records per node per second. This is much faster than Storm and comparable to other stream processing systems. At Sigmoid, we are able to consume 480,000 records per second per node using Kafka as a source. We are constantly pushing the limits of performance, trying to get the maximum out of the same infrastructure.

Sigmoid worked with a leading supply-chain analytics firm to help them process real-time factory shop floor data using Spark Streaming. The key problem was to digest batch and streaming data from various sources, all within a turnaround time of seconds. They had a challenging requirement of ensuring easy scaling, high availability, and low latency for the system to be used in production. We overcame these challenges to implement a stable and maintainable real-time streaming system.

Spark Streaming is the best available streaming platform and can reach sub-second latency. Its processing capability scales linearly with the size of the cluster. Spark Streaming is used in production by many organizations, including Netflix, Cisco, Datastax, and more.

Thanks to Matei Zaharia, Tathagata Das, and other committers for open sourcing Spark under the Apache license.

Leave a comment and let us know your thoughts on Spark Streaming and how you could benefit from using it.

A version of this article originally appeared on the SIGMOID blog.


About the author

Arush Kharbanda - I work as a Technical Team Lead at Sigmoid. Sigmoid provides an end-to-end framework using Apache Spark to process large amounts of data in real time, and it is an active contributor to the Apache Pig community through its Pig on Apache Spark project. I have more than two years of experience in Big Data systems using Apache Spark.