How to process real-time data with Apache tools

Open source is leading the way with a rich canvas of projects for processing real-time events.

Image by:

Matteo Ianeselli. Modified by Opensource.com. CC-BY-3.0.

In the "always-on" future with billions of connected devices, storing raw data for analysis later will not be an option because users want accurate responses in real time. Prediction of failures and other context-sensitive conditions require data to be processed in real time—certainly before it hits a database.

It's tempting to simply say "the cloud will scale" to meet demands to process streaming data in real time, but some simple examples show that it can never meet the need for real-time responsiveness to boundless data streams. In these situations—from mobile devices to IoT—a new paradigm is needed. Whereas cloud computing relies on a "store then analyze" big data approach, there is a critical need for software frameworks that are comfortable instantly processing endless, noisy, and voluminous streams of data as they arrive to permit a real-time response, prediction, or insight.

For example, the city of Palo Alto, Calif. produces more streaming data from its traffic infrastructure per day than the Twitter Firehose. That's a lot of data. Predicting city traffic for consumers like Uber, Lyft, and FedEx requires real-time analysis, learning, and prediction. Event processing in the cloud leads to an inescapable latency of about half a second per event.

We need a simple yet powerful programming paradigm that lets applications process boundless data streams on the fly in these and similar situations:

Data volumes are huge, or moving raw data is expensive.
Data is generated by widely distributed assets (such as mobile devices).
Data is of ephemeral value, and analysis can't wait.
It is critical to always have the latest insight, and extrapolation won't do.

Publish and subscribe

A key architectural pattern in the domain of event-driven systems is the concept of pub/sub or publish/subscribe messaging. This is an asynchronous communication method in which messages are delivered from publishers (anything producing data) to subscribers (applications that process data). Pub/sub decouples arbitrary numbers of senders from an unknown set of consumers.

In pub/sub, sources publish events for a topic to a broker that stores them in the order in which they are received. An application subscribes to one or more topics, and the broker forwards matching events. Apache Kafka and Pulsar and CNCF NATS are pub/sub systems. Cloud services for pub/sub include Google Pub/Sub, AWS Kinesis, Azure Service Bus, Confluent Cloud, and others.

Pub/sub systems do not run subscriber applications—they simply deliver data to topic subscribers.

Streaming data often contains events that are updates to the state of applications or infrastructure. When choosing an architecture to process data, the role of a data-distribution system such as a pub/sub framework is limited. The "how" of the consumer application lies beyond the scope of the pub/sub system. This leaves an enormous amount of complexity for the developer to manage. So-called stream processors are a special kind of subscriber that analyzes data on the fly and delivers results back to the same broker.

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. Often, Apache Spark Streaming is used as a stream processor, for example, to feed machine learning models with new data. Spark Streaming breaks data into mini-batches that are each independently analyzed by a Spark model or some other system. The stream of events is grouped into mini-batches for analysis, but the stream processor itself must be elastic:

The stream processor must be capable of scaling with the data rate, even across servers and clouds, and also balance load across instances, ensuring resilience and other application-layer needs.
It must be able to analyze data from sources that report at widely different rates, meaning it must be stateful—or store state in a database. This latter approach is often used when Spark Streaming is used as the stream processor and can cause performance problems when ultra-low latency responses are needed.

A related project, Apache Samza, offers a way to process real-time event streams, and to scale elastically using Hadoop Yarn or Apache Mesos to manage compute resources.

Solving the problem of scaling data

It's important to note that even Samza cannot entirely alleviate data processing demands for the application developer. Scaling data rates mean that tasks to process events need to be load-balanced across many instances, and the only way to share the resulting application-layer state between instances is to use a database. However, the moment state coordination between tasks of an application devolves to a database, there is an inevitable knock-on effect upon performance. Moreover, the choice of database is crucial. As the system scales, cluster management for the database becomes the next potential bottleneck.

This can be solved with alternative solutions that are stateful, elastic, and can be used in place of a stream processor. At the application level (within each container or instance), these solutions build a stateful model of concurrent, interlinked "web agents" on the fly from streaming updates. Agents are concurrent "nano-services" that consume raw data for a single source and maintain their state. Agents interlink to share state based on real-world relationships between sources found in the data, such as containment and proximity. Agents thus form a graph of concurrent services that can analyze their own state and the states of agents to which they are linked. Each agent provides a nano-service for a single data source that converts from raw data to state and analyzes, learns, and predicts from its own changes and those of its linked subgraph.

These solutions simplify application architecture by allowing agents—digital twins of real-world sources—to be widely distributed, even while maintaining the distributed graph that interlinks them at the application layer. This is because the links are URLs that map to the current runtime execution instance of the solution and the agent itself. In this way, the application seamlessly scales across instances without DevOps concerns. Agents consume data and maintain state. They also compute over their own state and that of other agents. Because agents are stateful, there is no need for a database, and insights are computed at memory speed.

Reading world data with open source

There is a sea change afoot in the way we view data: Instead of the database being the system of record, the real world is, and digital twins of real-world things can continuously stream their state. Fortunately, the open source community is leading the way with a rich canvas of projects for processing real-time events. From pub/sub, where the most active communities are Apache Kafka, Pulsar, and CNCF NATS, to the analytical frameworks that continually process streamed data, including Apache Spark, Flink, Beam, Samza, and Apache-licensed SwimOS and Hazelcast, developers have the widest choices of software systems. Specifically, there is no richer set of proprietary software frameworks available. Developers have spoken, and the future of software is open source.

An introduction to Apache Hadoop for big data

Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project…