Enterprises are awash in data, and many are tempted to save it all for later analysis; after all, that approach worked for Google for years. But store-then-analyze is poorly suited to environments whose data sources never stop. Storing everything is costly when the data is vast or effectively boundless, and getting it to the storage infrastructure and then processing it fast enough become major challenges of their own. Both bandwidth and storage costs add up, leaving architects with a further puzzle: how to process the data to get useful insights.
Real-time analysis, performed as the data flows, puts the emphasis on analyze first, then store the insights. It is crucial when the following conditions apply:
- Data volumes are large, and moving raw data is expensive
- Data is generated by widely distributed assets (for example, mobile devices)
- Data is of ephemeral value (as is typical), and analysis can't wait
- Quick analysis that delivers durable insights matters more than keeping the raw data
- It is important to always have the current insight; an extrapolation from the last batch run won't do
Use cases for which "analyze first" is imperative include predicting failures on assembly lines, predicting traffic flows in cities or demand on power grids, detecting hackers in a critical infrastructure network, and understanding the state of each mobile handset for customer care. These use cases are characterized by a need to know now, and they require real-time processing of streaming data. The new goal, then, is real-time stream processing in which analysis, learning and prediction happen on the fly, with continuous insights streamed instead of (or in addition to) the raw data. Real-time insights let applications respond immediately rather than waiting for the next batch run. This also needs to be easy: accessible to every organization, and requiring only modest application development skills in popular languages like Java, Python and JavaScript. We need solutions for real-time streaming insights that are available to the masses.
Smart Dataflow Pipelines
Streaming data carries a continuous stream of changes to real-world assets, systems, accounts, infrastructure and even people. Delivering insights continuously demands an architecture in which streaming data is processed continuously, both to permit a real-time response and to keep storage and networks from overflowing with data of ephemeral value. What's needed is an architecture that analyzes and acts first, then stores, and to achieve this the processing and analysis must be performed in the data pipeline itself. We think there's a simpler way to get insights: moving analysis into the data flow, something that is being called a smart dataflow pipeline.
Streaming architectures today use centralized (but clustered) analysis, so it is worth comparing smart dataflow pipelines to the current approach. That approach generally relies on Spark Streaming, and the data must first be transported to the Spark platform and application over a data pipeline of some sort, such as Kafka or Pulsar.
Today’s Approach: Spark Streaming
There are three challenges that developers must overcome to successfully use Spark to process streaming data:
- State matters, not data: Streaming environments never stop producing data, typically real-world events, but analysis depends on the meaning of those events, or the state changes the data represents. Spark (and Spark Streaming) applications operate on raw data, so voluminous data must be transported to the Spark cluster, and the application must then sift through it to identify the relevant state changes before any analysis can happen. Since Spark is stateless, this state must be saved in a database.
- Building applications is complex: The application developer must explicitly manage the data-to-state conversion, state storage, and the distribution of analysis tasks across the Spark cluster. The result is complex and difficult to maintain and understand. Often the model of the real world is deeply ingrained in the application and relies on deep domain expertise, and such skillsets are in short supply.
- Infrastructure headaches: Spark platforms are centralized (but clustered), and application developers are intimately involved in distributing computation within the cluster. A data pipeline delivers raw data to the cluster, where both the data-to-state conversion and the analysis happen.
In short, processing streaming data with Spark demands real programming expertise, forces raw data to be transported and stored before it can be analyzed (driving up storage costs and adding delay), and ties everything to centralized processing and analysis, which limits flexibility. The sketch below illustrates how much of this wiring falls to the developer.
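To make the pattern concrete, here is a minimal PySpark Structured Streaming sketch of the transport-then-derive-state flow described above. The Kafka topic, broker address and event fields (device_id, temperature, event_time) are hypothetical, and the console sink stands in for the external store the derived state would normally be written to.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("raw-events-to-state").getOrCreate()

# Hypothetical event layout for this example.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])

# 1. Every raw event is shipped to the cluster over the pipeline (Kafka here).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "sensor-events")
       .load())

# 2. The application, not the platform, turns raw bytes into meaningful state.
events = (raw
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

state = (events
         .withWatermark("event_time", "1 minute")
         .groupBy(window(col("event_time"), "1 minute"), col("device_id"))
         .agg(avg("temperature").alias("avg_temperature")))

# 3. The derived per-device state must still be pushed to an external store
#    for later lookup; the console sink is a stand-in for that store.
query = (state.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```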
Towards Smart Dataflow Pipelines
Delivering effective analysis requires decentralizing processing and analysis, and in doing so, using edge processing as much as possible. Success also requires simplifying application development. To achieve these goals and create smart dataflow pipelines, we borrow key concepts from the reactive paradigm: stateful processing by concurrent actors, or digital twins.
In this new scenario, each real-world "thing" has a concurrent digital twin that consumes its raw data and represents the state of the "thing." The collection of actors mirrors the state of the real-world assets. This approach offloads raw data processing to the digital twins, which can run in a distributed, edge-based environment. The transition from data to state therefore happens cheaply, at the edge of the network, because the data never has to be written to storage and then read back out to be processed.
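As a rough illustration (not any particular vendor's API), here is a tiny, framework-free Python sketch of such a digital twin: a stateful object that consumes its own raw events at the edge, keeps only the current state, and signals only when that state actually changes. The class and field names are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class TrafficLightTwin:
    """A stateful stand-in for one real-world data source."""
    light_id: str
    state: Dict[str, Any] = field(default_factory=dict)

    def on_event(self, event: Dict[str, Any]) -> bool:
        """Consume one raw event; return True only if the state actually changed."""
        changed = False
        for key, value in event.items():
            if self.state.get(key) != value:
                self.state[key] = value
                changed = True
        return changed

# Raw readings stay at the edge; only meaningful state changes need to move on.
twin = TrafficLightTwin("light-17")
print(twin.on_event({"phase": "red", "vehicle_count": 4}))   # True: new state
print(twin.on_event({"phase": "red", "vehicle_count": 4}))   # False: nothing worth forwarding
```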
The next step is to facilitate analysis, and for that we need a model that captures the relationships between data sources. Smart dataflow pipelines do this a little like building a "LinkedIn for things." Here is how it works: the actors that are digital twins of real-world assets inter-link to form a graph, based on real-world relationships discovered in the data, such as containment or proximity. These relationships are easy for the developer to specify (for example, an intersection contains lights, loops and pedestrian buttons, among other things). But data builds the graph: as data sources report their status, they are linked into it.
Like LinkedIn, linked sources share their state changes, making them visible to one another. So, for example, an intersection that links to its neighbors can see their state in real time. Suddenly, analysis becomes possible too: digital twins can analyze their own state and the states of the twins they are linked to, enabling learning and prediction. Note that this is a different approach to learning from training complete models of a whole environment; here each part of the environment (that is, each intersection) predicts for itself.
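Continuing the illustrative Python sketch from above (again, made-up names rather than a specific product API), the fragment below shows twins linking into a graph as relationships appear in the data and then analyzing neighbor state, with a naive average standing in for a learned model.

```python
from statistics import mean
from typing import Dict, List

class IntersectionTwin:
    """A digital twin for one intersection, linked to its real-world neighbors."""

    def __init__(self, intersection_id: str):
        self.intersection_id = intersection_id
        self.state: Dict[str, float] = {}
        self.neighbors: List["IntersectionTwin"] = []

    def link(self, other: "IntersectionTwin") -> None:
        # Relationships discovered in the data (adjacency here) become graph edges.
        if other not in self.neighbors:
            self.neighbors.append(other)
            other.neighbors.append(self)

    def on_event(self, event: Dict[str, float]) -> None:
        # Consume raw readings and keep only the current state.
        self.state.update(event)

    def predict_queue(self) -> float:
        # Each twin predicts for itself from its own state plus the visible
        # state of its linked neighbors; a real system would use a learned model.
        observed = [self.state.get("queue_length", 0.0)]
        observed += [n.state.get("queue_length", 0.0) for n in self.neighbors]
        return mean(observed)

# Data builds the graph: as intersections report, adjacent ones get linked.
a = IntersectionTwin("5th-and-main")
b = IntersectionTwin("6th-and-main")
a.link(b)
a.on_event({"queue_length": 12.0})
b.on_event({"queue_length": 4.0})
print(a.predict_queue())   # 8.0 -- the insight is streamed, not the raw data
```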
Smart dataflow pipelines have an architecture that is naturally distributed, and perform analysis much closer to where the actual data is generated. Each digital twin statefully evolves and analyzes its own state and the state of its linked “neighbors,” and then streams its insights in real time.
In today’s connected world, where we are awash in data just waiting to be turned into business insights, it is time to look for new technologies that can actually help achieve true, real-time analysis, and do it at a price point that businesses of any size can afford. We need new technologies that enable accurate data analysis for the masses.
Simon Crosby is chief technology officer at SWIM.AI, which has a platform that does machine learning at the edge. Prior to this, Crosby was chief technology officer at virtualization security vendor Bromium and had the same title at the Data Center and Cloud division of Citrix Systems thanks to his co-founding of XenSource, the commercial entity behind the Xen hypervisor, which was created in 2003 and acquired by Citrix in 2007.