Created on 07-11-2016 03:03 PM
For a long time, when there was a big job to do, people relied on horses, whether the job required heavy pulling, speed, or anything in between. However, not all horses were fit for every task. Certain breeds were valued for their incredible speed and endurance, especially when an important letter had to be delivered. Others were prized for their ability to carry and pull large payloads, whether a fully armored knight or huge stone blocks. Today we rarely rely on horses and depend much more on technology to get many of the same kinds of jobs done. Very high volume streaming data is increasingly common in all lines of business because of the value and utility that it often carries. Horses will be of little help with this type of workload, but luckily there is a whole host of tools to deal with streaming data. However, just like with horses, choosing the right streaming tool for a particular use case is critical to the success of the project.
Consider a use case where a directory full of log files needs to be broken down into individual events, which are then filtered/altered, turned into JSON, and sent on to a Kafka topic. This use case has exactly the kind of requirements that Apache NiFi was designed for. All you would have to do is string together the ListHDFS, FetchHDFS, SplitText, ExtractText, AttributesToJSON, and finally PutKafka processors. This NiFi flow would distribute each file in the target directory across the NiFi cluster, extract/alter the events, output each event as JSON, and send them to a Kafka topic. Notice that not a single line of code was required to solve the use case. The same use case can be solved using Spark Streaming, but doing so would require a lot of code and an intricate understanding of how and where Spark stores and processes DStreams. This article:
http://allegro.tech/2015/08/spark-kafka-integration.html
does a great job of outlining how to achieve the same result, but the solution required several iterations by a team of engineers familiar with Spark Streaming. The article explains the importance of understanding which instructions will execute on the driver and which will execute on the executors. It also describes an elegant approach that uses a factory pattern to distribute uninstantiated Kafka producer templates and explains how to make sure that the templates are only instantiated by the executors, thus avoiding the dreaded "Not Serializable" exception. That is a lot of work to solve such a basic use case.
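To give a feel for the factory idea described above, here is a minimal sketch in plain Python. It is not the code from the linked article and it uses neither Spark nor a real Kafka client. FakeProducer, LazyProducer, and send_partition are hypothetical names invented for this illustration, and the "driver" and "executor" roles are only simulated. The point it tries to show is that the object being passed around holds a recipe for a producer, and the producer itself is built only on first use.

```python
from typing import Any, Callable, List, Optional, Tuple


class FakeProducer:
    """Hypothetical stand-in for a Kafka producer. A real producer holds
    sockets and threads, which is why it cannot simply be serialized."""

    instances_created = 0  # counts constructions so the laziness is visible

    def __init__(self, bootstrap_servers: str) -> None:
        FakeProducer.instances_created += 1
        self.bootstrap_servers = bootstrap_servers
        self.sent: List[Tuple[str, str]] = []

    def send(self, topic: str, value: str) -> None:
        self.sent.append((topic, value))


class LazyProducer:
    """Carries only a recipe (factory plus arguments) for building a producer.

    The recipe is cheap to ship around; the producer is built on first use,
    which in a Spark job would happen on an executor, not on the driver."""

    def __init__(self, factory: Callable[..., Any], *args: Any) -> None:
        self._factory = factory
        self._args = args
        self._producer: Optional[Any] = None  # nothing instantiated yet

    def __getstate__(self) -> dict:
        # If the wrapper is pickled, ship the recipe and drop any live producer.
        return {"_factory": self._factory, "_args": self._args, "_producer": None}

    def send(self, topic: str, value: str) -> None:
        if self._producer is None:
            self._producer = self._factory(*self._args)  # first use only
        self._producer.send(topic, value)


def send_partition(events: List[str], producer: LazyProducer, topic: str) -> int:
    """Plays the role of the per-partition function an executor would run."""
    for event in events:
        producer.send(topic, event)
    return len(events)


if __name__ == "__main__":
    # "Driver side": build the template; no producer exists at this point.
    template = LazyProducer(FakeProducer, "broker:9092")
    print("producers before sending:", FakeProducer.instances_created)

    # "Executor side": the first send() triggers the one and only construction.
    send_partition(['{"level": "INFO"}', '{"level": "WARN"}'], template, "logs")
    print("producers after sending:", FakeProducer.instances_created)
```

In a real job the wrapper would be referenced from inside the per-partition function, so that each executor ends up building its own producer. Even in this reduced form, the pattern takes some care to get right, which is the point the paragraph above makes.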
Spark is one of the leading tools for complex computation and aggregation on very large volumes of data. It is extremely well suited for machine learning, time series/stateful calculation, aggregation, graph processing, and iterative computation. However, due to its highly distributed nature, simple event processing that only requires event routing, data transformation, data enrichment, and data orchestration is harder to achieve. Conversely, Apache NiFi is not the right tool to solve most of the complex computation and processing use cases. This is because it was designed to reduce the amount of effort required to create, manipulate, and control streams of events as dynamically as possible. It is a distributed framework with a graphical UI that is easy to extend and can perform most of the simple event processing tasks out of the box with very little effort or prior experience required. However, there are many enterprise-class use cases that are best solved by using both Spark and NiFi together.
Consider one more example where a large organization dealing with millions of IoT-enabled devices across the country needs to apply predictive algorithms on aggregated data in near real time. They will need all of the event data to eventually arrive at two or three (in some cases one) processing centers in order to achieve their goals. At the same time, they need to make sure that events lost to outages are kept to a minimum. Such an organization will have many smaller data centers with small infrastructure footprints throughout the country. It is not practical to put a Spark Streaming cluster in those data centers, nor is it safe to just point all of the devices at one or two data centers with heavy processing footprints. One possible approach is to run NiFi at the smaller data centers, spread across a handful of servers, to capture and curate the local event streams. Those streams can then be forwarded as cleaned and secured data sources to the two or three large data centers, where large Spark Streaming clusters apply all of the heavy and complex processing. This approach would address the need for failure group reduction/isolation, minimize the effort required to manage the distributed data collection infrastructure, allow dynamic updates to event manipulation and routing logic, and provide all of the heavy processing capabilities needed to apply the intelligence and reporting that the business requires.
To conclude our metaphor, Apache Spark Streaming is the heavy war horse and Apache NiFi is the quick and nimble race horse. Some workloads are best suited to one, some to the other, and some will require both working together.
Created on 09-12-2017 01:53 PM
Excellent Article! Thanks for sharing your thoughts.