“In meetings philosophy might work, [but in] the field practicality works.” –Amit Kalantri
Short Description:
A quick tutorial on how to query cryptocurrency transactions with Spark Streaming
Abstract:
This tutorial covers how one can use NiFi to stream public cryptocurrency transaction data to Kafka and then query the transaction stream with Spark Streaming. We will use the PySpark API for demonstration purposes. It is strongly recommended that Java or Scala be used for production-grade applications, as most generally available (GA) Spark features target those languages first. Figures are provided as visual aids, and individual steps are highlighted with bullet points.
What is Spark? Why is it well suited for Streaming at scale?
Apache Spark is a distributed processing engine for data science and data engineering at scale. It ships with out-of-the-box libraries for machine learning, MapReduce-style processing, graph analysis, micro batching, and structured query language (SQL). It is well suited to processing streams of data thanks to its language-integrated application programming interface (API) and fault-tolerant behavior: users can write streaming jobs the same way they write batch jobs, and recover both lost work and operator state without extra code, because the Spark team has conveniently encapsulated these concerns in Spark's core API.
With Spark Streaming, one can ingest data from multiple sources (Kafka, Kinesis, Flume, and TCP sockets) and process data with abstracted functions like map, reduce, join, and window. Results can subsequently be pushed to HDFS, databases, and dashboards for meaningful insights.
Simply put, Spark Streaming provides a very convenient way for Hadoop developers to ingest data in real time and persist it to their system of choice.
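To make this concrete, here is a minimal sketch of a PySpark Streaming job. The host, port, window sizes, and word-count logic are placeholders for illustration, not part of this tutorial's application.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro batches

# Ingest a text stream from a TCP socket (Kafka, Kinesis, and Flume sources
# are created in much the same way)
lines = ssc.socketTextStream("localhost", 9999)

# The same map/reduce style you would write for a batch job, plus a window:
# count words over the last 30 seconds, recomputed every 10 seconds
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10))

counts.pprint()  # results could instead be written to HDFS or a database
ssc.start()
ssc.awaitTermination()
```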
What is micro batching?
“Micro batching is defined as the procedure in which the incoming stream of messages is processed by dividing them into group of small batches. This helps to achieve the performance benefits of batch processing; however, at the same time, it helps to keep the latency of processing of each message minimal”. –Apache Spark 2.x for Java Developers, Sumit Kumar & Sourav Gulati, O’Reilly Media Inc.
In Spark Streaming, a micro batch is a sequence of resilient distributed datasets (RDDs) packaged into an abstraction known as a Discretized Stream (DStream). Users batch DStreams on a time interval, with batch intervals typically ranging from 1 to 10 seconds. Choosing the batch size depends on the needs of the downstream consumer and how quickly it can write the batches to disk or present the information in a web browser.
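The sketch below, with a placeholder host and port, shows how the batch interval set on the StreamingContext turns an incoming stream into micro batches, each of which arrives as an ordinary RDD.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="MicroBatchDemo")
ssc = StreamingContext(sc, batchDuration=5)  # one micro batch (RDD) every 5 seconds

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source

# Each micro batch arrives as an ordinary RDD, so it can be handled with batch code
def handle_batch(batch_time, rdd):
    print("Batch generated at %s holds %d records" % (batch_time, rdd.count()))

lines.foreachRDD(handle_batch)

ssc.start()
ssc.awaitTermination()
```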
Where are we getting our data?
We will subscribe to a Satori public data channel for global cryptocurrency transactions. According to Satori, one can:
“Get streaming data for cryptocurrency market prices as well as exchange rates between alternative cryptocurrencies. Data is streamed from exchanges around the globe, where crypto-coins are traded in the US Dollar, Canadian Dollar, British Pound, Japanese Yen, Russian Ruble, etc.” -https://www.satori.com/livedata/channels/cryptocurrency-market-data
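As a preview of where we are headed, here is a hedged sketch of how the Kafka topic fed by NiFi might be queried from PySpark. The topic name, broker address, and the assumption that each Kafka message value is a JSON transaction are illustrative only; the actual channel payload and the topic you configure may differ. The Kafka direct stream also requires the matching spark-streaming-kafka package on the classpath when submitting the job.

```python
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Kafka direct stream (Spark 2.x)

sc = SparkContext(appName="CryptoTransactionStream")
ssc = StreamingContext(sc, batchDuration=5)

# Hypothetical topic and broker; substitute the values from your HDF cluster
kafka_stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["cryptocurrency-transactions"],
    kafkaParams={"metadata.broker.list": "localhost:6667"})

# Each record is a (key, value) pair; the value is assumed to hold the
# transaction JSON published by NiFi
transactions = kafka_stream.map(lambda kv: json.loads(kv[1]))

# Print a sample of parsed transactions from every micro batch
transactions.pprint(5)

ssc.start()
ssc.awaitTermination()
```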
Note:
This tutorial assumes the user has successfully installed Spark 2 on HDP, and NiFi and Kafka on HDF. It also assumes some familiarity with NiFi and Kafka, although this isn't required to complete the tutorial. For detailed instructions on how to accomplish these steps, please consult the HDP and HDF user documentation.
One can obtain the application logic for this tutorial by cloning my GitHub repository. Feel free to contribute by forking the repo and submitting pull requests.