Community Articles
Find and share helpful community-sourced technical articles
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.
Rising Star

The objective of this article is to have a discussion and capture critical points for consideration when developing a Streaming Application using Storm, Spark or any other streaming technology. Intent is to add additional points from discussion and comments to evolve this article into a comprehensive guideline for Stream Engineering.

Here are a few points synthesized from my experience developing streaming applications using Storm and Flume:

1. Select Streaming Application’s SLAs requirements

Most Streaming Applications can be classified based on their SLA requirements into:

  • At-most once (data drops are acceptable)
  • At-least once (data drops are NOT acceptable)
  • Exactly once (idempotent computations)

These requirements tend to heavily drive design requirements like whether or not to ack tuples (in Storm), or if downsampling is acceptable (sensor data processing)

2. Minimal and careful data replays

In Guaranteed Processing SLA use cases, data replay must be minimized by application logic to avoid situations like replay loops and heavy back pressure. This situation is seen in poorly designed topologies where incorrect exception handling leads to a single datapoint being infinitely replayed.

3. Minimize processing latencies

Latencies for processing individual events/tuples adds up to the cumulative processing latencies therefore streaming application must be engineered for low latency using appropriate technologies for local in-memory caching and micro-batching where possible to minimize network latencies and amortize the cost over several calls.

4. Tradeoffs between Throughput and Latency

Performance tradeoffs usually are between Throughput Vs. Latency and can be tuned using Micro-batching (Storm Trident or Spark Streaming). Micro-batching approach may use time-based or size-based batches both of which have some caveats.

Size-based micro-batching may add latencies since the buffer must fill up before processing applied or transfers are executed and is therefore subject to event velocity. This model can't be used if there are hard latencies limits for the application however, if a sustained minimum event rate (velocity) is guaranteed then micro-batching can be applied while preserving acceptable latencies.

Time-based micro-batching is subject to stability and performance issues if spikes (volume and velocity) are not accounted. Time-based micro-batching can satisfy hard-latency constraints of a use case.

A hybrid model of micro-batching can also be deployed which uses both size based and time based batching and has hard limits on both to guarantee low latencies and high throughput while providing stability.

5. Aggregation Bottlenecks

Use cases where Streaming Aggregations are performed must

  • account for event volume at the pivot point of aggregation (aggregation key) to avoid bottlenecks and pipeline back pressures
  • account for the distribution of event volume across aggregation keys

6. Polling and Event Sourcing

Polling and Event Sourcing are two prominent design patterns for updating configuration and logic in a Streaming pipeline. Logic may include but is not limited to: Dynamic Filtering, Caching for data enrichment and Machine Learning Models.

In Polling based design these updates are polled for at a pre-determined frequency in an out-of-band fashion (separate update thread).

In Event Sourcing based design these updates are delivered to the processing component as Event and the processing component has a separate code branch (if statement) to handle these kinds of events and execute logic update.

Using Event Souring approach allows for a lock-free design whereas Polling based design requires locking (at some point) and concurrent data structures to be used to guarantee consistency.

7. Error Handling

Error Handling must be given thorough design thought as incorrect error handling will lead to application downtime and performance issues. Errors that are recoverable may allow for replays like network unavailability or failover, however unrecoverable errors must be written to a separate stream without blocking the primary execution pipeline. NiFi handles this very seamlessly using Success and Failure streams.

796 Views
Comments

@ambud.sharma

Could you develop more on point #4? Latency measures the amount of time between the start of an action and its completion, throughput is the total number of such actions that occur in a given amount of time. I doubt that lower latency is responsible for throughput decrease in a direct correlation. There is more to it. Based on #4, NYSE trading is better running in a batch mode. I believe you forget some other factor in the correlation, for example RESOURCES as a given constant, or single tuple overhead, etc. Please elaborate to increase the value of this article.

Rising Star

@Constantin Stanca thank you for the feedback. I have reworded the point and added some explanations; please let me know your thoughts.

@ambud.sharma

Voted up :). Before, it was counter-intuitive.

Don't have an account?
Coming from Hortonworks? Activate your account here
Version history
Revision #:
1 of 1
Last update:
‎11-19-2016 09:24 PM
Updated by:
 
Contributors
Top Kudoed Authors