Created 01-12-2017 04:21 PM
Hi,
I have a question about Storm Trident's exactly-once semantics and how it would behave in the following scenario:
Suppose I have a topology with three sinks: a Kafka topic, an HBase table, and an HDFSBolt. When a Trident batch is written to Kafka and HBase, you have strong guarantees about whether the writes were actually acked or not. But for writes to HDFS you don't have that.
So do HDFSBolts offer the same strong exactly-once guarantee? What scenarios could result in Trident batches being written twice to HDFS, or is this a negligible risk? I need to know whether there is any reason to build deduplication logic on top of the data that lands in HDFS via the Storm bolt.
Created 01-19-2017 03:13 AM
Trident HDFS State does provide an exactly-once guarantee, and de-duplication is taken care of. If a batch is replayed by Trident (due to failures), the Trident state implementation automatically removes duplicates from the current file by copying the data up to the last completed batch into another file.
Since this operation involves a lot of data copying, make sure the data files are rotated at reasonable sizes with FileSizeRotationPolicy and at reasonable intervals with TimedRotationPolicy, so that recovery can complete within topology.message.timeout.secs. A configuration sketch is shown below.
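For reference, here is a minimal sketch of wiring a Trident stream into HDFS state with a rotation policy, assuming the storm-hdfs module's Trident API (HdfsStateFactory, HdfsState.HdfsFileOptions, HdfsUpdater); the field names, output path, and NameNode URL are placeholders:

```java
import org.apache.storm.trident.Stream;
import org.apache.storm.trident.state.StateFactory;
import org.apache.storm.tuple.Fields;
import org.apache.storm.hdfs.trident.HdfsState;
import org.apache.storm.hdfs.trident.HdfsStateFactory;
import org.apache.storm.hdfs.trident.HdfsUpdater;
import org.apache.storm.hdfs.trident.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.trident.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.trident.format.FileNameFormat;
import org.apache.storm.hdfs.trident.format.RecordFormat;
import org.apache.storm.hdfs.trident.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.trident.rotation.FileSizeRotationPolicy;

public class HdfsStateExample {

    // Attach an HDFS Trident state to a stream; "stream" and the field names are placeholders.
    public static void attachHdfsState(Stream stream) {
        Fields hdfsFields = new Fields("key", "value");

        // How files are named under the output directory.
        FileNameFormat fileNameFormat = new DefaultFileNameFormat()
                .withPath("/trident/out")           // placeholder output directory
                .withPrefix("trident")
                .withExtension(".txt");

        RecordFormat recordFormat = new DelimitedRecordFormat()
                .withFields(hdfsFields);

        // Keep files small enough that the duplicate-removal copy on replay
        // can finish within topology.message.timeout.secs.
        // Alternative: org.apache.storm.hdfs.trident.rotation.TimedRotationPolicy
        // to rotate on a time interval instead of size.
        FileRotationPolicy rotationPolicy =
                new FileSizeRotationPolicy(64.0f, FileSizeRotationPolicy.Units.MB);

        HdfsState.Options options = new HdfsState.HdfsFileOptions()
                .withFileNameFormat(fileNameFormat)
                .withRecordFormat(recordFormat)
                .withRotationPolicy(rotationPolicy)
                .withFsUrl("hdfs://namenode:8020"); // placeholder NameNode URL

        StateFactory factory = new HdfsStateFactory().withOptions(options);

        // partitionPersist writes each Trident batch through the state,
        // which is what enables the replay/de-duplication behavior described above.
        stream.partitionPersist(factory, hdfsFields, new HdfsUpdater(), new Fields());
    }
}
```

The 64 MB size here is only an example; the point is that the rotation threshold bounds how much data has to be copied when a replayed batch triggers de-duplication.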