
Storm exactly-once semantics

Super Collaborator

Hi,

I have a question about Storm Trident's exactly-once semantics and how it would behave in the following scenario:

Suppose I have a topology with three sinks: a Kafka topic, an HBase table, and an HDFSBolt. When a Trident batch is written to Kafka and HBase, you have strong guarantees about whether the writes were actually acked or not. But for writes to HDFS you don't have that.

So do HDFSBolts offer the very same strong exactly-once guarantee? What scenarios would, or could, result in Trident batches being written twice to HDFS? Or is this a negligible risk? I need to know whether there is any reason to build deduplication logic on the data that lands in HDFS via the Storm bolt.

1 ACCEPTED SOLUTION

Cloudera Employee
@Jasper

Trident HDFS state does provide an exactly-once guarantee, and de-duplication is taken care of. If a batch is replayed by Trident (due to failures), the Trident state implementation automatically removes duplicates from the current file by copying the data up to the last completed batch into another file.

Since this operation involves a lot of data copying, make sure the data files are rotated at reasonable sizes with FileSizeRotationPolicy and at reasonable intervals with TimedRotationPolicy, so that recovery can complete within topology.message.timeout.secs.
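
For reference, here is a minimal sketch of how a Trident HDFS state with a size-based rotation policy is typically wired up, based on the storm-hdfs Trident API (Storm 1.x package names). The path, field names, rotation size, and filesystem URL are placeholders, and `stream` is assumed to be an existing Trident Stream in your topology:

```java
import org.apache.storm.hdfs.trident.HdfsState;
import org.apache.storm.hdfs.trident.HdfsStateFactory;
import org.apache.storm.hdfs.trident.HdfsUpdater;
import org.apache.storm.hdfs.trident.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.trident.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.trident.format.FileNameFormat;
import org.apache.storm.hdfs.trident.format.RecordFormat;
import org.apache.storm.hdfs.trident.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.trident.rotation.FileSizeRotationPolicy;
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.state.StateFactory;
import org.apache.storm.tuple.Fields;

// Tuple fields to write to HDFS (placeholder names).
Fields hdfsFields = new Fields("sentence", "key");

// Where and how output files are named on HDFS.
FileNameFormat fileNameFormat = new DefaultFileNameFormat()
        .withPath("/trident")
        .withPrefix("trident")
        .withExtension(".txt");

// Write tuples as delimited text records.
RecordFormat recordFormat = new DelimitedRecordFormat()
        .withFields(hdfsFields);

// Keep files small enough that the replay-time copy of completed
// batches can finish within topology.message.timeout.secs.
FileRotationPolicy rotationPolicy =
        new FileSizeRotationPolicy(5.0f, FileSizeRotationPolicy.Units.MB);

HdfsState.Options options = new HdfsState.HdfsFileOptions()
        .withFileNameFormat(fileNameFormat)
        .withRecordFormat(recordFormat)
        .withRotationPolicy(rotationPolicy)
        .withFsUrl("hdfs://namenode:8020"); // placeholder NameNode URL

StateFactory factory = new HdfsStateFactory().withOptions(options);

// partitionPersist drives the batch-by-batch state updates that make
// the exactly-once / de-duplication behavior described above possible.
TridentState state = stream.partitionPersist(
        factory, hdfsFields, new HdfsUpdater(), new Fields());
```

A TimedRotationPolicy from the same package can be used in place of (or alongside) the size-based policy if you want rotation at fixed intervals instead.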

