Created 01-12-2017 04:21 PM
Hi,
I have a question about Storm Trident's exactly-once semantics and how it would behave in the following scenario:
Suppose I have a topology with three sinks: a Kafka topic, an HBase table, and an HDFSBolt. When a Trident batch is written to Kafka and HBase, you have strong guarantees about whether the writes were actually acked or not. But for writes to HDFS you don't have that.
So do HDFSBolts offer the same strong exactly-once guarantee? What scenarios could result in Trident batches being written twice to HDFS, or is this a negligible risk? I need to know whether there is any reason to build deduplication logic on top of the data that lands in HDFS via the Storm bolt.
Created 01-19-2017 03:13 AM
Trident HDFS State does provide an exactly-once guarantee, and de-duplication is taken care of. If a batch is replayed by Trident (due to failures), the Trident state implementation automatically removes duplicates from the current file by copying the data up to the last completed batch into another file.
Since this operation involves a lot of data copying, make sure the data files are rotated at reasonable sizes with FileSizeRotationPolicy and at reasonable intervals with TimedRotationPolicy, so that recovery can complete within topology.message.timeout.secs. A configuration sketch is shown below.
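For reference, here is a minimal sketch of wiring a Trident stream into HDFS state with a rotation policy, assuming the storm-hdfs module's Trident API (HdfsStateFactory, HdfsState.HdfsFileOptions, HdfsUpdater); the field names, output path, and NameNode URL are placeholders:

```java
import org.apache.storm.trident.Stream;
import org.apache.storm.trident.state.StateFactory;
import org.apache.storm.tuple.Fields;
import org.apache.storm.hdfs.trident.HdfsState;
import org.apache.storm.hdfs.trident.HdfsStateFactory;
import org.apache.storm.hdfs.trident.HdfsUpdater;
import org.apache.storm.hdfs.trident.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.trident.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.trident.format.FileNameFormat;
import org.apache.storm.hdfs.trident.format.RecordFormat;
import org.apache.storm.hdfs.trident.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.trident.rotation.FileSizeRotationPolicy;

public class HdfsStateExample {

    // Attach an HDFS Trident state to a stream; "stream" and the field names are placeholders.
    public static void attachHdfsState(Stream stream) {
        Fields hdfsFields = new Fields("key", "value");

        // How files are named under the output directory.
        FileNameFormat fileNameFormat = new DefaultFileNameFormat()
                .withPath("/trident/out")           // placeholder output directory
                .withPrefix("trident")
                .withExtension(".txt");

        RecordFormat recordFormat = new DelimitedRecordFormat()
                .withFields(hdfsFields);

        // Keep files small enough that the duplicate-removal copy on replay
        // can finish within topology.message.timeout.secs.
        // Alternative: org.apache.storm.hdfs.trident.rotation.TimedRotationPolicy
        // to rotate on a time interval instead of size.
        FileRotationPolicy rotationPolicy =
                new FileSizeRotationPolicy(64.0f, FileSizeRotationPolicy.Units.MB);

        HdfsState.Options options = new HdfsState.HdfsFileOptions()
                .withFileNameFormat(fileNameFormat)
                .withRecordFormat(recordFormat)
                .withRotationPolicy(rotationPolicy)
                .withFsUrl("hdfs://namenode:8020"); // placeholder NameNode URL

        StateFactory factory = new HdfsStateFactory().withOptions(options);

        // partitionPersist writes each Trident batch through the state,
        // which is what enables the replay/de-duplication behavior described above.
        stream.partitionPersist(factory, hdfsFields, new HdfsUpdater(), new Fields());
    }
}
```

The 64 MB size here is only an example; the point is that the rotation threshold bounds how much data has to be copied when a replayed batch triggers de-duplication.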