Storm exactly-once semantics
Labels: Apache Hadoop, Apache Storm
Created ‎01-12-2017 04:21 PM
Hi,

I have a question about Storm Trident's exactly-once semantics and how it behaves in the following scenario:

Suppose I have a topology with three sinks: a Kafka topic, an HBase table, and an HDFSBolt. When a Trident batch is written to Kafka or HBase, you get strong guarantees about whether the writes were actually acked. But for writes to HDFS you don't have that.

So do HDFSBolts offer the same strong exactly-once guarantee? What scenarios would/could result in Trident batches being written twice to HDFS? Or is this a negligible risk? I need to know whether there is any reason to build deduplication logic for the data that lands in HDFS via the Storm bolt.
Created ‎01-19-2017 03:13 AM
Trident HDFS state does provide an exactly-once guarantee, and de-duplication is taken care of for you. If a batch is replayed by Trident (due to failures), the Trident state implementation automatically removes the duplicates from the current file by copying the data up to the last completed batch into another file.

Since this operation involves a lot of data copying, make sure the data files are rotated at reasonable sizes with FileSizeRotationPolicy and at reasonable intervals with TimedRotationPolicy, so that recovery can complete within topology.message.timeout.secs.
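For reference, here is a minimal sketch of wiring up Trident's HDFS state with a size-based rotation policy, based on the storm-hdfs module. The path, file prefix, field names, and NameNode URL are placeholders for your own setup, and exact package names may differ slightly between Storm versions:

```java
import org.apache.storm.hdfs.trident.HdfsState;
import org.apache.storm.hdfs.trident.HdfsStateFactory;
import org.apache.storm.hdfs.trident.HdfsUpdater;
import org.apache.storm.hdfs.trident.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.trident.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.trident.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.trident.rotation.FileSizeRotationPolicy;
import org.apache.storm.trident.Stream;
import org.apache.storm.trident.state.StateFactory;
import org.apache.storm.tuple.Fields;

public class HdfsStateExample {

    /** Attaches exactly-once HDFS state to a Trident stream. */
    public static void attachHdfsState(Stream stream) {
        Fields hdfsFields = new Fields("sentence", "count"); // placeholder fields

        // Rotate files at ~5 MB so the replay-time duplicate removal
        // (copying data up to the last completed batch) stays cheap
        // and finishes within topology.message.timeout.secs.
        FileRotationPolicy rotationPolicy =
                new FileSizeRotationPolicy(5.0f, FileSizeRotationPolicy.Units.MB);

        HdfsState.Options options = new HdfsState.HdfsFileOptions()
                .withFileNameFormat(new DefaultFileNameFormat()
                        .withPath("/trident")       // placeholder output dir
                        .withPrefix("trident")
                        .withExtension(".txt"))
                .withRecordFormat(new DelimitedRecordFormat()
                        .withFields(hdfsFields))
                .withRotationPolicy(rotationPolicy)
                .withFsUrl("hdfs://namenode:8020"); // placeholder NameNode URL

        StateFactory factory = new HdfsStateFactory().withOptions(options);

        // partitionPersist ties batch commits to the HDFS state, which is
        // what gives the exactly-once behavior on replay.
        stream.partitionPersist(factory, hdfsFields, new HdfsUpdater(), new Fields());
    }
}
```

A TimedRotationPolicy can be swapped in for the FileSizeRotationPolicy if you prefer interval-based rotation; either way, the key point is to keep individual files small enough that the recovery copy is fast.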
