Member since
11-07-2016
58
Posts
26
Kudos Received
6
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
1584 | 05-17-2017 04:57 PM | |
4633 | 03-17-2017 06:51 PM | |
2105 | 01-14-2017 07:03 PM | |
3294 | 01-14-2017 06:59 PM | |
1765 | 12-29-2016 06:45 PM |
01-18-2017
06:35 PM
Hi @Joby Johny, The quickest way to fix it is use HDP 2.5 instead of 2.4 Just reviewed the code for the topology and it uses Storm 1.0.1 which is incompatible with Storm 0.10.0 of HDP 2.4
... View more
01-16-2017
11:59 PM
@Jon Maestas Please accept the answer if this answered your questions.
... View more
01-14-2017
07:03 PM
Yes, this error is due to Storm version mismatch, seems like you are trying to run 2.5 code on 2.4 and the topology is already deployed.
... View more
01-14-2017
06:59 PM
2 Kudos
Hi @Jon Maestas Answering your questions inline: How does storm handle failed tuples? When you are using at least once processing (acking and anchoring) is when Storm will handle tuple failures by retries. Retry means re-emitting a tuple from Spout. How many times will storm retry a failed tuple? This depends on the Spout's logic, in case of Kafka Spout for 0.10.x Storm there's the ability for exponential backoff retry (https://github.com/apache/storm/blob/0.10.x-branch/external/storm-kafka/src/jvm/storm/kafka/ExponentialBackoffMsgRetryManager.java) What frequency will storm retry the failed tuple? ExponentialBackoff will determine the frequency. What is the max tuple count a topology can handle between all spouts and bolts? I am guessing you are asking for maximum number of tuples at any given point can be in Storm's buffers? This = Bolt Count * Executor Count * TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE + Bolt Count * Executor Count * TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE You can find out the value of these buffers from Ambari -> Storm -> Config -> Search "buffer" Please note that the above is theoretical maximum, Max Spout Pending (topology.max.spout.pending) throttles the number of in-flight tuples from the Spout. There's also transfer buffers which will add bit more to the above calculated number. Please refer to Michael Noll's blog for more details about Storm Buffers (http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/) Hope this answers your questions.
... View more
01-05-2017
06:23 PM
5 Kudos
General Recommendations - Only fail tuples if there's a recoverable exception - Use an Exception/Error Logging Bolt/Stream and send the output to a Kafka topic so all logs are consolidated in place - Be careful about supervisor saturation / overallocation i.e. running more threads than available CPU resources - Configure parallelism based on throughput requirements (use Throughput Equation to calculate) - Be aware of data spikes and plan for them in your topology / use cases / operations - If using Kafka, setup Kafka Producer-Consumer Lag analysis and dashboard - Benchmark topologies to assess "health" operations numbers for individual bolt performance
## Troubleshooting topology performance issues: - Check the topology stats from the Nimbus UI - Look for Failures (in spout or specific bolt) if any that module is having issues. - If a topology has Kafka error bolt then open a console consumer to looks at the error logs from Kafka - Now find out what supervisor the worker is running on and start tailing the logs of the topology (log files are named by name of the topology and the deployment number; you can find this from the nimbus ui) If you are seeing lag building there can be 2 major reasons for this: - a topology component is slow (bolt or spout) - the end sink is having issues; again sink specific exceptions will be present in output of the error bolt; follow instructions above to see the logs Topology issues due to slow bolts Topologies can have performance issues because some bolts are running slow. Bolts can be either be running slow due to 2 root causes: 1. the system they interact with is slow 2. the topology received a data spikes Mostly slow bolts are usually the ones interacting with other external systems e.g. lookup bolts or writer bolts.
Such bolts are slow because their external systems are experiencing transient performance slowdowns e.g. elasticsearch recovery, hbase region split, major gc etc. **How to find out what's the root cause of a slow bolt?** Analyze the capacity numbers and execute latencies, a slow bolt is the one with capacity >1.0 meaning there aren't enough bolt instances for the amount of data being processed.
Now next step is to find out if the root cause is external system slowdown or data spike. If execute latencies are reasonable i.e. around your normal value the external system is not the bottleneck therefore the root cause is data spike. Troubleshooting external system bottleneck External system bottleneck should be resolved before topology performance can be restored. Sometimes it may be beneficial to pause or stop the topology in order to recover the external system e.g. topology based data ingestion. ``Note: Sometimes data spike can trigger external system slowdown`` Troubleshooting data spike Data spikes can be resolved by adding more parallelism to the topology. This implies balanced addition of executors and workers i.e. incrementally adding more parallelism to the specific bolt and analyzing capacity numbers. This process should be repeated until the bolt capacity is below 1.0 ``Note: Add more executors only when the bolt latencies are low or code is linearly scaling`` Calculating current throughput Vs. expected throughput is critical for determining the correct configurations for parallelism and worker count. **Throughput = Executor count * 1000/(process latency) * Capacity** ``Note: Adding more executors doesn’t solve the problem if the supervisors are saturated`` Adding more workers to the topology to evenly balance out the load across supervisors is critical for good performance. If there are more executors scheduled per worker than there's CPU capacity for that worker this will result in CPU resource contention leading to cascading performance effects across other topology workers running on that supervisor. A key way to analyzing over allocation is summing the number of executors in the cluster and dividing them by sum of CPU cores across all supervisors. If this number is greater than 2 than you have overallocated resources however this may not result in performance issues as overallocating executors for IO bound bolts is sometimes required for performance and parallelism. Known Issues Ackers Number of ackers in Storm may get bottlenecked if there are too few of them. They may also get bottlenecked if there aren't enough ackers per worker because of network issues (seen this in AWS multi-az deployments) Tuple Failures Tuple failures happen because of 3 reasons: 1. Tuple timeouts Tuple timeouts are the trickiest to resolve. They could happen due to any of the following reasons: - Slow sink and bad max spout pending - Micro-batching and un-tunned batch sizes (data is not coming is fast enough for your batches to fill and you don't have time based acking) - Un-tunned max spout pending setting - In-correct tuple message timeout (default 30 seconds) 2. Bolt Exceptions Bolt exceptions will show in topology worker logs. 3. Explicit failures in the code i.e. calling collector.fail() method Explicit failures will be shown in Nimbus UI, in failure counts corresponding to the respective bolts. Saturated Workers Storm has the concept of slots which can only allow to limit saturation and overallocation on memory but not CPU / threads. This is something that has to be manually maintained by engineers as in be careful not to over provision number of threads (executors) / workers.
Symptoms: topology will appear to run slow without any tuple failures however lag is going to continue to build.
... View more
Labels:
12-29-2016
07:07 PM
To summarize, G1GC provides predictable GC times which is critical for real-time applications (like Kafka, Storm, Solr etc.).
The reasoning is to avoid stop the world garbage collection which will result in back pressure in heavy ingest environments.
... View more
12-29-2016
06:45 PM
Setup: To get maximum fault tolerance / performance, install Kafka on each node (source server) so you will end up with 3 nodes. Additionally you will also need zookeeper which you can once again install on all 3 nodes. The above setup is recommended if this is a PoC, however for production use it's recommended to have Kafka + Zookeeper on nodes other than your source nodes to provide fault tolerance. Additionally, Kafka uses a lot of OS page caching which may interfere with the application running the 3 nodes. Just to clarify Kafka shouldn't be confused with Flume, it's a MessageBroker service; you are responsible for ingesting data (use Flume) or reading data (e.g. Storm / Spark) Scaling: Scaling Kafka is a 3 step operation; step 3 is optional but recommended: Add nodes to the cluster (Use Ambari to add nodes) Alter Topic and add additional partitions (1 partition / node) (Optional) Rebalance Kafka
On a side note: Kafka works best when used in clustered mode, you can use single-node Kafka however it it fundamentally defeats the purpose of Kafka (partitioning and fault-tolerance)
... View more
12-28-2016
06:27 PM
I looked through the commit history for 0.10.x (Storm), don't see this line marked as INFO level.
The line is just a constructor initialization message therefore you can ignore it.
... View more
12-27-2016
06:27 PM
2 Kudos
It originates from Zookeeper client (Curator) and it's a class implementing exponential back off for zookeeper connection (Storm).
It's a DEBUG message in the current code base https://github.com/apache/storm/blob/master/storm-core/src/jvm/org/apache/storm/utils/StormBoundedExponentialBackoffRetry.java#L49
What version of Storm are you using?
... View more
11-23-2016
06:39 PM
@Constantin Stanca thank you for the feedback. I have reworded the point and added some explanations; please let me know your thoughts.
... View more
- « Previous
-
- 1
- 2
- Next »