10-21-2016
09:11 PM
15 Kudos
Introduction

How many times have we heard that a problem can't be solved unless it is completely understood? Yet all of us, however much experience and technical skill we have, still tend to jump straight into the solution space, bypassing key steps of the scientific approach. We just don't have enough patience and, quite often, the business is on our back to deliver a quick fix. We just want to get it done! We find ourselves like the squirrel in Ice Age, chasing the acorn on experience and minimal experimentation, dreaming that the acorn will fall into our lap. This article re-emphasizes a repeatable approach to tuning Storm topologies: understand the problem, form a hypothesis, test that hypothesis, collect the data, analyze the data, and draw conclusions based on facts rather than intuition. Hopefully it helps you catch more acorns!

Understand the Problem

Understand the functional problem the topology solves; quite often the best tuning is executing a step differently in the workflow for the same end result. For example, apply a filter first and then execute an expensive step only on the records that pass, rather than executing the step for every record and paying the penalty every time. You can also prioritize success, or learn to lose in order to win.

Define SLA for Business Success

This is also the time to get a sense of the business SLA and forecasted growth. Document the SLA by topology. Try to understand the reasons behind the SLA and identify opportunities for trade-offs.

Gather Data

Gather data using Storm UI and other tools specific to your environment. Storm UI is your first go-to tool for tuning, most of the time. Another tool is Storm's built-in metrics-collecting API, available since 0.9.x, which lets you embed metrics in your own topology code. You may have tuned the topology, deployed it to production, and made everybody happy, only to see the issue come back two days later. Wouldn't it be nice to have metrics embedded in your code, already focused on the usual suspects: external calls to a web service, queries against a lookup data store, specific customers, or some other strong business entity that, given the data and the workflow, can become the bottleneck?
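As a sketch of what that can look like with the 0.9.x metrics API (the class name LookupBolt and the metric names are illustrative, not from any particular topology): the bolt registers a built-in CountMetric and a ReducedMetric in prepare(), updates them around its external call, and the values are reported every 60 seconds.

```java
import java.util.Map;
import backtype.storm.metric.api.CountMetric;
import backtype.storm.metric.api.MeanReducer;
import backtype.storm.metric.api.ReducedMetric;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class LookupBolt extends BaseRichBolt {
    private transient CountMetric lookupCalls;       // number of external lookups
    private transient ReducedMetric lookupLatency;   // mean lookup latency in ms
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // publish both metrics to the registered metrics consumer every 60 seconds
        lookupCalls = context.registerMetric("lookup-calls", new CountMetric(), 60);
        lookupLatency = context.registerMetric("lookup-latency-ms",
                new ReducedMetric(new MeanReducer()), 60);
    }

    @Override
    public void execute(Tuple tuple) {
        long start = System.currentTimeMillis();
        // ... call the web service / lookup data store here ...
        lookupCalls.incr();
        lookupLatency.update(System.currentTimeMillis() - start);
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}
```

To actually see the numbers, register a consumer in the topology config, e.g. conf.registerMetricsConsumer(backtype.storm.metric.LoggingMetricsConsumer.class, 1);, which writes the metrics to the worker logs.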
In a Storm topology, spouts, bolts, and workers can each play a role in your performance challenges. From Storm UI, learn about topology status, uptime, number of workers, executors, and tasks, high-level stats across four time windows, spout statistics, bolt statistics, a visualization of the spouts and bolts and how they are connected, the flow of tuples between all of the streams, and all configurations for the topology.

Define Baseline Numbers

Document statistics about how the topology performs in production. Set up a test environment and exercise the topology with different loads aligned to realistic business SLAs.

Tuning Options vs. Bottleneck
Increase parallelism (use the rebalance command) and analyze the change in the capacity metric in the same Storm UI. If there is no improvement, focus on the bolt; that is the likely bottleneck. If you see a backlog upstream (most likely in Kafka), focus on the spout. After experimenting with spouts and bolts, shift focus to worker parallelism. The basic principle is to scale on a single worker with executors until adding executors no longer helps. If adjusting spout and bolt parallelism fails to provide additional benefit, play with the number of workers to see whether you are now bound by a single JVM and need to parallelize across JVMs.
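The rebalance command lets you adjust worker and executor counts on a running topology without redeploying it. A minimal sketch (the topology and component names are illustrative):

```
# spread "my-topology" across 4 workers, giving the Kafka spout 2 executors
# and the suspect lookup bolt 8; wait 30s for in-flight tuples to drain first
storm rebalance my-topology -w 30 -n 4 -e kafka-spout=2 -e lookup-bolt=8
```

Note that executors can only be scaled up to the number of tasks set when the topology was submitted, so leave yourself headroom there if you plan to rebalance.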
If you still don't meet the SLA after tuning the existing spouts, bolts, and workers, it's time to start controlling the rate at which data flows into the topology.
Max spout pending sets the maximum number of tuples that can be unacked at any given time. If the number of possible unacked tuples is lower than the total parallelism you've set for your topology, it can itself become the bottleneck. The goal, with one or many spouts, is to ensure that the maximum possible number of unacked tuples is greater than the maximum number of tuples your parallelization can process, so you can feel safe saying that max spout pending isn't causing a bottleneck. Without max spout pending, tuples will continue to flow into your topology whether or not you can keep up with processing them; with it, you control the ingest rate. Max spout pending lets you erect a dam in front of the topology, apply backpressure, and avoid being overwhelmed. Despite being optional, you should always set it.
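Setting it is a one-liner on the topology config; a minimal sketch (the ceiling of 1,000 is illustrative; the value applies per spout task and only takes effect in reliable, i.e. acking, topologies):

```java
import backtype.storm.Config;

public class TuningConfig {
    public static Config withIngestCeiling() {
        Config conf = new Config();
        // at most 1,000 tuples may be pending (emitted but not yet acked)
        // per spout task; tune this against your parallelism and latency
        conf.setMaxSpoutPending(1000);
        return conf;
    }
}
```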
When attempting to increase performance to meet an SLA, increase the rate of data ingest by either increasing spout parallelism or increasing max spout pending. Assuming a 4x increase in the maximum number of active tuples allowed, you would expect the speed of messages leaving the queue to increase (maybe not by a factor of four, but it would certainly increase). If that pushes the capacity metric of any bolt back to one or near it, tune the bolts again, and repeat with spouts and bolts until you hit the SLA. These methods can be applied over and over until the SLAs are met; the effort is high, so automating the steps is desirable.

Deal with External Systems

When interacting with external services (such as a SOA service, database, or filesystem), it is easy to ratchet parallelism up to a level where limits in that external service keep your capacity from going higher. Before you start tuning parallelism in a topology that interacts with the outside world, be positive you have good metrics on that service. Otherwise you could keep turning up the parallelism on a bolt to the point that it brings the external service to its knees, crippling it under a mass of traffic it was never designed to handle.

Latency
Let's talk about one of the greatest enemies of fast code: latency. There is latency in accessing local memory, the hard drive, and other systems over the network. Different interactions have different levels of latency, and understanding the latency in your system is one of the keys to tuning your topology. You will usually get fairly consistent response times, and then all of a sudden those response times vary wildly because of any number of factors:

- The external service is having a garbage-collection event.
- A network switch somewhere is momentarily overloaded.
- Your coworker wrote a runaway query that is currently hogging most of the database's CPU.
- Something about the data itself is likely to cause the delay.

As the topology interacts with external services, be smarter about latency. You don't always need to increase parallelism and allocate more resources. You may discover, after investigation, that the variance is tied to a customer or induced by an exceptional event, e.g. elections or the Olympics. Certain records are simply going to be slower to look up, and sometimes one of the "fast" lookups ends up taking longer.
One strategy that has worked well with services that exhibit this problem is to perform the initial lookup attempt with a hard ceiling on the timeout (try various ceilings) and, if that fails, send the tuple to a less parallelized instance of the same bolt that runs with a longer timeout. The end result is that the time to process a large number of messages goes down. Your mileage will vary with this strategy, so test before you use it.
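A minimal sketch of the hard-ceiling part, assuming a thread pool and a placeholder queryExternalService call (neither is from the original article). In a bolt, a null result would be emitted to a separate stream feeding the slower, less parallelized instance:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeboxedLookup {
    private final ExecutorService pool = Executors.newFixedThreadPool(16);

    /** Fast-path lookup with a hard ceiling; null means "route to the slow bolt". */
    public String lookup(String key, long ceilingMs) {
        Future<String> future = pool.submit(() -> queryExternalService(key));
        try {
            return future.get(ceilingMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException | InterruptedException e) {
            future.cancel(true); // abandon the fast attempt, let the slow path retry
            return null;
        }
    }

    private String queryExternalService(String key) {
        return "value-for-" + key; // placeholder for the real web-service/data-store call
    }
}
```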
As part of your tuning strategy, you may end up splitting a bolt in two: one for fast lookups and the other for slow lookups. At the very least, let what is fast go fast; don't back it up behind something slow. Treat the slow lookups differently and make sure they remain the exception. You may even decide to handle them out of band and still extract some value from that data. It is up to you to determine what is more important: accepting an overall bottleneck, marking slow records for later processing, or disregarding the ones producing the bottleneck. Maybe a batch approach for those is more efficient. This is a design consideration and, most of the time, a reasonable trade-off. Are you willing to accept the occasional loss of fidelity and still hit the SLA, because that matters more to your business? Is perfection helping or killing your business?

Summary
- All basic timing information for a topology can be found in the Storm UI.
- Metrics (both built-in and custom) are essential if you want a true understanding of how your topology is operating.
- Establishing a baseline set of performance numbers for your topology is the essential first step in the tuning process.
- Bottlenecks are indicated by a high capacity value for a spout/bolt and can be addressed by increasing parallelism.
- Increasing parallelism is best done in small increments so that you can gain a better understanding of the effects of each increase.
- Latency, both related to the data (intrinsic) and unrelated to it (extrinsic), can reduce your topology's throughput and may need to be addressed.
- Be ready for trade-offs.
References

Storm Applied: Strategies for Real-Time Event Processing by Sean T. Allen, Matthew Jankowski, and Peter Pathirana (Manning, 2015)
10-21-2016
12:06 AM
13 Kudos
Sizing

There's no simple rule of thumb for this; it's as much an art as a science, since it depends on the workloads and how chatty they are with your current ZooKeeper nodes. One way to look at it: ZooKeeper needs a majority of nodes up to serve requests, so with 3 ZKs you can afford to lose one, and with 5 you can afford to lose two. If your IT is aggressively applying security patches and other upgrades (firmware, kernel, Java, other packages used by Hadoop tools) and taking nodes down to do the job, then during those upgrades a 3-node ensemble runs with only two nodes, and if you are unlucky and one of them goes down, your whole cluster goes down with it. So in that case, 5 is better than 3. Warning: the more ZK nodes you have, the slower ZK becomes for writes, because every write must be acknowledged by a quorum.

Placement
ZooKeeper is a master service; as such, it can be collocated with other master services. Ideally, you would not want to collocate it with an HA service. It is quite light on memory and CPU requirements, but since it is disk-intensive, don't collocate it with other disk-intensive services like Kafka or HDFS.

Storage Requirements

In general, ZooKeeper doesn't require huge drives because it only stores metadata for the services it supports; 100G to 250G for the ZooKeeper data directory and logs is fine for many cluster deployments. Moreover, it is recommended to configure an automatic purging policy for the snapshot and log directories so that they don't end up filling all the local storage.
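In zoo.cfg, that purging policy looks like the following (the values are illustrative; the autopurge settings require ZooKeeper 3.4+):

```
# keep only the 3 most recent snapshots and matching transaction logs
autopurge.snapRetainCount=3
# run the purge task every 24 hours (0 disables auto-purge)
autopurge.purgeInterval=24
```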
Dedicated or Shared?

At Yahoo!, ZooKeeper is usually deployed on dedicated RHEL boxes with dual-core processors, 2GB of RAM, and 80GB IDE hard drives. For your Kafka/Storm cluster, consider deploying ZK on dedicated physical hardware (not virtual). The driving force for physical hardware, or at least for a dedicated disk, is the transaction log and the high-throughput nature of Kafka and Storm. Since Kafka is usually used with Storm, have a separate ZooKeeper cluster for Kafka and Storm. If Kafka and Storm must share one, make sure you don't put the ZooKeeper cluster on the Kafka nodes; put it on the Storm nodes instead.

Caution
Rather than growing to larger ZK clusters, it is better to split certain services out to their own ZKs when they put pressure on an otherwise fairly quiet ZK cluster. It is a good thing to have a separate set of ZKs per cluster: one quorum for Kafka, one quorum for Storm, one quorum for the rest (YARN, HBase, Hive, HDFS), possibly a separate quorum for HBase. The challenge is that more hardware and more administration are needed, but it pays off.

Be careful where you put the transaction log. The most performance-critical part of ZooKeeper is the transaction log, and ZooKeeper must sync transactions to media before it returns a response. A dedicated transaction-log device is key to consistent good performance; putting the log on a busy device will adversely impact performance. If you only have one storage device, put trace files on NFS and increase the snapshotCount; it doesn't eliminate the problem, but it can mitigate it. A dedicated partition is not enough: ZooKeeper writes the log sequentially, without seeking, and sharing the log device with other processes can cause seeks and contention, which in turn can cause multi-second delays.

Do not put ZooKeeper in a situation where it can swap. For ZooKeeper to function with any sort of timeliness, it simply cannot be allowed to swap. Remember that in ZooKeeper everything is ordered, so if one request hits the disk, all other queued requests hit the disk; going to disk unnecessarily will almost certainly degrade performance unacceptably. Therefore, make certain that the maximum heap size given to ZooKeeper is not bigger than the amount of real memory available to it. To avoid swapping, set the heap size to the amount of physical memory you have, minus what is needed by the OS and cache. The best way to determine an optimal heap size for your configuration is to run load tests; if for some reason you can't, be conservative in your estimates and choose a number well below the limit that would cause your machine to swap. For example, on a 4G machine, a 3G heap is a conservative starting point.
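A sketch of the two relevant knobs, one in zoo.cfg and one in conf/java.env (the paths and heap size are illustrative):

```
# zoo.cfg: snapshots on one device, transaction log on its own dedicated disk
dataDir=/hadoop/zookeeper
dataLogDir=/zklog/zookeeper

# conf/java.env: cap the heap below physical RAM so ZK never swaps
export JVMFLAGS="-Xmx3g"
```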
Best Practices

The ZooKeeper data directory contains the snapshot and transaction log files. It is good practice to periodically clean up this directory if the auto-purge option is not enabled, and an administrator may want to keep a backup of these files, depending on application needs. Since ZooKeeper is a replicated service, you only need to back up the data of one of the servers in the ensemble.

ZooKeeper uses Apache log4j as its logging infrastructure. As the logfiles grow in size, it is recommended that you set up auto-rollover of the logfiles using the built-in log4j facilities.

The list of ZooKeeper servers used by clients in their connection strings must match the list of ZooKeeper servers that each ZooKeeper server has; strange behaviors can occur if the lists don't match. Likewise, the server list in each ZooKeeper server's configuration file should be consistent with those of the other members of the ensemble.

The ZooKeeper transaction log must be configured on a dedicated device. This is very important for achieving the best performance from ZooKeeper.

The Java heap size should be chosen with care; swapping should never be allowed to happen on a ZooKeeper server. It helps if the ZooKeeper servers have reasonably large memory (RAM). System monitoring tools such as vmstat can be used to watch virtual-memory statistics and decide on the optimal memory size, depending on the application's needs. In any case, swapping must be avoided.

References:
https://community.hortonworks.com/questions/2498/best-practices-for-zookeeper-placement.html
https://community.hortonworks.com/questions/53025/zookeeper-performance-and-metrics-when-to-resize.html
https://community.hortonworks.com/questions/55868/zookeeper-on-even-master-nodes.html
Apache ZooKeeper Essentials by Saurav Haloi (Packt Publishing, 2015)
10-21-2016
03:50 AM
4 Kudos
@Hajime Basically, Ambari Infra is only a wrapper, a service like HDFS or YARN, that deploys and manages different components. At the moment, the only component is Solr, which is open source. As Constantin pointed out, the Ambari Infra stack can be found here: https://github.com/apache/ambari/tree/2ad42074f1633c5c6f56cf979bdaa49440457566/ambari-server/src/main/resources/common-services/AMBARI_INFRA/0.1.0
04-05-2017
03:12 PM
@Constantin Stanca I thought the proper way to do maintenance on a data node is to decommission it, so that the following happens:

- DataNode: safely replicates its HDFS data to other DataNodes
- NodeManager: stops accepting new job requests
- RegionServer: turns on drain mode

In an urgent situation, I could agree with your suggestion. However, please advise on the right approach in a scenario where you have the luxury of choosing the maintenance window.
10-11-2016
01:23 AM
4 Kudos
@Smart Solutions The two main options for replicating an HDFS structure are Falcon and distcp. The distcp command is not very feature-rich: you give it a path in the HDFS structure and a destination cluster, and it copies everything to the same path on the destination; if the copy fails, you need to start it again (a minimal example is sketched below). The other method for maintaining a replica of your HDFS structure is Falcon, which offers more data-movement options and lets you manage the lifecycle of all the data on both sides more effectively. If you're moving Hive table structures, there is some added complexity in making sure the tables are created on the DR side, but moving the actual files is done the same way. Since you excluded distcp as an option, I suggest looking at Falcon. Check this: http://hortonworks.com/hadoop-tutorial/mirroring-datasets-between-hadoop-clusters-with-apache-falcon/ If any response addressed your question, please vote and accept the best answer.
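For reference, a minimal distcp run of the kind described above could look like this (host names and paths are illustrative):

```
# copy /data/projects to the same path on the DR cluster
hadoop distcp hdfs://prod-nn:8020/data/projects hdfs://dr-nn:8020/data/projects

# on later runs, copy only what changed and remove deleted files on the target
hadoop distcp -update -delete hdfs://prod-nn:8020/data/projects hdfs://dr-nn:8020/data/projects
```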
10-06-2017
06:21 PM
@uday kv See this new article: https://community.hortonworks.com/articles/141035/jmeter-kerberos-setup-for-hive-load-testing.html
10-13-2016
01:25 AM
@ed day Please consider revising your posts to remove your server names; this is a public forum, and you're compromising your security.
09-30-2016
05:28 PM
5 Kudos
@Vijaya Narayana Reddy Bhoomi Reddy Edge nodes, while they may be in the same subnet as your HDP clusters, are not really part of the actual clusters, and as such there is no HDP configuration trick to redirect traffic via the edge nodes and the Private Link. If you wish to use the 10 Gb Private Link, it is just a matter of working with your network team to have those HDP clusters communicate via that Private Link instead of the firewall-channeled network (though I doubt they will want to do that). You did not put a number next to the "Firewall" line, but I assume its bandwidth is much smaller, since you want to use the other path. Maybe the network team needs to upgrade the firewall-channeled network to meet the SLA; that is the correct approach, rather than a trick that uses the Private Link between edge nodes. It would meet your SLA and also keep the network team happy by leaving the firewall function in place. The network team may be able to peer up those clusters to redirect the traffic through the Private Link without going through the edge nodes, bypassing the firewall-channeled network, but I am pretty sure that would break their network design principles. The best approach is to upgrade the firewall-channeled network to meet your needs.
10-02-2016
03:18 AM
3 Kudos
@Eric Periard The one you really need (1.1.0) is here: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-examples/1.1.0 Please don't forget to vote and accept the response that helped.