Member since 03-16-2016
707 Posts, 1753 Kudos Received, 203 Solutions
10-21-2016
12:06 AM
13 Kudos
Sizing

There's no simple rule of thumb for this. It is as much an art as a science, because it depends on the workloads and how chatty they are with your current ZKs. One way to look at it: with 3 ZKs you can afford to lose one; with 5 you can afford to lose two. If your IT is aggressively applying security patches and other upgrades (firmware, kernel, Java, other packages used by Hadoop tools) and taking nodes down to do the job, then during those upgrades a 3-node ZK ensemble runs with only two nodes, and if you are unlucky and one of them goes down, your whole cluster goes down. So, in this case, 5 are better.

Warning: the more ZK nodes you have, the slower ZK becomes for writes.

Placement

ZooKeeper is a master service, so it can be colocated with other master services. Ideally, you would not colocate it with an HA service. It is quite light on memory and CPU, but since it is disk intensive, don't colocate it with disk-intensive services like Kafka or HDFS.

Storage Requirements

In general, ZooKeeper doesn't require huge drives because it only stores metadata for many services. It is common to use 100G to 250G for the ZooKeeper data directory and logs, which is fine for many cluster deployments. It is also recommended to configure the automatic purging policy for the snapshot and log directories (autopurge.snapRetainCount and autopurge.purgeInterval) so that they don't end up filling all the local storage.

Dedicated or Shared?

At Yahoo!, ZooKeeper is usually deployed on DEDICATED RHEL boxes, with dual-core processors, 2GB of RAM, and 80GB IDE hard drives.

For your Kafka/Storm cluster, you could consider deploying ZK on DEDICATED physical hardware (not virtual). The driving force for physical hardware, or at least for a dedicated disk, is the transaction log and the high-throughput nature of Kafka and Storm. Since Kafka is usually used with Storm, have a separate ZooKeeper cluster for Kafka and Storm. If Kafka and Storm share one, make sure that you don't put the ZooKeeper cluster on the Kafka nodes. Put ZooKeeper on the Storm nodes.

Caution
Rather than going to larger clusters of ZKs, it is better to split out certain services to their own ZKs when they put more pressure on an otherwise fairly quiet ZK cluster. It is a good thing to have a separate set of ZKs for each cluster: one quorum for Kafka, one quorum for Storm, one quorum for the rest (YARN, HBase, Hive, HDFS), and possibly separate ZooKeepers for HBase. The challenge is that more hardware and more administration are needed, but it pays off.

Be careful where you put that transaction log. The most performance-critical part of ZooKeeper is the transaction log. ZooKeeper must sync transactions to media before it returns a response. A dedicated transaction log device is key to consistently good performance. Putting the log on a busy device will adversely impact performance. If you only have one storage device, put trace files on NFS and increase the snapshotCount (the snapCount setting); it doesn't eliminate the problem, but it can mitigate it.

ZooKeeper's transaction log must be on a dedicated device. A dedicated partition is not enough. ZooKeeper writes the log sequentially, without seeking. Sharing your log device with other processes can cause seeks and contention, which in turn can cause multi-second delays.

Do not put ZooKeeper in a situation that can cause a swap. For ZooKeeper to function with any sort of timeliness, it simply cannot be allowed to swap. Remember, in ZooKeeper everything is ordered, so if one request hits the disk, all other queued requests hit the disk. Going to disk unnecessarily will almost certainly degrade your performance unacceptably. Therefore, make certain that the maximum heap size given to ZooKeeper is not bigger than the amount of real memory available to ZooKeeper.

Set your Java max heap size correctly. To avoid swapping, try to set the heap size to the amount of physical memory you have, minus the amount needed by the OS and cache. The best way to determine an optimal heap size for your configuration is to run load tests. If for some reason you can't, be conservative in your estimates and choose a number well below the limit that would cause your machine to swap. For example, on a 4G machine, a 3G heap is a conservative estimate to start with.

Best Practices
The ZooKeeper data directory contains the snapshot and transaction log files. It is good practice to clean up the directory periodically if the auto-purge option is not enabled. An administrator might also want to keep a backup of these files, depending on the application's needs. However, since ZooKeeper is a replicated service, only the data of one server in the ensemble needs to be backed up.

ZooKeeper uses Apache log4j as its logging infrastructure. As the log files grow, it is recommended that you set up auto-rollover of the log files using the built-in log4j feature for ZooKeeper logs.

The list of ZooKeeper servers used by the clients in their connection strings must match the list of ZooKeeper servers that each ZooKeeper server has. Strange behaviors might occur if the lists don't match.

The server lists in each ZooKeeper server configuration file should be consistent with the other members of the ensemble.

The ZooKeeper transaction log must be configured on a dedicated device. This is very important to achieve the best performance from ZooKeeper.

The Java heap size should be chosen with care. Swapping should never be allowed to happen on the ZooKeeper server. It is better if the ZooKeeper servers have a reasonably large amount of memory (RAM).

System monitoring tools such as vmstat can be used to monitor virtual memory statistics and decide on the optimal amount of memory, depending on the needs of the application. In any case, swapping should be avoided.

References:
https://community.hortonworks.com/questions/2498/best-practices-for-zookeeper-placement.html
https://community.hortonworks.com/questions/53025/zookeeper-performance-and-metrics-when-to-resize.html
https://community.hortonworks.com/questions/55868/zookeeper-on-even-master-nodes.html
Apache ZooKeeper Essentials by Saurav Haloi, Packt Publishing, 2015
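The sizing argument in this article (3 ZKs tolerate one failure, 5 tolerate two) follows from ZooKeeper needing a strict majority of the ensemble to stay available. Here is a minimal Python sketch of that arithmetic. The function names are made up for illustration and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be lost while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, servers_down: int) -> bool:
    """True if the surviving servers still form a majority."""
    return ensemble_size - servers_down >= quorum_size(ensemble_size)


if __name__ == "__main__":
    # Scenario from the article: one node is down for patching and then
    # a second node fails unexpectedly, so two servers are down at once.
    for n in (3, 4, 5):
        print(n, "servers: tolerates", tolerated_failures(n),
              "failure(s); survives patching plus one failure:",
              has_quorum(n, 2))
```

With these definitions, a 4-node ensemble tolerates no more failures than a 3-node one, which is one reason odd ensemble sizes are the usual choice.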
10-06-2017
06:21 PM
@uday kv See this new article: https://community.hortonworks.com/articles/141035/jmeter-kerberos-setup-for-hive-load-testing.html
08-12-2016
11:03 PM
3 Kudos
Introduction

This article is not meant to show how to install NiFi or create a “Hello World” NiFi data flow, but how to resolve a data filtering problem with NiFi using two approaches: a filter list kept as a file on disk, which could be static or dynamic, and a list stored in a distributed cache populated from the same file. The amount of data used was minimal and simplistic, so no performance difference can be perceived; at scale, however, where memory is available, a caching implementation should perform better.

This article assumes some familiarity with NiFi: knowing what a processor or a queue is, how to set the basic configuration for a processor or a queue, how to visualize the data at various steps throughout the flow, and how to start and stop processors.

Since you are somewhat familiar with NiFi, you probably know how to install and start it; however, I will provide a quick refresher below.

Pre-requisites

For this demo, I used the latest version of NiFi available at the time of working on this demo, 0.6.1. This version was not part of HDP 2.4.2, which was available at the time of this demo; that release also has 0.5.1. HDP 2.5 was just launched last month at the Hadoop Summit in Santa Clara.

On OSX, brew install nifi will only install NiFi 0.5.1, which does not have some of the features needed for the demo, e.g. PutDistributedMapCache or FetchDistributedMapCache. Instead, use the following steps.

A reference about downloading and installing on Linux and Mac is here. You can download NiFi 0.6.1 from any of the sites listed there, for example: https://www.apache.org/dyn/closer.lua?path=/nifi/0.6.1/nifi-0.6.1-bin.tar.gz

I prefer wget and installing my apps to /opt:

cd /opt
wget http://supergsego.com/apache/nifi/0.6.1/nifi-0.6.1-bin.tar.gz

That will download a 421.19MB file.

tar -xvf nifi-0.6.1-bin.tar.gz
ls -l

and here is your /opt/nifi-0.6.1

cd /opt/nifi-0.6.1/bin

That is your NIFI_HOME.

./nifi.sh start

Open a browser and go to: http://localhost:8080/nifi

You will need to import the NiFi .xml template posted in my GitHub repo. Clone it to your local folder of preference, assuming that you have a git client installed:

git clone https://github.com/cstanca1/nifi-filter.git

After importing the model, instantiate it.

Required Changes

For the template to work with your specific folder structure, you will need to make a few changes to tell the GetFile processor where to get the data (right-click the GetFile processor header, View Configuration, Properties tab, Input Directory). Keep in mind that once started, the GetFile processor reads the file and deletes it. If you want to re-feed it for a test, just drop it again in the same folder and it will be re-ingested. You can also place multiple files of the same structure in that folder, and every line of each file will be ingested. In real life, GetFile can be replaced with a different processor capable of reading from an actual log. For this demo, I used a static file as input.

Also, enable and start the DistributedMapCacheServer Controller Service. This is required for the put and fetch distributed cache processors. The DistributedMapCacheServer can be started just like any other Controller Service, by configuring it to be valid and hitting the "start" button. The unique thing about the DistributedMapCacheServer is that processors work with the cache through a DistributedMapCacheClientService. So you will create both a Server and a Client Service, then configure the processor to use the Client Service. Next, start both the server and the client service. Finally, start the processor.

Test Data

For your test, you can use the two files checked in to the git repo that you just cloned locally: macaddresses-blacklist.txt and macaddresses-log.txt. macaddresses-blacklist.txt is a list of blacklisted MAC addresses used to filter the incoming stream fed by macaddresses-log.txt, which GetFile ingests line by line.

To understand what happens step by step, I suggest starting each processor in turn and inspecting the queue and data lineage.

Populating the DistributedMapCache is performed in the flow presented on the right side of the model that you imported at the previous step. The filtering flow, via ScanAttribute or FetchDistributedMapCache, is on the left side.

The use of the GetFile, SplitText and ExtractText processors is well documented, and a basic Google search will return several good examples; however, how to use FetchDistributedMapCache and PutDistributedMapCache is not that well documented. That was the main reason to write this article. I could not find another good reference. I am sure others felt the same way and hopefully this helps.

ScanAttribute Approach

Before starting this processor, right-click its header, choose Configuration, go to the Properties tab and change the Dictionary File to your macaddresses-blacklist.txt. This is a copy of the same file you use in the GetFile processor, but I suggest putting it in a separate folder so that GetFile will not ingest it and delete it after use. It needs to be permanent, like a lookup file on disk.

SplitText is used to split the file line by line. You can check this by right-clicking the header of the processor and choosing the “Properties” tab. Line Split Count is set to 1.

The ExtractText processor uses a custom regex to extract the MAC address from macaddresses-log.txt. You can find it in Properties, as the last property in the list.

The ScanAttribute processor sets the Dictionary File to the folder/file of choice. In this demo, I used the macaddresses-blacklist.txt file included in the repo that you cloned at one of the previous steps.

DistributedCache Approach

The right branch of the flow on the left uses the DistributedCache populated by the flow on the right of the model. Inspect each processor by checking its properties. They are similar to the first half of the flow on the left, except for the use of the PutDistributedMapCache processor, which sets the Cache Entry Identifier to the mac.address value. I’ll refer only to the consumption of the mac.address property value set by PutDistributedMapCache: in FetchDistributedMapCache, set the Cache Entry Identifier to the same mac.address.

Please note that DistributedMapCacheClientService must be enabled. You can do that by clicking the NiFi Flow Settings icon, the fourth in the right corner, and then "Controller Services".

Learning Technique

Don’t forget to start all processors and inspect the queues.
My approach is to start processors one at a time, in the order of the data flow, and check all the stats and lineage on the connection queues. This is what I love about NiFi: it is so easy to test and learn.

Credit

Thanks to Simon Ball for taking a few minutes of his time to review the model on the DistributedCache approach.

Conclusion

Dynamic filtering has wide applicability in any type of simple event processing. I am sure that there are many other ways to skin the same cat with NiFi. Enjoy!
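For readers who want to reason about the filtering logic outside NiFi, here is a rough Python sketch of what both approaches boil down to: extract a MAC address from each log line and look it up in a list built from the blacklist file. It is not NiFi code. The regex, the one-address-per-line blacklist format and the sample lines are assumptions, since the article does not show the regex configured in ExtractText or the contents of the repo files.

```python
import re

# Assumed MAC format: six colon-separated hex pairs. The regex actually
# configured in the template's ExtractText processor may differ.
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def load_blacklist(lines):
    """Build the lookup set from blacklist lines (one MAC per line, assumed)."""
    return {line.strip().lower() for line in lines if line.strip()}


def split_matches(log_lines, blacklist):
    """Return (matched, unmatched) log lines, handling one line at a time."""
    matched, unmatched = [], []
    for line in log_lines:
        found = MAC_RE.search(line)
        mac = found.group(0).lower() if found else None
        # Lines without a MAC address are treated as unmatched.
        (matched if mac in blacklist else unmatched).append(line)
    return matched, unmatched


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the repo files.
    blacklist = load_blacklist(["AA:BB:CC:00:11:22\n", "\n"])
    log = [
        "2016-08-12 login aa:bb:cc:00:11:22 ok",
        "2016-08-12 login de:ad:be:ef:00:01 ok",
    ]
    hits, misses = split_matches(log, blacklist)
    print("blacklisted:", hits)
    print("clean:", misses)
```

In this sketch the in-memory set stands in for both the dictionary file and the distributed cache. The difference between the two NiFi approaches lies in where the list lives and how it is refreshed. The lookup itself is the same.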
07-14-2016
02:21 AM
Multi-Polygon example:
SELECT st_astext(ST_MultiPolygon('multipolygon (((0 0, 0 1, 1 0, 0 0)), ((2 2, 2 3, 3 2, 2 2)))'))
FROM YourTable LIMIT 1;
04-18-2017
03:45 PM
@rbiswas What about adding a new host and assigning it to a specific rack at the same time? I'd like to avoid adding the host (data node) and then having to set the rack and restart again. Is there a way to do this via the Ambari UI?
07-06-2016
09:32 PM
@Alex McLintock You are correct. This is a local reference to a remote repo. Here is a reference for an actual local repo with no Internet access, but it gets complicated with Vagrant: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_Installing_HDP_AMB/content/_setting_up_a_local_repository_with_no_internet_access.html
06-20-2017
04:59 AM
Hi, are there any steps for sending our own Kafka server metrics to Elasticsearch? We have Grafana with all the dashboards, but some of our project requirements call for keeping a few Kafka metrics in Kibana visualizations, so we want to index the Kafka metrics logs in Elasticsearch. Can we consume Kafka metrics into Elasticsearch?
05-24-2016
03:53 PM
@Constantin Stanca What do you mean by "Additionally, you need to start the JVM with something like this in order to be able to truly access the JVM remotely"? JVMs start as they normally do when using this tool.
05-27-2016
09:34 PM
@Vladimir Zlatkin Updated section "NUMA optimization" to include a link to OS CPU optimizations for RHEL. "Spark applications performance could be improved not only by configuring various Spark parameters and JVM options, but also using the operating system side optimizations, e.g. CPU affinity, NUMA policy, hardware performance policy etc. to take advantage of the most recent hardware NUMA capable." The referenced section is vast. We could narrow some of the settings down to the best choices. This could be a follow-up article if part 1 generates real interest.