Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4098 | 10-18-2017 10:19 PM |
| | 4345 | 10-18-2017 09:51 PM |
| | 14851 | 09-21-2017 01:35 PM |
| | 1840 | 08-04-2017 02:00 PM |
| | 2424 | 07-31-2017 03:02 PM |
02-07-2017
02:16 PM
1 Kudo
Your region server size is just about right, and your memstore flush size seems reasonable too. How many column families do you have in your tables?

Use the following to determine your number of regions. Usage of the region server's memstore largely determines the maximum number of regions for the region server. Each region has its own memstores, one per column family, which grow to a configurable size, usually between 128 and 256 MB. Administrators specify this size with the hbase.hregion.memstore.flush.size property in the hbase-site.xml configuration file. The region server dedicates some fraction of total memory to region memstores based on the value of the hbase.regionserver.global.memstore.size configuration property. If usage exceeds this configurable size, HBase may become unresponsive or compaction storms might occur.

Use the following formula to estimate the number of regions for a region server:

(regionserver_memory_size) * (memstore_fraction) / ((memstore_size) * (num_column_families))

For example, assume the following configuration:

- Region server with 16 GB RAM (or 16384 MB)
- Memstore fraction of 0.4
- Memstore flush size of 128 MB
- 1 column family per table

The formula for this configuration works out to:

(16384 MB * 0.4) / (128 MB * 1) = approximately 51 regions

The easiest way to decrease the number of regions for this example region server is to increase the memstore flush size to 256 MB. The reconfigured region server would then have approximately 25 regions, and the HBase cluster will run more smoothly if the reconfiguration is applied to all region servers in the cluster. The same formula works for multiple tables with the same configuration by using the total number of column families across all the tables.
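A quick sketch of the same arithmetic in Python (the function and parameter names are purely illustrative; plug in your own heap size, memstore fraction, flush size, and column-family count):

```python
def estimated_regions(regionserver_memory_mb, memstore_fraction,
                      memstore_flush_size_mb, num_column_families):
    """Estimate how many regions a region server can comfortably host."""
    return (regionserver_memory_mb * memstore_fraction) / (
        memstore_flush_size_mb * num_column_families)


# The example above: 16 GB heap, 0.4 memstore fraction, 128 MB flush size,
# one column family -> about 51 regions.
print(int(estimated_regions(16384, 0.4, 128, 1)))   # 51

# Doubling the flush size to 256 MB roughly halves the estimate -> about 25.
print(int(estimated_regions(16384, 0.4, 256, 1)))   # 25
```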
02-07-2017
02:11 PM
2 Kudos
According to the following link, this is the expected behavior (notice that `show databases` is not mapped to a Ranger permission): https://cwiki.apache.org/confluence/display/RANGER/Hive+Commands+to+Ranger+Permission+Mapping
02-07-2017
05:01 AM
1 Kudo
Snapshots capture metadata at a point in time, so you can recover the state of your cluster to a certain point in time if someone accidentally deletes something, or if some other failure means you would like to roll back to that point.

One thing to remember with snapshots is that they only capture state at a point in time. No data copying occurs with a snapshot; it's a pretty quick O(1) operation. However, if you are creating many snapshots, then when you delete data, Hadoop will see a snapshot pointing to that data and, instead of deleting your data, it will move it to an archive folder. Many people are surprised that their disk space is not freeing up even though they have deleted a lot of data (they don't even see the deleted files, since they have been moved to an archive folder). The culprit in these cases is usually a snapshot.

That said, snapshots alone are barely enough for true backups. Across the industry, the mechanism for backups and disaster recovery is replication (for Hive and HDFS as well as for HBase). So what you should be looking at is replication, and then see whether making snapshots also makes sense for your use cases (it's almost always useful to have some snapshots; fewer snapshots means less disk space used).

To copy data between clusters, you will use distcp. Check the following link: https://hadoop.apache.org/docs/r1.2.1/distcp2.html

If my answer helped, please accept.
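For illustration only, a minimal sketch that drives the snapshot and distcp command-line tools from Python. The paths and namenode addresses are placeholders, and it assumes the `hdfs`/`hadoop` clients are on the PATH with the necessary permissions already in place:

```python
import subprocess


def run(cmd):
    """Echo a command and run it, raising if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Allow snapshots on a directory (one-time, admin operation), then take one.
run(["hdfs", "dfsadmin", "-allowSnapshot", "/data/important"])
run(["hdfs", "dfs", "-createSnapshot", "/data/important", "before-migration"])

# Copy the directory to a second cluster with distcp. "-update" skips files
# that already match on the target, so repeated runs act like an incremental sync.
run(["hadoop", "distcp", "-update",
     "hdfs://source-nn:8020/data/important",
     "hdfs://backup-nn:8020/data/important"])
```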
02-06-2017
05:22 AM
But you still have 6200 regions per region server? Is that right? That alone can cause a lot of master activity. You need to change your settings so you have fewer regions per region server. What is the heap size of your region server? What is the memstore flush size for each table (hbase.hregion.memstore.flush.size)?
02-06-2017
05:16 AM
@Naresh Kumar Korvi Can you update your minimum number of entries to the minimum you actually want, let's say 500 files? Also, since it's all similar data going into one file, I am assuming the flow file attributes are the same. Can you change your merge strategy to Defragment? Finally, I assume your flow is something like this: ConsumeKafka -> MergeContent -> PutHDFS. Is that right?
02-06-2017
01:22 AM
1 Kudo
The HBase Master (HMaster) has a small set of responsibilities that does not require a lot of memory. The HMaster handles administrative tasks like assigning regions, updating the meta table, and serving DDL statements. Clients do not interact with the HMaster when they need to read/scan/write data. You can easily reduce your HMaster heap size to 4 GB.

That being said, 6200 regions per region server is far too high. Is this uniform across the cluster, or is it the result of hot spotting on some regions, which would indicate poor key design? I used to recommend no more than 200 regions per region server, and I am aware that with newer improvements this can be pushed to maybe 500 regions on the high side, but 6200 regions per region server is unheard of. If you are seeing performance issues and running into failures, then you need to fix this first. If your regions are not balanced, check whether the HBase region balancer is enabled (it is enabled by default).
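To see why 6200 regions is so far out of range, the memstore sizing formula from earlier can be inverted. A sketch, assuming a 128 MB flush size, one column family, and a 0.4 global memstore fraction (none of which were confirmed in this thread):

```python
regions = 6200
memstore_flush_size_mb = 128   # hbase.hregion.memstore.flush.size (assumed)
column_families = 1            # assumed
memstore_fraction = 0.4        # hbase.regionserver.global.memstore.size (assumed)

# Heap the region server would need for every memstore to be able to fill.
required_heap_mb = regions * memstore_flush_size_mb * column_families / memstore_fraction
print(f"Implied heap: {required_heap_mb / 1024:,.0f} GB")  # roughly 1,938 GB
```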
02-05-2017
04:32 AM
1 Kudo
@hardik desai Just to add to Artem's answer: yes, what you are trying to do is possible with HDFS federation, but like Artem said, it's not supported. What is your motivation for doing it? Do you really have multiple PBs of data, where you will have thousands of nodes and your NameNode will not be able to handle the cluster? If not, then federation is not the right way to implement multi-tenancy.
02-04-2017
04:09 AM
I think there is confusion about how we are defining colocated clients, which comes from the way I first understood and answered your question. Your edge nodes are different from what Hadoop calls a colocated client, which is probably what you have read somewhere.

When a MapReduce job runs, it spawns a number of mappers. Each mapper is usually reading data on the node it is running on (the local data). Each of these mappers is a client, or colocated client, assuming it's reading local data. However, these mappers were not taking advantage of the fact that the data they read is local to where they run (a mapper in some cases might read data from a remote machine, which is inefficient when it happens). The mappers were using the same TCP protocol to read local data as they would use to read remote data. It was determined that performance could be improved by about 20-30% just by making these mappers read data blocks directly off of disk when they can be made aware of the local data. Hence this change was made to improve that performance.

If you would like more details, please see the following JIRA: https://issues.apache.org/jira/browse/HDFS-347 If you scroll down in this JIRA you will find a design document, which should clear up any confusion you may have.
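For reference, the short-circuit local reads that came out of HDFS-347 are switched on with a couple of hdfs-site.xml properties. A sketch; the socket path below is only an example value, so check your distribution's defaults before copying:

```xml
<!-- hdfs-site.xml (client and DataNode sides) -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- Unix domain socket used between the DataNode and local clients;
       example path only. -->
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```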
02-03-2017
10:37 PM
2 Kudos
@Maxwell Flanders The rule of thumb varies with your specific use case. How many users are connecting concurrently? Regardless, 256 MB is very low. Bump it to at least 4 GB and see if this resolves your issue: change -Xmx256m to -Xmx4096m. I was able to find recommended values for HiveServer2. Use this as a guideline:
| Concurrent connections | Recommended HiveServer2 heap |
|---|---|
| Up to 20 | 6 GB |
| Up to 10 | 4 GB |
| Single connection | 2 GB |