Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4098 | 10-18-2017 10:19 PM |
| | 4345 | 10-18-2017 09:51 PM |
| | 14851 | 09-21-2017 01:35 PM |
| | 1840 | 08-04-2017 02:00 PM |
| | 2424 | 07-31-2017 03:02 PM |
02-08-2017
06:20 AM
What is the value of hbase.use.dynamic.jars? If it is false, you need to set it to true.
02-08-2017
06:17 AM
@Ashok Kumar BM I deleted my earlier comment because I didn't realize that patch was old. While this should work, is it possible for you to add the jar to HBASE_CLASSPATH and restart the HBase service? (Not something you should be doing normally, but since there is an issue, can you try this?) HBASE_CLASSPATH=<path to your custom jar on the local file system on all region servers>
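A minimal sketch of what that could look like in hbase-env.sh on each region server (the jar path below is only a placeholder, not your actual path):

```bash
# Hypothetical example: append the custom jar to HBASE_CLASSPATH in hbase-env.sh
# on every region server, then restart the HBase service.
export HBASE_CLASSPATH="$HBASE_CLASSPATH:/opt/custom/lib/my-coprocessor.jar"
```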
02-08-2017
05:31 AM
1 Kudo
@Ashok Kumar BM Your jar file is not in the classpath. What is the value of the property hbase.dynamic.jars.dir? You are not supposed to add your jar to the master server, but to the location pointed to by this property. That's your problem.
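As a rough sketch (the HDFS path below is only an assumed value of hbase.dynamic.jars.dir; check hbase-site.xml for the real one):

```bash
# Hypothetical example: copy the custom jar to the HDFS directory that
# hbase.dynamic.jars.dir points to, so HBase can load it dynamically.
hdfs dfs -mkdir -p /apps/hbase/data/lib
hdfs dfs -put /tmp/my-coprocessor.jar /apps/hbase/data/lib/
```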
02-08-2017
05:06 AM
The balancer is not for a single node; it is for balancing load across the cluster. For balancing load among different disks on the same node, a new disk balancer will be available in Hadoop 3.0, as the link Artem shared shows. There is not much you can do here beyond "don't worry, Hadoop is smart enough to know that a particular disk doesn't have any more space, and it will find another disk" 🙂
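If you do later move to Hadoop 3.x, a rough sketch of the new disk balancer workflow (the hostname is a placeholder):

```bash
# Hypothetical Hadoop 3.x usage: generate an intra-node balancing plan for one
# DataNode, then execute the plan file that the -plan step reports.
hdfs diskbalancer -plan datanode1.example.com
hdfs diskbalancer -execute <path-to-generated-plan>.plan.json
```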
02-08-2017
03:40 AM
1 Kudo
Once you have balanced the cluster and you start seeing your regions move to other nodes until it's reasonably balanced, then we can decide how to move forward. One option is to increase the global memstore fraction to 0.4 and the block cache to 0.4, but hold off on that for now until your cluster is balanced. To run the balancer from the shell, run the "balancer" command. To see whether the balancer is enabled, type "balance_switch"; it should give you true. If it's false, run "balance_switch true". See details at the following link under "hbase surgery": https://learnhbase.wordpress.com/2013/03/02/hbase-shell-commands/
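For example, a quick sketch of doing both in one go from the command line:

```bash
# Hypothetical session: enable the HBase balancer and trigger a balancing run.
hbase shell <<'EOF'
balance_switch true
balancer
EOF
```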
02-07-2017
08:10 PM
1 Kudo
@Mark Wallace Can you periodically run uncacheTable("table name") followed by cacheTable("table name")? Both are inherited by HiveContext from SQLContext. Put the code you have above in a method, add the uncache line, and then simply call the method every hour. I am sorry if I am missing something.
02-07-2017
07:53 PM
@Mark Wallace Can you please provide more details?
02-07-2017
07:38 PM
A couple of questions. Assuming the load balancer was not enabled before, did you have 6200 regions per region server across the cluster, or are there a couple of offending nodes while the rest are okay? What is the size of your block cache (hfile.block.cache.size)? The sum of block cache and memstore should not be more than 0.8. A memstore fraction of 0.25 is actually on the lower end, which suggests you are tuning your block cache to a higher value; otherwise, the 0.25 memstore may need to be increased. Your global memstore fraction is 0.25, which means your memstore total should not go beyond 16 GB x 0.25 = 4 GB. But with 6200 regions x 128 MB each = 793,600 MB ≈ 793.6 GB, this is pretty messed up. You need to bring that region count down to a manageable value before you look into anything else. 8 GB for the Master is too much, but for now, focus on reducing the region count.
02-07-2017
03:43 PM
It is point in time (the time at which distcp runs). It can be automated using scripts. It is replication because you are replicating data; you are confusing replication with real-time replication. Replication doesn't have to be real time, and it is physically impossible for every change in one cluster (or any other database) to be instantly reflected in the other. I'll give you the example of my HBase use case. We were using active-active replication. But even then, we knew there might be a situation where data is written to one data center, a power failure occurs while that data is being replicated to the remote data center, and some data is not replicated (say, up to 10 seconds of data). The only other way to make sure this does not happen is to "ack" only when data has been written to the remote data center. This slows down every write, and we had tens of thousands of writes per second. So you have to make a choice: if you want 100 percent sync, you have to ack every single record being written, slowing down all your writes; or you do asynchronous replication, which works 99.99% of the time, but in case of network issues between the two data centers you know some data will occasionally not be replicated right away. There is absolutely nothing technology can do about this. It is simple physics.
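For instance, a minimal sketch of a scripted, scheduled distcp run (cluster names and paths are placeholders):

```bash
#!/bin/bash
# Hypothetical point-in-time replication: copy the source path to the remote
# cluster. -update copies only changed files; -delete removes files that no
# longer exist on the source. Schedule this script via cron, e.g. nightly.
hadoop distcp -update -delete \
  hdfs://prod-nn:8020/data/warehouse \
  hdfs://dr-nn:8020/data/warehouse
```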
02-07-2017
03:30 PM
1 Kudo
See my answers inline below:

---> So by replication do you mean distcp copy across two clusters - but distcp copies are not real time - so can they be called replication?

Yes. Replication can be done either with distcp or, if you are streaming data using an ingestion tool like NiFi, by sending data to both clusters (active and backup) at the same time, in real time. In practice, and because of physics, you cannot have 100 percent sync between two clusters, so distcp is almost always the way to go.

---> I have read that Flume can be used to copy from a source to two different clusters - but even such a configuration would exist outside HDFS. Is this what you meant by replication?

I would prefer NiFi over Flume, but by replication my idea is more about using distcp. Two data centers that may be 1000 miles apart cannot be in sync every single second, and the cost and complexity of trying to achieve that is also very high. I have seen some very low-latency telco use cases, and those are among the very few where active-active is justified (remember, very complex). Highly recommend you read this.

---> I read that Hive can be backed up using snapshots in an incremental way. That is, take a snapshot of Hive data at one point in time and from then on (to another point in time) take snapshots and use the difference feature to get the incrementals between the current snapshot and the previous one. So data can be recovered by loading the full snapshot and the incrementals to a point in time (like RDBMS recovery). Is this workable?

I am not aware of this method. There are two things you need to do for Hive replication: a. replicate the metadata (use MySQL replication techniques, or whichever metadata database you are using), and b. use distcp to replicate the HDFS files.
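A rough sketch of those two steps, assuming a MySQL metastore (database name, hosts, and paths are placeholders):

```bash
#!/bin/bash
# Hypothetical Hive backup: (a) dump the metastore database, then
# (b) replicate the warehouse files to the backup cluster with distcp.
mysqldump --single-transaction hive_metastore > /backups/hive_metastore.sql
hadoop distcp -update \
  hdfs://prod-nn:8020/apps/hive/warehouse \
  hdfs://backup-nn:8020/apps/hive/warehouse
```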