Member since
02-12-2016
102
Posts
117
Kudos Received
8
Solutions
My Accepted Solutions
Views | Posted
---|---
9764 | 03-15-2016 06:36 AM
10865 | 03-12-2016 10:04 AM
2033 | 03-12-2016 08:14 AM
560 | 03-04-2016 02:36 PM
974 | 02-19-2016 10:59 AM
03-29-2016
07:36 AM
@Emily Sharpe, if the original question has been answered, then please accept the best answer.
03-18-2016
06:59 PM
2 Kudos
@Emily Sharpe, please refer to the links below: http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ http://stackoverflow.com/questions/30571839/region-splitting-in-hbase
03-15-2016
07:25 PM
@Artem Ervits, I was asking about both HDFS and HBase, and I got the required answer. Thanks for your suggestion.
03-15-2016
06:35 PM
@Mayank Shekhar, thanks for sharing this information.
03-15-2016
08:53 AM
@Emily Sharpe, please review answers to your questions, and if they are acceptable, mark them as "accepted" so the responder can get credit. Otherwise, ask clarifying questions in comments so you can get your question answered. Thanks.
03-15-2016
06:36 AM
6 Kudos
@Emily Sharpe,

Managed Splitting

Usually HBase handles the splitting of regions automatically: once a region reaches the configured maximum size, it is split into two halves, which can then start taking on more data and grow from there. This is the default behavior and is sufficient for the majority of use cases.

There is one known problematic scenario, though, that can cause what is called a split/compaction storm: when your regions grow at roughly the same rate, eventually they all need to be split at about the same time, causing a large spike in disk I/O because of the compactions required to rewrite the split regions.

Rather than relying on HBase to handle the splitting, you can turn it off and manually invoke the split and major_compact commands. This is accomplished by setting hbase.hregion.max.filesize, either for the entire cluster or at the column-family level when defining your table schema, to a very high number. Setting it to Long.MAX_VALUE is not recommended in case the manual splits fail to run. It is better to set this value to a reasonable upper boundary, such as 100 GB (which would result in a roughly one-hour major compaction if triggered).

The advantage of running the split and compact commands manually is that you can time-control them. Running them staggered across all regions spreads the I/O load as much as possible and avoids any split/compaction storm. You will need to implement a client that uses the administrative API to call the split() and majorCompact() methods. Alternatively, you can use the shell to invoke the commands interactively, or script their invocation using cron, for instance. Also see the RegionSplitter utility (added in version 0.90.2), discussed shortly, for another way to split existing regions: it has a rolling split feature you can use to carefully split the existing regions while waiting long enough for the involved compactions to complete (see the -r and -o command-line options).

An additional advantage of managing the splits manually is that you have better control over which regions are available at any time. This is good in the rare case that you have to do very low-level debugging, for example, to see why a certain region had problems. With automated splits, by the time you want to check into a specific region it may already have been replaced with two daughter regions. These regions have new names, which makes tracing the evolution of the original region over longer periods of time much more difficult.

Region Hotspotting

Using the metrics, you can determine whether you are dealing with a write pattern that is causing a specific region to run hot. If this is the case, you may need to salt the keys, or use random keys, to distribute the load evenly across all servers. The only way to alleviate the situation is to manually split a hot region into one or more new regions at exact boundaries; this divides the region's load over multiple region servers. As you split a region, you can specify a split key, that is, the row key at which the given region is split in two. You can specify any row key within that region, so you are also able to generate halves that are completely different in size. This helps only when you are not dealing with completely sequential key ranges, because those are always going to hit one region for a considerable amount of time.

Table Hotspotting

Sometimes an existing table with many regions is not distributed well; in other words, most of its regions are located on the same region server. This means that, although you insert data with random keys, you still load one region server much more often than the others. You can use the move() function from the HBase Shell, or use the HBaseAdmin class, to explicitly move that server's table regions to other servers. Alternatively, you can use the unassign() method or shell command to simply remove a region of the affected table from the current server. The master will immediately deploy it on another available server.

Presplitting Regions

Managing the splits is useful to tightly control when load is going to increase on your cluster. You still face the problem that when initially loading a table, you need to split the regions rather often, since you usually start out with a single region per table. Growing this single region to a very large size is not recommended; therefore, it is better to start with a larger number of regions right from the start. This is done by presplitting the regions of an existing table, or by creating a table with the required number of regions.

The createTable() method of the administrative API, as well as the shell's create command, both take a list of split keys, which can be used to presplit a table when it is created. HBase also ships with a utility called RegionSplitter, which you can use to create a presplit table. Starting it without a parameter shows the usage information:

$ ./bin/hbase org.apache.hadoop.hbase.util.RegionSplitter
usage: RegionSplitter
  -c       Create a new table with a pre-split number of regions
  -D       Override HBase Configuration Settings
  -f       Column Families to create with new table. Required with -c
  -h       Print this usage help
  -o       Max outstanding splits that have unfinished major compactions
  -r       Perform a rolling split of an existing region
  --risky  Skip verification steps to complete quickly. STRONGLY DISCOURAGED for production systems.

By default, it uses the MD5StringSplit class to partition the row keys into ranges. You can define your own algorithm by implementing the provided SplitAlgorithm interface and handing it to the utility using the -D split.algorithm= parameter. An example of using the supplied split algorithm class to create a presplit table:

$ ./bin/hbase org.apache.hadoop.hbase.util.RegionSplitter \
  -c 10 testtable -f colfam1

In the web UI of the master, you can click on the link with the newly created table name to see the generated regions:

testtable,,1309766006467.c0937d09f1da31f2a6c2950537a61093.
testtable,0ccccccc,1309766006467.83a0a6a949a6150c5680f39695450d8a.
testtable,19999998,1309766006467.1eba79c27eb9d5c2f89c3571f0d87a92.
testtable,26666664,1309766006467.7882cd50eb22652849491c08a6180258.
testtable,33333330,1309766006467.cef2853e36bd250c1b9324bac03e4bc9.
testtable,3ffffffc,1309766006467.00365940761359fee14d41db6a73ffc5.
testtable,4cccccc8,1309766006467.f0c5045c304c2ff5338be27e81ae698e.
testtable,59999994,1309766006467.2d854f337aa6c09232409f0ba1d4964b.
testtable,66666660,1309766006467.b1ec9df9fd90d91f54cb18da5edc2581.
testtable,7333332c,1309766006468.42e179b78663b64401079a8601d9bd06.

Or you can use the shell's create command:

hbase(main):001:0> create 'testtable', 'colfam1', \
  { SPLITS => ['row-100', 'row-200', 'row-300', 'row-400'] }
0 row(s) in 1.1670 seconds

This generates the following regions:

testtable,,1309768272330.37377c4ab0a944a326ba8b6596a29396.
testtable,row-100,1309768272331.e6092cc777f58a08c61bf081aba14916.
testtable,row-200,1309768272331.63c9630a79b37ebce7b58cde0235dfe5.
testtable,row-300,1309768272331.eead6ad2ff3303ffe6a3126e0df3ff7a.
testtable,row-400,1309768272331.2bee7417fa67e4ac8c7210ce7325708e.

As for the number of presplit regions to use, you can start low, with 10 presplit regions per server, and watch as data grows over time. It is better to err on the side of too few regions and use a rolling split later, as having too many regions is usually not ideal in regard to overall cluster performance.

Alternatively, you can determine the number of presplit regions based on the largest store file in your region: with a growing data size, this will get larger over time, and you want the largest region to be just big enough so that it is not selected for major compaction, or you might face the compaction storms mentioned earlier.

If you presplit your regions too thin, you can increase the major compaction interval by increasing the value of the hbase.hregion.majorcompaction configuration property. If your data size grows too large, use the RegionSplitter utility to perform a network-I/O-safe rolling split of all regions.

Use of manual splits and presplit regions is an advanced concept that requires a lot of planning and careful monitoring. On the other hand, it can help you avoid the compaction storms that can happen with uniform data growth, and it lets you shed the load of hot regions by splitting them manually.

Hope this helps you understand how to manually manage HBase region splits.
03-13-2016
08:53 AM
@Mayank Shekhar, thanks for sharing this information and link.
03-12-2016
11:27 AM
1 Kudo
@Neeraj Sabharwal, thanks for the quick reply.
03-12-2016
11:23 AM
2 Kudos
Hi, does anyone know whether it is possible to do an incremental import using Sqoop? If yes, how?
Labels:
- Apache Sqoop
03-12-2016
10:54 AM
1 Kudo
@Neeraj Sabharwal, got the required answer, choosing the best answer and closing this thread.
03-12-2016
10:47 AM
3 Kudos
I got the answer below: Sometimes there is data in a tuple or bag, and if we want to remove that level of nesting, the FLATTEN modifier in Pig can be used. FLATTEN un-nests bags and tuples. For tuples, the FLATTEN operator substitutes the fields of a tuple in place of the tuple, whereas un-nesting bags is a little more complex because it requires creating new tuples.
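As a hedged sketch (the relation name, input path, and schema below are invented for illustration), FLATTEN on a tuple field promotes its fields to top-level columns:

```
-- A has schema {t: (x:int, y:int)}; 'data' is a hypothetical input file
A = LOAD 'data' AS (t:tuple(x:int, y:int));
-- FLATTEN substitutes the tuple's fields in its place,
-- giving B the schema {t::x: int, t::y: int}
B = FOREACH A GENERATE FLATTEN(t);
```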
03-12-2016
10:38 AM
1 Kudo
@Neeraj Sabharwal, thanks for the quick reply.
03-12-2016
10:23 AM
4 Kudos
Hi, can anyone explain the use of FLATTEN in Pig?
Labels:
- Apache Pig
03-12-2016
10:04 AM
2 Kudos
I got the answer below: In an SMB join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge-sort join is performed. The Sort Merge Bucket (SMB) join in Hive is mainly used because it places no limit on file, partition, or table size for the join. SMB join is best used when the tables are large. In an SMB join the columns are bucketed and sorted using the join columns, and all tables should have the same number of buckets.
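As a hedged sketch (the table names, columns, and bucket count are invented; the settings shown are the ones commonly used to enable SMB joins), the required layout looks like:

```
-- Both tables bucketed AND sorted on the join key, same bucket count
CREATE TABLE orders (id INT, amount DOUBLE)
  CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;
CREATE TABLE customers (id INT, name STRING)
  CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;

SET hive.auto.convert.sortmerge.join = true;
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;

SELECT o.id, c.name
FROM orders o JOIN customers c ON o.id = c.id;
```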
03-12-2016
10:01 AM
@Artem Ervits, thanks for the reply and the link.
03-12-2016
09:19 AM
3 Kudos
Hi, can anyone explain what a Sort Merge Bucket (SMB) join in Hive is, and when it is used?
Labels:
- Apache Hive
03-12-2016
09:05 AM
1 Kudo
@Artem Ervits, thanks for the reply and for sharing the link.
03-12-2016
09:04 AM
1 Kudo
@Artem Ervits, thanks for sharing this link.
03-12-2016
08:14 AM
3 Kudos
I got the answer below: Apache Flume can be used with HBase through one of two HBase sinks.

HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters as well as the new HBase IPC introduced in HBase 0.96.

AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBaseSink because it can make non-blocking calls to HBase.

How HBaseSink works: in HBaseSink, a Flume event is converted into HBase Increments or Puts. The serializer implements HBaseEventSerializer and is instantiated when the sink starts. For every event, the sink calls the initialize method in the serializer, which then translates the Flume event into HBase Increments and Puts to be sent to the HBase cluster.

How AsyncHBaseSink works: AsyncHBaseSink uses AsyncHBaseEventSerializer. The initialize method is called only once by the sink, when it starts. The sink invokes the setEvent method and then calls the getIncrements and getActions methods, similar to HBaseSink. When the sink stops, the cleanUp method is called by the serializer.
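As a hedged sketch of wiring the HBaseSink into an agent (the agent name, channel name, table, and column family are invented; a real setup also needs a source and channel defined, and the table must already exist in HBase):

```
# flume.conf fragment (illustrative names)
agent.sinks = hbase-sink
agent.sinks.hbase-sink.type = org.apache.flume.sink.hbase.HBaseSink
agent.sinks.hbase-sink.table = flume_events
agent.sinks.hbase-sink.columnFamily = cf
agent.sinks.hbase-sink.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
agent.sinks.hbase-sink.channel = mem-channel
```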
03-12-2016
08:08 AM
1 Kudo
@Rohan Pednekar, thanks for sharing this link.
03-12-2016
08:04 AM
3 Kudos
Hi, does anyone know whether Apache Flume provides support for third-party plug-ins? If yes, how can we use them with Flume?
Labels:
- Apache Flume
03-12-2016
07:58 AM
2 Kudos
Hi, can anyone please explain whether Flume can be used with HBase and how to use it, possibly with an example to help me understand?
Labels:
- Apache Flume
- Apache HBase
03-07-2016
05:58 AM
@Neeraj Sabharwal, got the required answer, thus closing this thread.
03-05-2016
02:59 PM
1 Kudo
@Neeraj Sabharwal, thanks for the quick reply.
03-05-2016
02:56 PM
1 Kudo
Hi, can anyone please advise me on how to integrate Informatica with Hadoop?
It would be great if someone currently working on this could reply.
Thanks in advance for the help.
Labels:
- Apache Hadoop
03-04-2016
02:36 PM
2 Kudos
I have received the answer below: Control the start action by using a decision control node as the default start action.
Using a case in the decision control node, it is possible to divert to the needed action based on your parameter. I want to know whether it works.
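As a hedged sketch in workflow XML (the node names and the startAction property are invented for illustration), the decision-node idea looks like:

```
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="route"/>
  <decision name="route">
    <switch>
      <case to="actionA">${startAction eq "A"}</case>
      <case to="actionB">${startAction eq "B"}</case>
      <default to="actionA"/>
    </switch>
  </decision>
  <!-- actionA, actionB, end, and kill nodes omitted for brevity -->
</workflow-app>
```

The workflow always starts at the decision node, and the EL predicates route to whichever action the submitted startAction property selects.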
03-04-2016
10:57 AM
1 Kudo
Hi, I would like to know how to load a .csv file from a spool dir into HDFS using Flume and read its contents. Can anyone help me with this?
Labels:
- Apache Flume
03-04-2016
10:35 AM
2 Kudos
Hi, is it possible to pass the Oozie action name from a property file instead of specifying it in the XML itself? Let me know if anyone has tried this.
Labels:
- Apache Oozie
02-24-2016
09:58 AM
1 Kudo
@Pranshu Pranshu, you can use the "setrep" command to set the replication factor for files and directories:

Usage: hadoop fs -setrep [-R] [-w] <numReplicas> <path>

Changes the replication factor of a file. If path is a directory, then the command recursively changes the replication factor of all files under the directory tree rooted at path.

Options:
- The -w flag requests that the command wait for the replication to complete. This can potentially take a very long time.
- The -R flag is accepted for backwards compatibility. It has no effect.

Example: to set the replication of an individual file to 3, you can use the command below:

./bin/hadoop dfs -setrep -w 3 /path/to/file

You can also do this recursively. To change the replication of the entire HDFS to 3, you can use the command below:

./bin/hadoop dfs -setrep -R -w 3 /

Exit code: returns 0 on success and -1 on error.

Hope this helps you solve the problem.
02-23-2016
03:06 PM
1 Kudo
@Ram D, see this: http://docs.hortonworks.com/HDPDocuments/Ambari-2.1.0.0/bk_Ambari_Users_Guide/content/_how_to_configure_namenode_high_availability.html (Ambari 2.1.0, but it is not different from 2.2). You can also use this for NameNode and ResourceManager high availability and more: http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_Ambari_Users_Guide/content/ch_managing_service_high_availability.html