Member since: 02-12-2016
Posts: 102
Kudos Received: 117
Solutions: 8
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 13714 | 03-15-2016 06:36 AM |
| | 15741 | 03-12-2016 10:04 AM |
| | 3366 | 03-12-2016 08:14 AM |
| | 1034 | 03-04-2016 02:36 PM |
| | 1927 | 02-19-2016 10:59 AM |
03-29-2016
07:36 AM
@Emily Sharpe, if the original question is answered, then please accept the best answer.
03-18-2016
06:59 PM
2 Kudos
@Emily Sharpe, please refer to the links below:
http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
http://stackoverflow.com/questions/30571839/region-splitting-in-hbase
03-15-2016
07:25 PM
@Artem Ervits, I was referring to both HDFS and HBase, and I got the required answer. But thanks for your suggestion.
03-15-2016
06:35 PM
@Mayank Shekhar, thanks for sharing this information.
03-15-2016
08:53 AM
@Emily Sharpe, please review answers to your questions, and if they are acceptable, mark them as "accepted" so the responder can get credit. Otherwise, ask clarifying questions in comments so you can get your question answered. Thanks.
03-15-2016
06:36 AM
6 Kudos
@Emily Sharpe,

**Managed Splitting**

Usually HBase handles the splitting of regions automatically: once the regions reach the configured maximum size, they are split into two halves, which then can start taking on more data and grow from there. This is the default behavior and is sufficient for the majority of use cases.

There is one known problematic scenario, though, that can cause what is called split/compaction storms: when you grow your regions at roughly the same rate, eventually they all need to be split at about the same time, causing a large spike in disk I/O because of the required compactions to rewrite the split regions.

Rather than relying on HBase to handle the splitting, you can turn it off and manually invoke the split and major_compact commands. This is accomplished by setting hbase.hregion.max.filesize, for the entire cluster or at the column family level when defining your table schema, to a very high number. Setting it to Long.MAX_VALUE is not recommended in case the manual splits fail to run. It is better to set this value to a reasonable upper boundary, such as 100 GB (which would result in a one-hour major compaction if triggered).

The advantage of running the commands to split and compact your regions manually is that you can time-control them. Running them staggered across all regions spreads the I/O load as much as possible, avoiding any split/compaction storm. You will need to implement a client that uses the administrative API to call the split() and majorCompact() methods. Alternatively, you can use the shell to invoke the commands interactively, or script their invocation using cron, for instance. Also see the RegionSplitter (added in version 0.90.2), discussed shortly, for another way to split existing regions: it has a rolling split feature you can use to carefully split the existing regions while waiting long enough for the involved compactions to complete (see the -r and -o command-line options).

An additional advantage to managing the splits manually is that you have better control over which regions are available at any time. This is good in the rare case that you have to do very low-level debugging, to, for example, see why a certain region had problems.
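To make the time-controlled, staggered approach concrete, here is a minimal sketch of how manual split/compaction work could be spread over a time window. The function name and the region names are illustrative assumptions; the actual commands would be issued through the administrative API's split() and majorCompact() methods, or the shell's split and major_compact commands.

```python
# Sketch: spread manual split/compaction work evenly over a time window
# to avoid the split/compaction storms described above. The scheduling
# logic is pure; actually issuing the commands (for example, piping
# "split 'region-name'" into `hbase shell`) needs a running cluster and
# is only indicated in comments.

def stagger_schedule(regions, window_minutes):
    """Return (delay_in_minutes, region) pairs, evenly staggered."""
    if not regions:
        return []
    step = window_minutes / len(regions)
    return [(round(i * step, 2), r) for i, r in enumerate(regions)]

schedule = stagger_schedule(["region-a", "region-b", "region-c", "region-d"], 60)
for delay, region in schedule:
    # A real client would sleep until `delay`, then call the admin API's
    # split() followed by majorCompact() for that region.
    print(f"t+{delay} min: split + major_compact {region}")
```

With four regions and a 60-minute window, the compaction I/O for each region lands 15 minutes apart instead of all at once.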
With automated splits it might happen that by the time you want to check into a specific region, it has already been replaced with two daughter regions. These regions have new names, and tracing the evolution of the original region over longer periods of time makes it much more difficult to find the information you require.

**Region Hotspotting**

Using the metrics, you can determine if you are dealing with a write pattern that is causing a specific region to run hot. If this is the case, you may need to salt the keys, or use random keys, to distribute the load across all servers evenly.

The only way to alleviate the situation is to manually split a hot region into one or more new regions, at exact boundaries. This will divide the region's load over multiple region servers. As you split a region you can specify a split key, that is, the row key where you split the given region into two. You can specify any row key within that region, so you are also able to generate halves that are completely different in size.

This might help only when you are not dealing with completely sequential key ranges, because those are always going to hit one region for a considerable amount of time.

**Table Hotspotting**

Sometimes an existing table with many regions is not distributed well; in other words, most of its regions are located on the same region server. This means that, although you insert data with random keys, you still load one region server much more often than the others. You can use the move() function from the HBase Shell, or use the HBaseAdmin class, to explicitly move the server's table regions to other servers. Alternatively, you can use the unassign() method or shell command to simply remove a region of the affected table from the current server. The master will immediately deploy it on another available server.

**Presplitting Regions**

Managing the splits is useful to tightly control when load is going to increase on your cluster. You still face the problem that when initially loading a table, you need to split the regions rather often, since you usually start out with a single region per table. Growing this single region to a very large size is not recommended; therefore, it is better to start with a larger number of regions right from the start. This is done by presplitting the regions of an existing table, or by creating a table with the required number of regions.

The createTable() method of the administrative API, as well as the shell's create command, both take a list of split keys, which can be used to presplit a table when it is created.

HBase also ships with a utility called RegionSplitter, which you can use to create a presplit table. Starting it without a parameter will show usage information:

```
$ ./bin/hbase org.apache.hadoop.hbase.util.RegionSplitter
usage: RegionSplitter
 -c        Create a new table with a pre-split number of regions
 -D        Override HBase Configuration Settings
 -f        Column Families to create with new table. Required with -c
 -h        Print this usage help
 -o        Max outstanding splits that have unfinished major compactions
 -r        Perform a rolling split of an existing region
 --risky   Skip verification steps to complete quickly. STRONGLY
           DISCOURAGED for production systems.
```

By default, it uses the MD5StringSplit class to partition the row keys into ranges. You can define your own algorithm by implementing the SplitAlgorithm interface provided, and handing it into the utility using the -D split.algorithm= parameter. An example of using the supplied split algorithm class and creating a presplit table is:

```
$ ./bin/hbase org.apache.hadoop.hbase.util.RegionSplitter \
  -c 10 testtable -f colfam1
```

In the web UI of the master, you can click on the link with the newly created table name to see the generated regions:

```
testtable,,1309766006467.c0937d09f1da31f2a6c2950537a61093.
testtable,0ccccccc,1309766006467.83a0a6a949a6150c5680f39695450d8a.
testtable,19999998,1309766006467.1eba79c27eb9d5c2f89c3571f0d87a92.
testtable,26666664,1309766006467.7882cd50eb22652849491c08a6180258.
testtable,33333330,1309766006467.cef2853e36bd250c1b9324bac03e4bc9.
testtable,3ffffffc,1309766006467.00365940761359fee14d41db6a73ffc5.
testtable,4cccccc8,1309766006467.f0c5045c304c2ff5338be27e81ae698e.
testtable,59999994,1309766006467.2d854f337aa6c09232409f0ba1d4964b.
testtable,66666660,1309766006467.b1ec9df9fd90d91f54cb18da5edc2581.
testtable,7333332c,1309766006468.42e179b78663b64401079a8601d9bd06.
```

Or you can use the shell's create command:

```
hbase(main):001:0> create 'testtable', 'colfam1', \
  { SPLITS => ['row-100', 'row-200', 'row-300', 'row-400'] }
0 row(s) in 1.1670 seconds
```

This generates the following regions:

```
testtable,,1309768272330.37377c4ab0a944a326ba8b6596a29396.
testtable,row-100,1309768272331.e6092cc777f58a08c61bf081aba14916.
testtable,row-200,1309768272331.63c9630a79b37ebce7b58cde0235dfe5.
testtable,row-300,1309768272331.eead6ad2ff3303ffe6a3126e0df3ff7a.
testtable,row-400,1309768272331.2bee7417fa67e4ac8c7210ce7325708e.
```

As for the number of presplit regions to use, you can start low with 10 presplit regions per server and watch as data grows over time. It is better to err on the side of too few regions and use a rolling split later, as having too many regions is usually not ideal in regard to overall cluster performance.

Alternatively, you can determine how many presplit regions to use based on the largest store file in your region: with a growing data size, this will get larger over time, and you want the largest region to be just big enough so that it is not selected for major compaction, or you might face the mentioned compaction storms.

If you presplit your regions too thin, you can increase the major compaction interval by increasing the value of the hbase.hregion.majorcompaction configuration property. If your data size grows too large, use the RegionSplitter utility to perform a network-I/O-safe rolling split of all regions.

Use of manual splits and presplit regions is an advanced concept that requires a lot of planning and careful monitoring. On the other hand, it can help you avoid the compaction storms that can happen with uniform data growth, or shed load from hot regions by splitting them manually.

Hope this helps you understand how to manually manage HBase region splits.
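As a companion to the presplitting discussion, here is a small sketch of how evenly spaced split keys could be computed for hex-encoded row keys, in the spirit of RegionSplitter's string-based split algorithms. The eight-character key width and the use of the full hex keyspace are assumptions for illustration, so the values will not exactly match the region boundaries shown above; keys like these could be handed to the shell's create command via SPLITS or to createTable().

```python
# Sketch: compute num_regions - 1 boundary keys that partition the hex
# keyspace [00000000, ffffffff] into num_regions roughly equal ranges.
# NOTE: the width and keyspace bounds are illustrative assumptions, not
# RegionSplitter's exact defaults.

def hex_split_keys(num_regions, width=8):
    """Return num_regions - 1 evenly spaced, fixed-width hex split keys."""
    upper = 16 ** width  # size of the hex keyspace for `width` digits
    return [format(i * upper // num_regions, f"0{width}x")
            for i in range(1, num_regions)]

print(hex_split_keys(4))  # → ['40000000', '80000000', 'c0000000']
```

For a table with four presplit regions, these three keys would become the region start keys after the first (empty) one, mirroring the boundary layout of the generated-regions listings above.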
03-13-2016
08:53 AM
@Mayank Shekhar, thanks for sharing this information and link.
03-12-2016
11:27 AM
1 Kudo
@Neeraj Sabharwal, thanks for the quick reply.
03-12-2016
11:23 AM
2 Kudos
Hi, does anyone know if it is possible to do an incremental import using Sqoop? If yes, how?
Labels:
- Apache Sqoop
03-12-2016
10:54 AM
1 Kudo
@Neeraj Sabharwal, I got the required answer; accepting the best answer and closing this thread.