Member since
02-12-2016
102
Posts
117
Kudos Received
8
Solutions
My Accepted Solutions
Views | Posted
---|---
9764 | 03-15-2016 06:36 AM
10865 | 03-12-2016 10:04 AM
2033 | 03-12-2016 08:14 AM
560 | 03-04-2016 02:36 PM
974 | 02-19-2016 10:59 AM
03-29-2016
07:36 AM
@Emily Sharpe, if the original question has been answered, then please accept the best answer.
03-18-2016
06:59 PM
2 Kudos
@Emily Sharpe, please refer to the links below: http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ http://stackoverflow.com/questions/30571839/region-splitting-in-hbase
03-15-2016
07:25 PM
@Artem Ervits, I was asking about both HDFS and HBase, and I got the required answer. Thanks for your suggestion.
03-15-2016
06:35 PM
@Mayank Shekhar, thanks for sharing this information.
03-15-2016
08:53 AM
@Emily Sharpe, please review answers to your questions, and if they are acceptable, mark them as "accepted" so the responder can get credit. Otherwise, ask clarifying questions in comments so you can get your question answered. Thanks.
03-15-2016
06:36 AM
6 Kudos
@Emily Sharpe,

Managed Splitting

Usually HBase handles the splitting of regions automatically: once a region reaches the configured maximum size, it is split into two halves, which can then start taking on more data and grow from there. This is the default behavior and is sufficient for the majority of use cases.

There is one known problematic scenario, though, that can cause what is called a split/compaction storm: when your regions grow at roughly the same rate, eventually they all need to be split at about the same time, causing a large spike in disk I/O because of the compactions required to rewrite the split regions.

Rather than relying on HBase to handle the splitting, you can turn it off and manually invoke the split and major_compact commands. This is accomplished by setting hbase.hregion.max.filesize, either for the entire cluster or at the column-family level when defining your table schema, to a very high number. Setting it to Long.MAX_VALUE is not recommended in case the manual splits fail to run. It is better to set this value to a reasonable upper boundary, such as 100 GB (which would result in a roughly one-hour major compaction if triggered).

The advantage of running the split and compact commands manually is that you can time-control them. Running them staggered across all regions spreads the I/O load as much as possible and avoids any split/compaction storm. You will need to implement a client that uses the administrative API to call the split() and majorCompact() methods. Alternatively, you can use the shell to invoke the commands interactively, or script their invocation using cron, for instance. Also see the RegionSplitter utility (added in version 0.90.2), discussed shortly, for another way to split existing regions: it has a rolling split feature you can use to carefully split the existing regions while waiting long enough for the involved compactions to complete (see the -r and -o command-line options).

An additional advantage of managing the splits manually is that you have better control over which regions are available at any time. This is good in the rare case that you have to do very low-level debugging, for example, to see why a certain region had problems. With automated splits, by the time you want to check into a specific region it may already have been replaced with two daughter regions. These regions have new names, which makes tracing the evolution of the original region over longer periods of time much more difficult.

Region Hotspotting

Using the metrics, you can determine whether you are dealing with a write pattern that is causing a specific region to run hot. If this is the case, you may need to salt the keys, or use random keys, to distribute the load evenly across all servers. The only way to alleviate the situation is to manually split a hot region into one or more new regions at exact boundaries; this divides the region's load over multiple region servers. As you split a region, you can specify a split key, that is, the row key at which the given region is split in two. You can specify any row key within that region, so you are also able to generate halves that are completely different in size. This helps only when you are not dealing with completely sequential key ranges, because those are always going to hit one region for a considerable amount of time.

Table Hotspotting

Sometimes an existing table with many regions is not distributed well; in other words, most of its regions are located on the same region server. This means that, although you insert data with random keys, you still load one region server much more often than the others. You can use the move() function from the HBase Shell, or use the HBaseAdmin class, to explicitly move that server's table regions to other servers. Alternatively, you can use the unassign() method or shell command to simply remove a region of the affected table from the current server. The master will immediately deploy it on another available server.

Presplitting Regions

Managing the splits is useful to tightly control when load is going to increase on your cluster. You still face the problem that when initially loading a table, you need to split the regions rather often, since you usually start out with a single region per table. Growing this single region to a very large size is not recommended; therefore, it is better to start with a larger number of regions right from the start. This is done by presplitting the regions of an existing table, or by creating a table with the required number of regions.

The createTable() method of the administrative API, as well as the shell's create command, both take a list of split keys, which can be used to presplit a table when it is created. HBase also ships with a utility called RegionSplitter, which you can use to create a presplit table. Starting it without a parameter shows the usage information:

$ ./bin/hbase org.apache.hadoop.hbase.util.RegionSplitter
usage: RegionSplitter
  -c       Create a new table with a pre-split number of regions
  -D       Override HBase Configuration Settings
  -f       Column Families to create with new table. Required with -c
  -h       Print this usage help
  -o       Max outstanding splits that have unfinished major compactions
  -r       Perform a rolling split of an existing region
  --risky  Skip verification steps to complete quickly. STRONGLY DISCOURAGED for production systems.

By default, it uses the MD5StringSplit class to partition the row keys into ranges. You can define your own algorithm by implementing the provided SplitAlgorithm interface and handing it to the utility using the -D split.algorithm= parameter. An example of using the supplied split algorithm class to create a presplit table:

$ ./bin/hbase org.apache.hadoop.hbase.util.RegionSplitter \
  -c 10 testtable -f colfam1

In the web UI of the master, you can click on the link with the newly created table name to see the generated regions:

testtable,,1309766006467.c0937d09f1da31f2a6c2950537a61093.
testtable,0ccccccc,1309766006467.83a0a6a949a6150c5680f39695450d8a.
testtable,19999998,1309766006467.1eba79c27eb9d5c2f89c3571f0d87a92.
testtable,26666664,1309766006467.7882cd50eb22652849491c08a6180258.
testtable,33333330,1309766006467.cef2853e36bd250c1b9324bac03e4bc9.
testtable,3ffffffc,1309766006467.00365940761359fee14d41db6a73ffc5.
testtable,4cccccc8,1309766006467.f0c5045c304c2ff5338be27e81ae698e.
testtable,59999994,1309766006467.2d854f337aa6c09232409f0ba1d4964b.
testtable,66666660,1309766006467.b1ec9df9fd90d91f54cb18da5edc2581.
testtable,7333332c,1309766006468.42e179b78663b64401079a8601d9bd06.

Or you can use the shell's create command:

hbase(main):001:0> create 'testtable', 'colfam1', \
  { SPLITS => ['row-100', 'row-200', 'row-300', 'row-400'] }
0 row(s) in 1.1670 seconds

This generates the following regions:

testtable,,1309768272330.37377c4ab0a944a326ba8b6596a29396.
testtable,row-100,1309768272331.e6092cc777f58a08c61bf081aba14916.
testtable,row-200,1309768272331.63c9630a79b37ebce7b58cde0235dfe5.
testtable,row-300,1309768272331.eead6ad2ff3303ffe6a3126e0df3ff7a.
testtable,row-400,1309768272331.2bee7417fa67e4ac8c7210ce7325708e.

As for the number of presplit regions to use, you can start low, with 10 presplit regions per server, and watch as data grows over time. It is better to err on the side of too few regions and use a rolling split later, as having too many regions is usually not ideal in regard to overall cluster performance.

Alternatively, you can determine the number of presplit regions based on the largest store file in your region: with a growing data size, this will get larger over time, and you want the largest region to be just big enough so that it is not selected for major compaction, or you might face the compaction storms mentioned earlier.

If you presplit your regions too thin, you can increase the major compaction interval by increasing the value of the hbase.hregion.majorcompaction configuration property. If your data size grows too large, use the RegionSplitter utility to perform a network-I/O-safe rolling split of all regions.

Use of manual splits and presplit regions is an advanced concept that requires a lot of planning and careful monitoring. On the other hand, it can help you avoid the compaction storms that can happen with uniform data growth, and it lets you shed the load of hot regions by splitting them manually.

Hope this helps you understand how to manually manage HBase region splits.
03-13-2016
08:53 AM
@Mayank Shekhar, thanks for sharing this information and link.
03-12-2016
11:27 AM
1 Kudo
@Neeraj Sabharwal, thanks for the quick reply.
03-12-2016
11:23 AM
2 Kudos
Hi, does anyone know whether it is possible to do an incremental import using Sqoop? If yes, how?
Labels:
- Apache Sqoop
03-12-2016
10:54 AM
1 Kudo
@Neeraj Sabharwal, got the required answer, choosing the best answer and closing this thread.
03-12-2016
10:47 AM
3 Kudos
I got the answer below: Sometimes there is data in a tuple or bag, and if we want to remove that level of nesting, the FLATTEN modifier in Pig can be used. FLATTEN un-nests bags and tuples. For tuples, the FLATTEN operator substitutes the fields of a tuple in place of the tuple, whereas un-nesting bags is a little more complex because it requires creating new tuples.
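As a hedged sketch (the relation name, input path, and schema below are invented for illustration), FLATTEN on a tuple field promotes its fields to top-level columns:

```
-- A has schema {t: (x:int, y:int)}; 'data' is a hypothetical input file
A = LOAD 'data' AS (t:tuple(x:int, y:int));
-- FLATTEN substitutes the tuple's fields in its place,
-- giving B the schema {t::x: int, t::y: int}
B = FOREACH A GENERATE FLATTEN(t);
```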
03-12-2016
10:38 AM
1 Kudo
@Neeraj Sabharwal, thanks for the quick reply.
03-12-2016
10:23 AM
4 Kudos
Hi, can anyone explain the use of FLATTEN in Pig?
Labels:
- Apache Pig
03-12-2016
10:04 AM
2 Kudos
I got the answer below: In an SMB join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge-sort join is performed. The Sort Merge Bucket (SMB) join in Hive is mainly used because it places no limit on file, partition, or table size for the join. SMB join is best used when the tables are large. In an SMB join the columns are bucketed and sorted using the join columns, and all tables should have the same number of buckets.
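As a hedged sketch (the table names, columns, and bucket count are invented; the settings shown are the ones commonly used to enable SMB joins), the required layout looks like:

```
-- Both tables bucketed AND sorted on the join key, same bucket count
CREATE TABLE orders (id INT, amount DOUBLE)
  CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;
CREATE TABLE customers (id INT, name STRING)
  CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;

SET hive.auto.convert.sortmerge.join = true;
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;

SELECT o.id, c.name
FROM orders o JOIN customers c ON o.id = c.id;
```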
03-12-2016
10:01 AM
@Artem Ervits, thanks for the reply and the link.
03-12-2016
09:19 AM
3 Kudos
Hi, can anyone explain what a Sort Merge Bucket (SMB) join in Hive is, and when it is used?
Labels:
- Apache Hive
03-12-2016
09:05 AM
1 Kudo
@Artem Ervits, thanks for the reply and for sharing the link.
03-12-2016
09:04 AM
1 Kudo
@Artem Ervits, thanks for sharing this link.
03-12-2016
08:14 AM
3 Kudos
I got the answer below: Apache Flume can be used with HBase through one of two HBase sinks.

HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters as well as the new HBase IPC introduced in HBase 0.96.

AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBaseSink because it can make non-blocking calls to HBase.

How HBaseSink works: in HBaseSink, a Flume event is converted into HBase Increments or Puts. The serializer implements HBaseEventSerializer and is instantiated when the sink starts. For every event, the sink calls the initialize method in the serializer, which then translates the Flume event into HBase Increments and Puts to be sent to the HBase cluster.

How AsyncHBaseSink works: AsyncHBaseSink uses AsyncHBaseEventSerializer. The initialize method is called only once by the sink, when it starts. The sink invokes the setEvent method and then calls the getIncrements and getActions methods, similar to HBaseSink. When the sink stops, the cleanUp method is called by the serializer.
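As a hedged sketch of wiring the HBaseSink into an agent (the agent name, channel name, table, and column family are invented; a real setup also needs a source and channel defined, and the table must already exist in HBase):

```
# flume.conf fragment (illustrative names)
agent.sinks = hbase-sink
agent.sinks.hbase-sink.type = org.apache.flume.sink.hbase.HBaseSink
agent.sinks.hbase-sink.table = flume_events
agent.sinks.hbase-sink.columnFamily = cf
agent.sinks.hbase-sink.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
agent.sinks.hbase-sink.channel = mem-channel
```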
03-12-2016
08:08 AM
1 Kudo
@Rohan Pednekar, thanks for sharing this link.
03-12-2016
08:04 AM
3 Kudos
Hi, does anyone know whether Apache Flume provides support for third-party plug-ins? If yes, how can we use them with Flume?
Labels:
- Apache Flume
03-12-2016
07:58 AM
2 Kudos
Hi, can anyone please explain whether Flume can be used with HBase and how to use it, possibly with an example to help me understand?
Labels:
- Apache Flume
- Apache HBase
03-07-2016
05:58 AM
@Neeraj Sabharwal, got the required answer, thus closing this thread.
03-05-2016
02:59 PM
1 Kudo
@Neeraj Sabharwal, thanks for the quick reply.
03-05-2016
02:56 PM
1 Kudo
Hi, can anyone please advise me on how to integrate Informatica with Hadoop?
It would be great if someone currently working on this could reply.
Thanks in advance for the help.
Labels:
- Apache Hadoop
03-04-2016
02:36 PM
2 Kudos
I have received the answer below: Control the start action by using a decision control node as the default start action.
Using a case in the decision control node, it is possible to divert to the needed action based on your parameter. I want to know whether it works.
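As a hedged sketch in workflow XML (the node names and the startAction property are invented for illustration), the decision-node idea looks like:

```
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="route"/>
  <decision name="route">
    <switch>
      <case to="actionA">${startAction eq "A"}</case>
      <case to="actionB">${startAction eq "B"}</case>
      <default to="actionA"/>
    </switch>
  </decision>
  <!-- actionA, actionB, end, and kill nodes omitted for brevity -->
</workflow-app>
```

The workflow always starts at the decision node, and the EL predicates route to whichever action the submitted startAction property selects.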
03-04-2016
10:57 AM
1 Kudo
Hi, I would like to know how to load a .csv file from a spool dir into HDFS using Flume and read its contents. Can anyone help me with this?
Labels:
- Apache Flume
03-04-2016
10:35 AM
2 Kudos
Hi, is it possible to pass the Oozie action name from a property file instead of specifying it in the XML itself? Let me know if anyone has tried this.
Labels:
- Apache Oozie
02-24-2016
09:58 AM
1 Kudo
@Pranshu Pranshu, you can use the "setrep" command to set the replication factor for files and directories:

Usage: hadoop fs -setrep [-R] [-w] <numReplicas> <path>

Changes the replication factor of a file. If path is a directory, then the command recursively changes the replication factor of all files under the directory tree rooted at path.

Options:
- The -w flag requests that the command wait for the replication to complete. This can potentially take a very long time.
- The -R flag is accepted for backwards compatibility. It has no effect.

Example: to set the replication of an individual file to 3, you can use the command below:

./bin/hadoop dfs -setrep -w 3 /path/to/file

You can also do this recursively. To change the replication of the entire HDFS to 3, you can use the command below:

./bin/hadoop dfs -setrep -R -w 3 /

Exit code: returns 0 on success and -1 on error.

Hope this helps you solve the problem.
02-23-2016
03:06 PM
1 Kudo
@Ram D, see this: http://docs.hortonworks.com/HDPDocuments/Ambari-2.1.0.0/bk_Ambari_Users_Guide/content/_how_to_configure_namenode_high_availability.html (Ambari 2.1.0, but it is not different from 2.2). You can also use this for NameNode and ResourceManager high availability and more: http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_Ambari_Users_Guide/content/ch_managing_service_high_availability.html