Member since: 09-17-2015
Posts: 70
Kudos Received: 79
Solutions: 20
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 2371 | 02-27-2018 08:03 AM |
| 2170 | 02-27-2018 08:00 AM |
| 2595 | 10-09-2016 07:59 PM |
| 902 | 10-03-2016 07:27 AM |
| 974 | 06-17-2016 03:30 PM |
02-27-2018
08:03 AM
1 Kudo
Adding columns to the end of the table works from Hive 1.2 (via HDP 2.5.4). In Hive 2.1 you get additional abilities to change column types. In the eventual Hive 2.2 you'll get the ability to delete and reorder columns. Hive 0.13 is a little early for those features.
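As a rough sketch of the corresponding DDL (my_table, new_col, and old_col are placeholder names for illustration):

ALTER TABLE my_table ADD COLUMNS (new_col STRING COMMENT 'appended at the end of the schema');
ALTER TABLE my_table CHANGE COLUMN old_col old_col BIGINT;  -- type change, subject to the version caveats above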
02-27-2018
08:00 AM
Hello Mithun. Having a merge step is definitely the more foolproof approach. Otherwise you will need to know more about your data and its distribution and set the parameters yourself. A first step would be hive.merge.smallfiles.avgsize, which adds the extra merge step only if the average output file size falls below the threshold. You can also set the number of reducers yourself, either statically or dynamically based on the volume of data coming in; if you know your workload, this lets you calculate the output file size roughly. It comes down to a trade-off between a more generic approach with a merge step and a more granular approach in which you know your workload. Hope this helps.
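For reference, a sketch of the settings mentioned above (the sizes are illustrative, not recommendations):

SET hive.merge.tezfiles=true;  -- add a merge step after Tez jobs (hive.merge.mapfiles / hive.merge.mapredfiles for MapReduce)
SET hive.merge.smallfiles.avgsize=128000000;  -- merge only if the average output file is below ~128 MB
SET hive.merge.size.per.task=256000000;  -- target size of the merged files
SET mapred.reduce.tasks=10;  -- fix the reducer count statically
SET hive.exec.reducers.bytes.per.reducer=256000000;  -- or size reducers dynamically from the input volume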
04-23-2017
01:44 PM
Hello Sami. Without the logs and/or error message this is left to guesswork. A quick look suggests that you are allocating more than the total amount of resources and that you have not defined the support queue you listed.
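As an illustration only (the percentages are hypothetical, and "support" is simply the queue name from your description), capacity-scheduler.xml needs every queue you reference to be declared and the sibling capacities to sum to 100:

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,support</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.support.capacity</name>
  <value>40</value>
</property>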
12-20-2016
03:20 PM
7 Kudos
This article will go over the concepts of security in an HBase cluster. More specifically, we will concentrate on ACL-based security and how to apply it at the different levels of granularity of the HBase model. From an overall security perspective, an access control list, or ACL, is a list of permissions associated with an object; ACLs focus on the access-rules pattern.

ACL logic
HBase access control lists are granted at different levels of data abstraction and cover several types of operations.

HBase data layout
Before we go further, let us clear up the hierarchical elements that compose HBase data storage.
CELL: All values written to HBase are stored in what is known as a CELL (a cell can also be referred to as a KeyValue). Cells are identified by a multidimensional key {row, column family, qualifier, timestamp}. For example: CELL => Rowkey1, CF1, Q11, TS1.
COLUMN FAMILY: A column family groups together arbitrary cells.
TABLE: All cells belong to a column family and are organized into a table.
NAMESPACE: Tables in turn belong to namespaces. This can be thought of as a database-to-table relationship. With this in mind, a table's fully qualified name is Table => Namespace:Table (the default namespace can be omitted).

HBase scopes
Permissions are evaluated starting at the widest scope and working down to the narrowest scope:
Global
Namespace
Table
Column Family (CF)
Column Qualifier (Q)
Cell
For example, a permission granted at the table level dominates grants done at the column family level.

Permissions
HBase can give granular access rights at each scope. Permissions are zero or more letters from the set RWXCA.
Superuser: a special user that has unlimited access
Read (R): read right on the given scope
Write (W): write right on the given scope
Execute (X): coprocessor execution on the given scope
Create (C): can create and delete tables on the given scope
Admin (A): right to perform cluster admin operations, for example granting rights
Combining access rights and scopes creates a complete matrix of access patterns and roles. In order to avoid complex conflicting rules, it is often useful to build access patterns from roles and responsibilities up.
Role | Responsibilities |
---|---|
Superuser | Usually this role should be reserved solely for the hbase user |
Admin | (A) Operational role: performs cluster-wide operations like balancing and assigning regions. (C) DBA-type role: creates and drops tables and namespaces |
Namespace Admin | (A) Manages a specific namespace from an operations perspective: can take snapshots, run splits, etc. (C) From a DBA perspective: can create tables and give access |
Table Admin | (A) Operational role: can manage splits, compactions, etc. (C) Can create snapshots, restore a table, etc. |
Power User | (RWX) Can use the table by writing or reading data, and possibly use coprocessors |
Consumer | (R) Can only read and consume data |
Some actions need a mix of these permissions to be performed:
CheckAndPut / CheckAndDelete: these actions need RW permissions.
Increment / Append: these only require W permissions.
The full ACL matrix can be found here: http://hbase.apache.org/book.html#appendix_acl_matrix

Setting up
In order to set up HBase ACLs you will need to modify hbase-site.xml with the following properties:
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController,org.apache.hadoop.hbase.security.token.TokenProvider</value>
</property>
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.coprocessor.regionserver.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.security.exec.permission.checks</name>
  <value>true</value>
</property>
In Ambari this is much easier: just enable security and Ambari will automatically set all these configurations for you.

Applying ACLs
Now that we have restarted our HBase cluster and set up the ACL feature, we can start setting up rules. For simplicity we will use two users: hbase and testuser. hbase is the superuser for our cluster and will let us set the rights accordingly.

Namespace
As the hbase user we create an 'acl' namespace:
hbase(main):001:10> create_namespace 'acl'
0 row(s) in 0.3180 seconds
As testuser we will create a table in this new namespace:
hbase(main):001:0> create 'atest','cf'
ERROR: org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient permissions (user=testuser, scope=default, params=[namespace=default,table=default:atest,family=cf],action=CREATE)
We are not allowed to create a table in this namespace. The superuser hbase will give the right to testuser:
hbase(main):001:10> grant 'testuser','C','@acl'
0 row(s) in 0.3360 seconds
We can now run the previous command as testuser:
hbase(main):002:0> create 'atest','cf'
0 row(s) in 2.3360 seconds
We will now open this table to another user, testuser2:
hbase(main):002:0> grant 'testuser2','R','@acl'
ERROR: org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient permissions (user=testuser, scope=acl, params=[namespace=acl],action=ADMIN)
Notice we can't grant rights to other users, as we are missing Admin permissions. We can fix this with our hbase superuser:
hbase(main):002:20> grant 'testuser','A','@acl'
0 row(s) in 0.460 seconds
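To verify what has been granted at any point, the shell's user_permission command lists the ACLs; as a rough sketch (recent HBase versions also accept a namespace argument, which may not apply to older releases):

hbase(main):003:0> user_permission '@acl'

This should list testuser with the CREATE and ADMIN actions on the acl namespace.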
12-12-2016
12:48 PM
Hello. You can definitely upload data into HDFS and then into HBase through Hive. You can also query HBase through Hive using the HBase storage handler. Please refer here for a more detailed explanation: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration If the data is derived from a Hive table it has a schema, so I would also consider the Hive/Phoenix storage handler: https://phoenix.apache.org/hive_storage_handler.html From a performance standpoint, querying HBase through Hive will generally be less performant than querying ORC tables. This being said, it depends on the query pattern and what the use case is. Regards
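A minimal sketch of such a Hive table backed by the HBase storage handler (the table name, column family, and mapping below are placeholders for your actual schema):

CREATE EXTERNAL TABLE hbase_backed_table (rowkey STRING, col1 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:col1")
TBLPROPERTIES ("hbase.table.name" = "my_hbase_table");

Queries against hbase_backed_table then read the underlying HBase table through Hive.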
10-09-2016
08:07 PM
1 Kudo
Hello. This thread might help: https://community.hortonworks.com/questions/24961/how-to-configure-ha-for-knox-gateway-using-any-loa.html and the Knox documentation as well: http://knox.apache.org/books/knox-0-6-0/user-guide.html#High+Availability As far as Ambari is concerned there are plans, but you can always create your own Ambari stack to deploy a second Knox and do the work to make it HA.
10-09-2016
07:59 PM
1 Kudo
Hello Pan. This question is about node resources and data per region. I'm not really sure what your other configurations are, like handlers, GC, cache or region replicas, so I'm a little in the dark. The usual formula is (RS memory) * (total memstore fraction) / ((memstore flush size) * (# column families)). This calculation gives a guideline, not a hard truth, because the answer also depends on the actual load and query pattern. Your RegionServer can very well hold many more regions, but by definition it then gets many more writes, since it is responsible for more regions. As such it will buffer and flush very often; under heavy load you are prone to big flushes and compaction issues, and probably eventually RegionServers going down because they become unresponsive. Again, if out of the 2000 regions only a couple are actually active, it is not as critical, but it is still not a good pattern. The same goes on the read side: given the amount of memory allocated for the cache, with that many regions in frequent use you will end up going to disk very often, resulting in poor read performance. You could look at your cache hit/miss ratio to see how your RegionServers are doing. Lastly, with that kind of distribution, if one RegionServer goes down your overall loss is probably very big, so it is not ideal for recovery purposes. Overall, 100-200 regions per RegionServer seems a decent high ballpark; depending on resources, going too far outside that will need some tuning and monitoring. Hope this sheds some light.
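To make the formula concrete, assuming the usual defaults of a 16 GB RegionServer heap, a 0.4 global memstore fraction, a 128 MB memstore flush size, and one column family: (16,384 MB * 0.4) / (128 MB * 1) ≈ 51 regions, which is why the guidance lands in the low hundreds of regions per RegionServer rather than the thousands.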
10-03-2016
07:27 AM
2 Kudos
Hello Rahul. Your question is a little generic, so it is hard to help you out much without details like the service used, the data read, etc. This being said, since we are in the YARN thread I suppose it is a YARN service like Hive or Spark. In your shoes I would go to the YARN UI and job logs to understand where the latency happens. Is it in the init phase, with YARN waiting to get containers? In that case resources or max AM percent per queue are possible configurations to look at. Is it in the compute phase itself, with some "mappers" taking much longer? In that case you need to look at things like container errors and restarts, IO throughput, or data spill. The Tez UI has a very good tool, the Tez swimlane, to get a high-level view of the DAG and a sense of where to look. Same thing on the Spark side with the Spark UI. Hope any of this helps.
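For example, the same information can be pulled from the command line (the application id below is a placeholder):

yarn application -list -appStates RUNNING
yarn logs -applicationId application_1234567890123_0042

The first shows what is accepted versus running (a hint that you are waiting on containers); the second pulls the aggregated container logs, once log aggregation has collected them, to look for errors, restarts, or spills.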
09-26-2016
08:20 AM
When a client wants to write an HDFS file, it must obtain a lease, which is essentially a lock, to ensure the single-writer semantics. If the lease is not explicitly renewed or the client holding it dies, it will expire. When this happens, HDFS will close the file and release the lease on behalf of the client. The lease manager maintains a soft limit (1 minute) and a hard limit (1 hour) for the expiration time. If you wait, the lease will be released and the append will work. This being a workaround, the question is how this situation came to be. Did a first process break? Do you have storage quotas enabled and are you writing to a maxed-out directory?
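If you cannot wait for the hard limit, recent Hadoop releases also ship a debug command that asks the NameNode to recover the lease explicitly; a sketch, with a placeholder path:

hdfs debug recoverLease -path /user/example/stuck_file -retries 3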
09-26-2016
07:39 AM
1 Kudo
@Sunile Manjee I have never seen comparative stats on these two bulk loading tools. If you have a Phoenix table, it would require a little bit of work to get a native HBase schema to look enough like a Phoenix table for the comparison to mean anything; things like composite keys or column types come to mind. If it is just a Phoenix view on an HBase table, then the comparison might make more sense, but you lose a lot of the Phoenix magic. Overall the performance should not vary much from one to the other, aside from any extra work you hide in the Phoenix table, like indexes and stats. From a pure operations perspective, use the bulk load best fitted to the type of your table.
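For reference, the two entry points look roughly like this (jar name, table names, columns and paths are placeholders and vary by distribution):

# Phoenix bulk load
hadoop jar phoenix-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table MY_TABLE --input /tmp/data.csv
# Native HBase bulk load: generate HFiles, then hand them to the table
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:c1 -Dimporttsv.bulk.output=/tmp/hfiles my_table /tmp/data.tsv
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles my_table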