Member since
09-17-2015
70
Posts
79
Kudos Received
20
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1549 | 02-27-2018 08:03 AM
 | 1566 | 02-27-2018 08:00 AM
 | 1201 | 10-09-2016 07:59 PM
 | 400 | 10-03-2016 07:27 AM
 | 500 | 06-17-2016 03:30 PM
03-06-2018
10:41 AM
1 Kudo
Hello. XML has a specific structure which will probably change a little in the way you model it in Hbase, for example when picking the rowkey or deciding how xml fields get projected into column families. That is true unless you want to store the whole XML document as a raw blob with no work done on it, which is another option. With the first approach, you would usually use a parsing engine or ETL to load the data into Hbase with the right data model. Popular choices would be Spark parsing and loading into Hbase, or a java job; this github project may give you some ideas: https://github.com/sreejithpillai/HBaseBulkImportXML
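If you go down the raw-blob route, a minimal hbase shell sketch (table, row and column names are invented for illustration) would be:

# store each whole XML document as an opaque value under one column family
hbase(main):001:0> create 'xml_docs', 'raw'
hbase(main):002:0> put 'xml_docs', 'doc-0001', 'raw:body', '<order id="1">...</order>'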
03-06-2018
10:06 AM
Hello. I suppose your standard container size is about 4 GB. Unless you are using cgroups, Yarn only allocates based on memory settings; in your scenario, 119 containers for 476 GB available works out to 4 GB per container. If you want fine-grained control over CPU scheduling you will need to configure Yarn to use cgroups. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_yarn-resource-management/content/ch_cgroups.html
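For reference, the cgroups side of yarn-site.xml looks roughly like the sketch below; the property names come from the Yarn cgroups documentation linked above, the hierarchy value is a placeholder, and the linked doc lists the additional mount and group settings you will also need:

<!-- sketch only: switch to the Linux container executor with the cgroups resource handler -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/yarn</value>  <!-- placeholder hierarchy -->
</property>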
02-27-2018
08:03 AM
1 Kudo
Adding columns to the end of the table works from Hive 1.2 (via HDP 2.5.4). In Hive 2.1 you get additional abilities to change column types, and the eventual Hive 2.2 will add the ability to delete and reorder columns. Hive 0.13 is a little early for those features.
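For illustration, with hypothetical table and column names:

-- Hive 1.2+: append a column at the end of the table
ALTER TABLE my_table ADD COLUMNS (new_col STRING COMMENT 'added at the end');
-- Hive 2.1+: change a column's type
ALTER TABLE my_table CHANGE COLUMN old_col old_col BIGINT;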
02-27-2018
08:00 AM
Hello Mithun. Having a merge step is definitely the more foolproof approach. Otherwise you will need to know more about your data and its distribution and set things yourself. A first step would be hive.merge.smallfiles.avgsize, which adds the extra merge step only if the average file size falls below the threshold. You can also set the number of reducers yourself, either statically or dynamically based on the volume of data coming in; if you know your workload this will let you estimate the output file size roughly. It comes down to a trade-off between a more generic approach with a merge step and a more granular approach in which you know your workload. Hope this helps.
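A sketch of the knobs mentioned above; the values are placeholders to adapt to your block size and workload:

-- merge small output files when the average file size falls below the threshold
set hive.merge.mapredfiles=true;
set hive.merge.tezfiles=true;                        -- if running on Tez
set hive.merge.smallfiles.avgsize=128000000;         -- placeholder: ~128 MB average
set hive.merge.size.per.task=256000000;              -- placeholder: target merged size
-- or control the reducer count yourself
set mapred.reduce.tasks=20;                          -- static, placeholder value
set hive.exec.reducers.bytes.per.reducer=256000000;  -- dynamic, based on input volume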
02-15-2018
05:25 PM
Hello Pedro. I think Hbase region server groups may be of help for your need. This presentation has information on the feature: https://www.slideshare.net/Hadoop_Summit/achieving-hbase-multitenancy-with-regionserver-groups-and-favored-nodes-77157377
09-07-2017
08:21 AM
1 Kudo
Hello. The explanation might be a little too high level to help efficiently. I understand this is for a specific Region Server, not all of them or a random one. A couple of things can make a Region Server go down. The usual culprit is skew, by which I mean this Region Server gets a lot of traffic, for example writes. It will then be flushing the memstore very often and running a lot of GCs to clean out memory, and if those last too long it may not be able to heartbeat to zookeeper within the predefined time window. Zookeeper will then take it out. You can look in the logs for the memstore flushes and GC cleanups; you should also see Zookeeper timeout warnings.
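If you want to confirm that pattern, a hedged place to start on the affected node (the log location is the usual HDP default and the grep patterns are only starting points, not exact message strings):

# long JVM pauses reported by the pause monitor
grep -i "Detected pause in JVM" /var/log/hbase/hbase-hbase-regionserver-*.log
# frequent memstore flushes and zookeeper session trouble
grep -iE "memstore flush|session expired|timed out" /var/log/hbase/hbase-hbase-regionserver-*.log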
04-23-2017
01:44 PM
Hello Sami. Without the logs and/or error message this is left up to guesswork. A quick look seems to show that you are allocating more than the total amount of resources, and that you do not seem to have defined the support queue you listed.
04-23-2017
01:32 PM
Hello Mathi. I am not sure I understand exactly what end purpose you are pursuing, so I really can't give insight on the overall architecture. The first thing I would like to point out is that Hbase is a NoSQL store and not well suited for ad hoc random analytical queries; queries that have nothing to do with the Hbase model and keys will suffer on the performance side. This being said, there are multiple ways to query Hbase with a SQL interface: Hive is one, Phoenix would be another. I would recommend having a look at Phoenix if applicable, you would probably get better performance there. On the Hive handler side multiple tuning elements could help, while probably never really giving low latency. From a very high-level perspective, the way the storage handler works is that it queries Hbase online, brings the data back to Hive and then applies your query logic. Of course, if your query makes use of the Hbase model and key it will do much better. Hive and Tez being batch in nature, querying a snapshot of your table would shave off a lot of the online overhead: set hive.hbase.snapshot.name and select on that snapshot; this presentation should explain more: https://fr.slideshare.net/HBaseCon/ecosystem-session-3a Multiple other configs could help, but a closer look at your query patterns and usage would be needed. Hope any of this helps.
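A rough sketch of the snapshot route; table and snapshot names are invented for illustration, only hive.hbase.snapshot.name comes from the text above, and depending on your version you may also need a companion snapshot restore directory setting:

# hbase shell: snapshot the online table
hbase(main):001:0> snapshot 'my_hbase_table', 'my_hbase_table_snap'

-- hive: point the storage-handler table at the snapshot instead of the live table
set hive.hbase.snapshot.name=my_hbase_table_snap;
select count(*) from my_hive_on_hbase_table;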
02-09-2017
10:02 PM
1 Kudo
Hello Daniel. There is nothing mandatory to do for Hive or Hbase after an HDFS rebalance. As Sergey mentioned, in the Hbase case your Hbase files might not be completely co-located with the corresponding region servers anymore and a major compaction could help there. This being said, it will not block Hbase from working, and over time Hbase will "re-localize" so to speak; you will only incur mild performance degradation depending on your usage pattern. On the Hive front the Metastore, Yarn and Tez will still work together to find your files and start compute as locally as possible, so nothing to do there either. I will let more knowledgeable experts like Sergey or others weigh in on detailed technical specifics if I missed something, but HDFS rebalance is an operation that happens continuously and should be as transparent as possible to your daily work.
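If you do want to nudge locality back after a big rebalance, a major compaction can be kicked off from the hbase shell (the table name is a placeholder):

hbase(main):001:0> major_compact 'my_table'   # rewrites the HFiles locally on the hosting region servers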
12-20-2016
03:20 PM
7 Kudos
This article will go over the concepts of security in an Hbase cluster. More specifically we will concentrate on ACL-based security and how to apply it at the different levels of granularity of an Hbase model. From an overall security perspective, an access control list, or ACL, is a list of permissions associated with an object; ACLs focus on the access rules pattern.

ACL logic

Hbase access control lists are granted on different levels of data abstraction and cover types of operations.

Hbase data layout

Before we go further, let us clear up the hierarchical elements that compose Hbase data storage.

CELL: All values written to Hbase are stored in what is known as a CELL (a cell can also be referred to as a KeyValue). Cells are identified by a multidimensional key {row, column family, qualifier, timestamp}. For example: CELL => Rowkey1, CF1, Q11, TS1

COLUMN FAMILY: A column family groups together arbitrary cells.

TABLE: All cells belong to a column family and are organized into a table.

NAMESPACE: Tables in turn belong to namespaces. This can be thought of as a database-to-table logic. With this in mind, a table's fully qualified name is Table => Namespace:Table (the default namespace can be omitted).

Hbase scopes

Permissions are evaluated starting at the widest scope and working down to the narrowest scope: Global, Namespace, Table, Column Family (CF), Column Qualifier (Q), Cell. For example, a permission granted at the table level dominates grants done at the column family level.

Permissions

Hbase can give granular access rights depending on each scope. Permissions are zero or more letters from the set RWXCA.

Superuser: a special user that has unlimited access
Read (R): read right on the given scope
Write (W): write right on the given scope
Execute (X): coprocessor execution on the given scope
Create (C): can create and delete tables on the given scope
Admin (A): right to perform cluster admin operations, for example granting rights

Combining access rights and scopes creates a complete matrix of access patterns and roles. In order to avoid complex conflicting rules it can often be useful to build access patterns up from roles and responsibilities.
Role | Responsibilities
---|---
Superuser | Usually this role should be reserved solely to the Hbase user
Admin | (A) Operational role: performs cluster-wide operations like balancing and assigning regions. (C) DBA-type role: creates and drops tables and namespaces
Namespace Admin | (A) Manages a specific namespace from an operations perspective: can take snapshots, do splits, etc. (C) From a DBA perspective: can create tables and give access
Table Admin | (A) Operational role: can manage splits, compactions, etc. (C) Can create snapshots, restore a table, etc.
Power User | (RWX) Can use the table by writing or reading data and possibly use coprocessors
Consumer | (R) Can only read and consume data
Some actions need a mix of these permissions to be performed:

CheckAndPut / CheckAndDelete: these actions need RW permissions
Increment / Append: only require W permissions

The full ACL matrix can be found here: http://hbase.apache.org/book.html#appendix_acl_matrix

Setting up

In order to set up Hbase ACLs you will need to modify hbase-site.xml with the following properties:

<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController, org.apache.hadoop.hbase.security.token.TokenProvider</value>
</property>
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.coprocessor.regionserver.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.security.exec.permission.checks</name>
  <value>true</value>
</property>

In Ambari this is much easier: just enable security and Ambari will automatically set all these configurations for you.

Applying ACLs

Now that we have restarted our Hbase cluster and set up the ACL feature, we can start setting up rules. For simplicity we will use two users: hbase and testuser. Hbase is the superuser for our cluster and will let us set the rights accordingly.

Namespace

As the Hbase user we create an 'acl' namespace:

hbase(main):001:0> create_namespace 'acl'
0 row(s) in 0.3180 seconds

As testuser we try to create a table in this new namespace:

hbase(main):001:0> create 'atest','cf'
ERROR: org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient permissions (user=testuser, scope=default, params=[namespace=default,table=default:atest,family=cf],action=CREATE)

We are not allowed to create a table in this namespace. The superuser Hbase will give the rights to testuser:

hbase(main):001:0> grant 'testuser','C','@acl'
0 row(s) in 0.3360 seconds

We can now run the previous command as testuser:

hbase(main):002:0> create 'atest','cf'
0 row(s) in 2.3360 seconds

We will now open this table to another user, testuser2:

hbase(main):002:0> grant 'testuser2','R','@acl'
ERROR: org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient permissions (user=testuser, scope=acl, params=[namespace=acl],action=ADMIN)

Notice we can't grant rights to other users as we are missing Admin permissions. We can fix this with our Hbase superuser:

hbase(main):002:0> grant 'testuser','A','@acl'
0 row(s) in 0.460 seconds
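The same grant syntax also goes down to table, column family and qualifier scope; as a quick hedged illustration on the table created above (the second user and the qualifier name are invented):

hbase(main):003:0> grant 'testuser2', 'R', 'atest', 'cf'          # read only on one column family
hbase(main):004:0> grant 'testuser2', 'RW', 'atest', 'cf', 'q1'   # read/write on a single qualifier
hbase(main):005:0> user_permission 'atest'                        # list who holds what on the table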
Labels: Data Processing, HBase, hbase-namespace, How-To/Tutorial
12-12-2016
12:48 PM
Hello. You can definitely upload data into HDFS and then into Hbase through Hive. You can also query Hbase through Hive using the Hbase storage handler; please refer here for a more detailed explanation: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration If the data is derived from a Hive table it has a schema, so I would also consider the Hive / Phoenix storage handler: https://phoenix.apache.org/hive_storage_handler.html From a performance standpoint, querying Hbase through Hive will generally be less performant than querying ORC tables. This being said, it depends on the query pattern and what the use case is. Regards
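For reference, a minimal mapping sketch with the Hbase storage handler; the table, family and column names are placeholders, see the wiki link above for the full syntax:

CREATE EXTERNAL TABLE hbase_orders (rowkey STRING, amount DOUBLE)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:amount")
TBLPROPERTIES ("hbase.table.name" = "orders");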
10-09-2016
08:07 PM
1 Kudo
Hello. This thread might help: https://community.hortonworks.com/questions/24961/how-to-configure-ha-for-knox-gateway-using-any-loa.html and the Knox documentation as well: http://knox.apache.org/books/knox-0-6-0/user-guide.html#High+Availability As far as Ambari is concerned there are plans, but you can always create your own Ambari stack to deploy a second Knox and do the work to make it HA.
10-09-2016
07:59 PM
1 Kudo
Hello Pan. This question is about node resources and data per region. I am not really sure what your other configurations like handlers, GC, cache or region replicas are, so I am a little in the dark. The usual formula is (RS memory)*(total memstore fraction)/((memstore size)*(# column families)). This calculation is really a guideline, not a hard truth, because it also depends on the actual load and query pattern. Your region server can very well hold many more regions, but it will by definition get many more writes since it is responsible for more regions. As such it will buffer and flush very often; under heavy load you are prone to big flush and compaction issues, and probably eventually region servers going down because they become unresponsive. Again, if out of the 2000 regions only a couple are actually active it is not as critical, but it is still not a good pattern. Same on the read side: looking at the amount of memory allocated for the cache, with that many regions, if they are often used you will end up going to disk very often, resulting in poor read performance. You could look at your cache hit/miss ratio to see how your region servers are doing. Lastly, with that kind of distribution, if one region server goes down your overall loss is probably very big, so it is not ideal for recovery purposes either. Overall 100-200 regions per RS seems a decent high ballpark; depending on resources, going too far outside of that will need some tuning and monitoring. Hope this sheds some light.
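As a purely illustrative plug-in of numbers into that formula (your actual heap and flush settings will differ):

# 32 GB region server heap, 0.4 memstore fraction, 128 MB flush size, 2 column families
(32768 MB * 0.4) / (128 MB * 2) = ~51 regions per region server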
10-03-2016
07:27 AM
2 Kudos
Hello Rahul. Your question is a little generic, so it is hard to help you out much without things like the service used, the data read, etc. This being said, since we are in the Yarn thread I suppose it is a Yarn service like Hive or Spark. In your shoes I would go to the Yarn UI and job logs to understand where the latency happens. Is it in the init phase, with Yarn waiting to get containers? In that case resources or the max AM percent per queue are possible configurations to look at. Is it in the compute phase itself, with some "mappers" taking much longer? In that case you need to look at things like container errors and restarts, IO throughput, or data spill. The Tez UI has a very good tool, the Tez swimlane, to get a high-level view of the DAG and a sense of where to look. Same thing on the Spark side with the Spark UI. Hope any of this helps.
09-26-2016
08:20 AM
When a client wants to write an HDFS file, it must obtain a lease, which is essentially a lock, to ensure the single-writer semantics. If a lease is not explicitly renewed or the client holding it dies, then it will expire. When this happens, HDFS will close the file and release the lease on behalf of the client. The lease manager maintains a soft limit (1 minute) and a hard limit (1 hour) for the expiration time. If you wait, the lease will be released and the append will work. This being a workaround, the question is how this situation came to be. Did a first process break? Do you have storage quotas enabled and are you writing to a maxed-out directory?
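If you would rather not wait for the hard limit, recent Hadoop releases also ship a small helper to force lease recovery; a hedged example (the path is a placeholder, and do check that the subcommand exists in your version):

# ask the namenode to recover the lease on the file right away
hdfs debug recoverLease -path /data/myfile -retries 3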
09-26-2016
07:39 AM
1 Kudo
@Sunile Manjee I have never seen comparative stats on these two bulk-loading calls. If you have a Phoenix table, it would require a little bit of work to get a native Hbase schema to really look enough like a Phoenix table for the comparison to mean anything; things like complex keys or column types come to mind. If it is just a Phoenix view on an Hbase table then the comparison might make more sense, but you lose a lot of the Phoenix magic. Overall the performance should not vary much from one to the other, aside from any extra work you hide in the Phoenix table, like indexes and stats. From a pure operations perspective, use the bulk load best fitted to the type of your table.
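For reference, the two bulk-load entry points look roughly like this; the column mapping, paths and the phoenix jar name are placeholders:

# native Hbase: generate HFiles from TSV, then hand them off to the region servers
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:c1 -Dimporttsv.bulk.output=/tmp/hfiles mytable /data/input
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable

# Phoenix: CSV bulk load tool, which also maintains indexes and stats on the way in
hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  --table MYTABLE --input /data/input.csv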
07-12-2016
04:44 PM
Hello Pooja. From your stack trace your table seems to be bucketed. Can you share your table definition? Could you also try running the query with the setting hive.auto.convert.join.noconditionaltask=false?
06-21-2016
10:03 AM
Hello Michel. Right now the Hive plan calculation does not reach out to get Hbase's stats, so currently there is no added benefit from the Hbase stats. This being said, these are questions that are being worked on in different initiatives, so this will likely change in the future.
06-17-2016
03:30 PM
4 Kudos
Hello Timothy. There are multiple ways to integrate these 3 services. As a starting point, Nifi will probably be your ingestion flow. During this flow you could:
- put your data into Kafka and have Spark read from it
- push your Nifi data to Spark: https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark
- use an execute-script processor and start a Pig job
In summary you can have a push-and-forget connection, a push to a service that gets picked up in the next flow, or even execute in a processor as a corner case. Hope this shares some insight.
06-12-2016
11:56 AM
The "DFSInputStream has been closed already" message is only a warning and is fixed in Hadoop 2.7.2: https://issues.apache.org/jira/browse/HDFS-8099 Taking 10 minutes to submit the job seems to be a different problem from your reduce issue. Do check what your available resources look like in Yarn and how long it takes to get an Application Master; it would be interesting to see in the logs whether it is waiting or what else is happening. You could also check that your timeline server is responding and not underwater, as that can have an impact.
06-12-2016
10:27 AM
Hello Venkadesh. It would be worth investigating why your reducer gets a timeout error and then gets completed. Do you have a slow node, is it a code-related error, are your reducers sitting around too long? Depending on the answers, several options are available:
- you could increase the task timeout (mapred.task.timeout)
- you could force a higher number of reducers to get better distribution (mapred.reduce.tasks=# of reducers)
- you could configure reducers to start closer to the end of the map phase (mapred.reduce.slowstart.completed.maps)
- you could use speculative execution to see if some nodes are faster than others
These are some ideas that come to mind; depending on a closer analysis there might also be other ways (the settings are sketched below). Hope any of this helps.
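With placeholder values only (the old mapred.* names still work as aliases for the newer mapreduce.* names):

# give slow tasks more room before they are declared dead (placeholder: 20 minutes, in ms)
mapred.task.timeout=1200000
# force more reducers for better distribution (placeholder value)
mapred.reduce.tasks=50
# start reducers only once most of the map phase is done (placeholder: 80%)
mapred.reduce.slowstart.completed.maps=0.80
# let speculative execution retry slow attempts on other nodes
mapred.reduce.tasks.speculative.execution=true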
05-03-2016
07:33 AM
Hello Ethan. No, Kafka is not necessary for importing data into Atlas. Atlas will actually listen in to services like Hive, Sqoop, Falcon, etc. to automatically import metadata. You can also interact with the Atlas APIs, REST or otherwise, to import your own data, say tags for example. Kafka is very useful, for example, in the communication with Ranger for security policies: as you add tags to data in Atlas you want Ranger to pick them up as soon as possible, and Kafka is that gateway. Kafka, in the not-so-distant future, will also be a service monitored by Atlas, as it too is a gateway for data inside Hadoop and as such is a source Atlas should do governance for. Hope this helps.
05-03-2016
07:22 AM
1 Kudo
Hello Ethan. The difference is mainly batch versus realtime. By this I mean the bridge will import all existing data, or rather metadata, from the Hive metastore, so all pre-existing tables and definitions, whereas the hook will listen in real time to events happening in Hive. The Atlas documentation explains this here if you want a more detailed explanation: http://atlas.incubator.apache.org/Bridge-Hive.html
05-02-2016
03:47 PM
Can you make sure that, at the top of the tutorial page, when you open the gear icon, "hive %hive..." is blue, and then click save? If not, can you share the description and config of your Hive interpreter?
05-01-2016
08:17 AM
In order to use node labels you will first have to enable them in Yarn (yarn.node-labels.enabled=true), then set up a label directory, create labels, and associate them with hosts and queues. Labels are logically accessed through the capacity queue they are associated with, so in your case it would just be a matter of running your job in the right Yarn capacity queue. The documentation has an example that can help you: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/configuring_node_labels.html
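A short hedged sketch of the CLI side once the label directory is in place; the label and host names are invented, and the queue mapping itself is done in the capacity scheduler config as per the linked doc:

# create a label and attach it to a node
yarn rmadmin -addToClusterNodeLabels "highmem"
yarn rmadmin -replaceLabelsOnNode "worker01.example.com=highmem"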
04-30-2016
09:24 AM
Hello Sumit. If your ulimit is already set to unlimited or a very high number, you could get insight into the number of open files with lsof | wc -l. You may need to increase the maximum number of file handles in the OS; check fs.file-max to see if this helps. This is to try to address the cause. An offlineMetaRepair / fix meta should help with the consequence.
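A few commands to size the problem on the affected host (the limit value below is only a placeholder):

# how many file handles are open right now
lsof | wc -l
# current kernel-wide limit
cat /proc/sys/fs/file-max
# raise it for the running kernel, then persist the value in /etc/sysctl.conf
sysctl -w fs.file-max=1000000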
04-29-2016
08:39 AM
Hello Pedro. Spark core is a general-purpose in-memory analytics engine. Adding things like SparkSQL or SparkML on top of Spark core, you can do a lot of interesting analytics or data science modelling, in a programmatic or SQL fashion. Maybe these tutorials can help you with your first steps: http://hortonworks.com/hadoop-tutorial/hands-on-tour-of-apache-spark-in-5-minutes/ http://hortonworks.com/blog/data-science-hadoop-spark-scala-part-2/
04-14-2016
08:03 AM
Hello Nelson. I don't think you need the Hive configuration explicitly set anymore, i.e. this part: "-Djavax.jdo.option.ConnectionURL=jdbc:mysql://testip/hive?createDatabaseIfNotExist=true -Dhive.metastore.uris=thrift://testip:9083"
04-13-2016
09:10 AM
1 Kudo
Hello Nelson. Instead of putting the Hive info in different properties, could you try adding the hive-site.xml (--files=/etc/hive/conf/hive-site.xml), just to make sure everything is consistent? Without this, Spark could launch an embedded metastore, causing the out-of-memory condition. Could you also share a little bit about the app: what type of data (ORC, CSV, etc.) and the size of the table? Let's see if this helps.
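Roughly, the submit line would just gain the extra file; the class and jar names below are invented placeholders:

# ship the real hive-site.xml with the job so Spark talks to the existing metastore
spark-submit --files /etc/hive/conf/hive-site.xml --class com.example.MyApp my-app.jar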
04-13-2016
08:18 AM
3 Kudos
Hello Sumit. Increasing the zookeeper session timeout is often a quick first fix to GC pauses "killing" region servers in Hbase. In the longer run, if you have GC pauses it is because your process is trying to find memory. There can be architectural approaches to this problem: for example, does this happen during heavy write loads, in which case you could consider doing bulk loads when possible? You can also look at your Hbase configuration: what is your overall allocated memory for Hbase and how is it distributed between writes and reads? Do you flush your memstore often, and does this lead to many compactions? Lastly you can look at GC tuning. I won't dive into this one, but Lars has done a nice introductory blog post on it here: http://hadoop-hbase.blogspot.ie/2014/03/hbase-gc-tuning-observations.html Hope any of this helps.
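For the quick first fix, the property lives in hbase-site.xml; the value below is only an example and must stay within what your zookeeper tick settings allow:

<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>  <!-- example only: 120 s -->
</property>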