We store our data in HDFS via HBase regions, and we have written a Hive query on top of the HBase table that runs on the Tez engine. When the query executes, the data is read from HBase and the Tez job takes 1 hour 30 minutes to finish. The table contains 350 million records and its size is 50 GB.
Is there any performance tuning technique to run this job in a shorter time?
Is there any configuration we need to change in Tez to improve the performance?
Is there any configuration we need to change in Hive to improve the performance?
Hi @Mathi Murugan,
I hope this article helps you; it also helped me tune our cluster!
I am not sure I understand exactly what end purpose you are pursuing, so I can't really give insight on the overall architecture. The first thing I would like to point out is that HBase is a NoSQL store and is not well suited to ad hoc analytical queries. Queries that do not follow the HBase data model and row keys will suffer on the performance side. That being said, there are multiple ways to query HBase through a SQL interface: Hive is one, Phoenix is another. I would recommend having a look at Phoenix if it is applicable; you would probably get better performance there.
On the Hive handler side, multiple tuning elements could help, though probably never to the point of giving really low latency. From a very high-level perspective, the storage handler queries HBase online, brings the data back to Hive, and then applies your query logic. Of course, if your query makes use of the HBase model and row key, it will perform much better. Since Hive and Tez are batch in nature, querying a snapshot of your table would shave off a lot of the online overhead:
Set hive.hbase.snapshot.name and select against that snapshot; this presentation should explain more:
Multiple other configs could help, but a closer look at your query patterns and usage would be needed.
Hope any of this helps.
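As a rough sketch of the snapshot approach mentioned above, the Hive session could look something like the following. The table name events_hbase, the HBase table events, the snapshot name events_snap and the restore directory are all hypothetical placeholders, not names from your cluster. It also assumes the snapshot has already been taken from the HBase shell, and that the restore directory is on the same filesystem as the HBase root directory and writable by the user running the query.

```sql
-- Assumption: a snapshot was created beforehand in the HBase shell, e.g.
--   snapshot 'events', 'events_snap'
-- ('events' and 'events_snap' are placeholder names).

-- Point the HBase storage handler at the snapshot instead of the live table.
SET hive.hbase.snapshot.name=events_snap;

-- Directory where the snapshot is restored for reading (placeholder path).
SET hive.hbase.snapshot.restoredir=/tmp/hbase_snapshot_restore;

-- events_hbase is a hypothetical Hive table mapped onto the HBase table.
-- With the settings above, this scan reads the snapshot files from HDFS
-- rather than going through the region servers.
SELECT COUNT(*) FROM events_hbase;
```

Keep in mind that the query then sees the data as of the time the snapshot was taken, so this only fits if slightly stale results are acceptable for the job.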