HFile creation from Hive Table not working

Expert Contributor

I found the following article about how to fill an HBase table with data from Hive: https://community.hortonworks.com/articles/2745/creating-hbase-hfiles-from-an-existing-hive-table.ht...

I followed the steps myself, and it seems to work. The problem arises when I run the following HiveQL:

set hfile.family.path=/tmp/test_hbase/cf;
set hive.hbase.generatehfiles=true;

INSERT OVERWRITE TABLE testdb.test_hbase
SELECT DISTINCT concat_ws("_", name, number, test, step, cast(starttime AS STRING)) AS k, hashValue, valuelist
FROM testdb.test_orc
ORDER BY k, hashValue
LIMIT 1000;

I need to combine 4 columns to get a unique row key for my HBase table. Another problem is that my valueList column can contain huge strings, between 0 and 1 MB.
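
For reference, testdb.test_hbase is an HBase-backed Hive table set up as described in the article. It looks roughly like the following sketch (the column names, types, and mapping here are only illustrative, not my exact schema):

-- a minimal sketch of the HBase-backed Hive table, following the linked article;
-- the row key maps to :key, the other columns to the 'cf' family
CREATE TABLE testdb.test_hbase (rowkey STRING, hashValue STRING, valuelist STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:hashValue,cf:valuelist')
TBLPROPERTIES ('hbase.table.name' = 'test_hbase');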

When I run the query, Tez creates 100 containers for the map tasks. This takes a few minutes to complete, which is already slow for 1,000 rows, but acceptable. After the map step, a reduce step follows, and this could be the problem in my opinion, because there is only 1 Reducer for this huge amount of data. That seems to be too few, as the job has now been running for hours and still has not completed.

My questions here:

  • What are the Map and Reduce steps doing in this scenario?
  • Why is there only 1 Reducer?
  • Can I somehow change this behavior (e.g. disable the Reduce step or use more Reducers)?

Thank you!

1 ACCEPTED SOLUTION

Expert Contributor

I found the following article: http://www.openkb.info/2017/05/hive-on-tez-how-to-control-number-of.html

It helped me. I now set the number of tasks to a fixed amount.
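
In essence, it comes down to settings like these before the INSERT (the values below are just an illustration, not necessarily what you need for your data):

-- force a fixed number of reduce tasks instead of letting Hive/Tez estimate it
set mapred.reduce.tasks=10;

-- or, alternatively, tune the estimate rather than fixing the count
-- set hive.exec.reducers.bytes.per.reducer=67108864;
-- set hive.exec.reducers.max=99;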


REPLIES

Super Guru

When you are generating HFiles for HBase, the typical pattern is one Reducer per region, because an HFile must only contain data for a single region. As such, tweaking the number of Reducers you get is really a matter of pre-splitting your table to increase the number of regions (and thus Reducers), or merging regions to reduce it.
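
For example, something along these lines in the HBase shell (the table and family names just mirror your example, and the split points are placeholders; pick them to match how your row keys are actually distributed):

# create the table pre-split into 10 regions so the HFile job gets roughly one reducer per region
create 'test_hbase', 'cf', SPLITS => ['1', '2', '3', '4', '5', '6', '7', '8', '9']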

Expert Contributor

@Josh Elser Thank you for that information. I have now changed my HBase table creation to the following command:

create 'hbase_1m_10r', {NAME => 'cf'}, {SPLITS => ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']}

When running the following query:

INSERT OVERWRITE TABLE dmueller.hbase_1m_10r
SELECT concat_ws(":", cast((hashvalue % 10) AS STRING), concat_ws("_", name, number, test, step, cast(starttime AS STRING))) AS k, valuelist
FROM (SELECT * FROM testdb.test_orc LIMIT 1000000) a
DISTRIBUTE BY split(k, ":")[0]
SORT BY k;

I still have only 1 reducer... Any idea why?

Super Guru

Does your data actually span all of the regions you created split points for? Or, when the job finishes generating the HFiles, does the client end up having to split them (rather than just load them)?

The only thing I can guess is that the HBaseStorageHandler isn't doing something right. Generating only one HFile when you have 10 regions is definitely suboptimal.
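
One quick way to check (the path and table name below just reuse the values from your first post; adjust them to your actual run) is to list how many HFiles the Hive job actually wrote, and then see what the bulk load does with them:

# list the HFiles written under the configured hfile.family.path
hdfs dfs -ls /tmp/test_hbase/cf

# bulk-load them into the table; the tool has to split any HFile that spans more than one region
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/test_hbase test_hbase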