HFile creation from Hive Table not working

I found the following article about how to fill an HBase table with data from Hive:

I followed the steps myself, and they seem to work. The problem occurs when I run the following HiveQL:

set hive.hbase.generatehfiles=true;

INSERT OVERWRITE TABLE testdb.test_hbase SELECT distinct concat_ws("_", name, number, test, step, cast(starttime as STRING)) as k, hashValue, valuelist from testdb.test_orc order by k, hashvalue limit 1000;

I need to combine 4 columns to get a unique row key for my HBase table. Another problem is that my valueList column can contain very large strings, between 0 and 1 MB each.

When I run the query, Tez creates 100 containers for the map tasks. These take a few minutes to complete, which is slow for 1000 rows, but acceptable. After the map step, a reduce step follows. In my opinion this could be the problem, because there is only 1 reducer for this huge amount of data. That seems to be too few, as the job has now been running for hours (and still has not completed!).

My questions here:

  • What are the Map and Reduce steps doing in this scenario?
  • Why is there only 1 Reducer?
  • Can I somehow change this behavior (e.g. by disabling the Reduce step or using more Reducers)?

Thank you!


When you are generating HFiles for HBase, the typical pattern is that you have one reducer per Region, because each HFile must only contain data for a single Region. As such, changing the number of Reducers you get is mostly a matter of pre-splitting your table to increase the number of Regions (or merging Regions, to reduce it).
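To make the relationship between split points and Regions concrete, here is a minimal Python sketch (not HBase code) of how a row key would map to a Region. It assumes that row keys are compared as raw bytes in lexicographic order and that each split point is the inclusive start key of the next Region; `region_index` is a hypothetical helper name used only for illustration.

```python
from bisect import bisect_right

def region_index(row_key, split_points):
    """Return the 0-based index of the Region that would hold row_key.

    Assumption: split_points is a sorted list of bytes, and Region i
    covers [split_points[i-1], split_points[i]), open-ended at both extremes.
    """
    return bisect_right(split_points, row_key)

# A table created without split points has one Region, so every key
# maps to index 0 (and, following the pattern above, one reducer).
print(region_index(b"anything", []))

# Three split points give four Regions.
splits = [b"g", b"n", b"t"]
for key in [b"apple", b"grape", b"melon", b"zebra"]:
    print(key, "-> region", region_index(key, splits))
```

Under these assumptions, a table with N split points has N + 1 Regions, so pre-splitting is what raises the upper bound on the number of reducers.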

@Josh Elser Thank you for that information. I have now changed my HBase table creation to the following command:

create 'hbase_1m_10r', {NAME => 'cf'}, {SPLITS => ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']}

When running the following query:

INSERT OVERWRITE TABLE dmueller.hbase_1m_10r SELECT concat_ws(":", cast((hashvalue % 10) as String), concat_ws("_", name, number, test, step, cast(starttime as STRING))) as k, valuelist from (select * from testdb.test_orc limit 1000000) a distribute by split(k, ":")[0] sort by k;

I still have only 1 reducer... Any idea why?

Does your data actually span all of the regions you created split points for? Or, when this finishes generating the HFile, does the client end up having to split the HFiles (rather than just loading them)?
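One way to reason about whether the keys span the regions is to bucket a sample of row keys by the split points from the `create` command. The following Python sketch does this offline; it is only an illustration, assuming byte-wise lexicographic key comparison, and the sample keys are made up to resemble the `<digit>:<rest>` shape produced by the query (they are not real data).

```python
from bisect import bisect_right
from collections import Counter

# Split points '0'..'9' from the create command: 10 split points, 11 regions.
SPLITS = [str(d).encode() for d in range(10)]

def regions_hit(row_keys):
    """Count how many of the given row keys fall into each region index.

    Assumption: region i covers [SPLITS[i-1], SPLITS[i]), compared as bytes.
    """
    return Counter(bisect_right(SPLITS, key) for key in row_keys)

# Hypothetical sample keys shaped like "<hashvalue % 10>:<name_...>".
sample = [f"{h % 10}:name_{h}".encode() for h in range(25)]
counts = regions_hit(sample)

for idx in range(len(SPLITS) + 1):
    print("region", idx, "rows", counts.get(idx, 0))
```

With these sample keys, the region before '0' stays empty and the other ten regions each receive rows. If a real sample showed every key landing in one region (for example, because all prefixes were identical), a single HFile would be the expected result rather than a fault in the job.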

The only thing I can guess is that the HBaseStorageHandler isn't doing something right. Generating only one HFile when you have 10 regions is definitely suboptimal.