Created 08-01-2018 02:11 PM
I'm playing around with HBase tables that are managed by Hive. Therefore I run the following commands in Zeppelin:
%hbase
# Create table with 1 CF and 10 split points (giving 11 regions)
create 'my_test', {NAME => 'cf1', VERSIONS => 3},
  {SPLITS => ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09']}
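As a side note on the split points: a row key lands in the region whose key range covers it, and 10 split points yield 11 regions (one extra region for keys sorting before the first split). A quick illustrative sketch in Python (not HBase code) of that mapping:

```python
import bisect

# The 10 split points from the create statement above
splits = [f"{i:02d}" for i in range(10)]  # '00' ... '09'

def region_for(rowkey: str) -> int:
    # Region 0 holds keys sorting before '00';
    # region i+1 holds keys in ['0i', next split point).
    return bisect.bisect_right(splits, rowkey)

print(region_for("05abc"))  # 6: '05abc' sorts between '05' and '06'
print(region_for("99"))     # 10: after the last split point
```

So keys only spread across all regions if their prefixes actually cover the '00'–'09' range.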
After creating the HBase table, I create an external Hive table with the HBaseStorageHandler:
%hive
-- CREATE EXTERNAL HIVE-MANAGED HBASE TABLE!
CREATE EXTERNAL TABLE dmueller.my_test (
  key STRING,
  hashvalue INT,
  valuelist STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:hashValue,cf1:valueList")
TBLPROPERTIES ("hbase.table.name" = "my_test",
               "hbase.mapred.output.outputtable" = "my_test")
Then I fill the HBase table by reading data (1000 rows) from another Hive table into this external Hive Table:
%hive
-- INSERT DATA INTO THE MANAGED TABLE
INSERT OVERWRITE TABLE dmueller.my_test
SELECT concat_ws("_", testname, lotnrc, testnumber, teststufe) AS key,
       hashvalue,
       valuelist
FROM dmueller.hivetable
LIMIT 1000
The problem is that the INSERT statement is far too slow: it takes around 4 minutes to insert the new values (1000 rows, reading 6 columns)! The source table dmueller.hivetable is a partitioned ORC table with around 60 million rows and ~500 GB of ORC files in HDFS.
How can this INSERT statement be made faster? What am I doing wrong?
Update - Some more information:
I'm using Hive on Tez. Writing less data takes less time, but not proportionally less: a single row already takes ~30 sec, so most of the time seems to be fixed overhead rather than per-row cost. Reading the data (e.g. the 1000 rows) directly via the Zeppelin Hive (JDBC) interpreter takes ~2 sec. Re-running the same INSERT statement (e.g. with 1000 rows) always takes about the same time (+/- a few seconds, of course).
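The two reported timings are consistent with a simple linear model dominated by fixed job-startup overhead rather than per-row write cost. A back-of-envelope fit (assuming t = overhead + rows * per_row, which is only a rough model):

```python
# Two data points from the observations above:
t1, n1 = 30.0, 1      # ~30 s to insert 1 row
t2, n2 = 240.0, 1000  # ~4 min to insert 1000 rows

# Fit t = overhead + rows * per_row through both points
per_row = (t2 - t1) / (n2 - n1)
overhead = t1 - n1 * per_row

print(round(per_row, 2))   # ~0.21 s per row
print(round(overhead, 1))  # ~29.8 s of fixed overhead (e.g. Tez job startup)
```

If this model holds, making the job start faster (e.g. pre-warmed Tez sessions) would matter far more than tuning the per-row write path at this scale.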
Created 08-01-2018 02:13 PM
Formatting of code is not saved, sorry!
Created 08-01-2018 02:59 PM
How are you using Hive (MapReduce, Tez, LLAP)?
Can you add some context about where you think the slowness is? E.g. how long does it take just to read that data from Hive (run the SELECT alone)? Can you tell from the logs how much time is actually spent writing data to HBase? If you rerun the same INSERT, does it always take this much time? If you change the LIMIT, does 2000 rows take twice as long to insert?
Created 08-01-2018 07:05 PM
@Josh Elser Thank you for your support.
I updated my question with some more information to answer a few of the questions.
Where can I find that information in the logs? Can you tell me which file(s) on which server (HBase Master, HBase RegionServer) are helpful?
I will do some more "benchmarking" and log searching tomorrow.
Any hints or assumptions yet?
Created 08-01-2018 07:11 PM
Are these inserts into Hive single inserts or batch inserts?
If they are single inserts, then the 1000 separate statements will naturally take a long time to complete.
If they are batch inserts, then we need to take a look into the HS2 logs to identify where the query is spending most of its time.
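The single-vs-batch distinction above can be sketched generically. This illustrative snippet uses Python's sqlite3 purely as a stand-in (not Hive): each standalone statement pays per-statement overhead, whereas one set-based INSERT ... SELECT moves all rows in a single operation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (k INTEGER, v TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(i, f"val{i}") for i in range(1000)])
conn.execute("CREATE TABLE single_t (k INTEGER, v TEXT)")
conn.execute("CREATE TABLE batch_t  (k INTEGER, v TEXT)")

# "Single inserts": one statement per row -- 1000 separate statements,
# each paying its own per-statement overhead
for k, v in conn.execute("SELECT k, v FROM src").fetchall():
    conn.execute("INSERT INTO single_t VALUES (?, ?)", (k, v))

# "Batch insert": one set-based statement, like Hive's INSERT ... SELECT
conn.execute("INSERT INTO batch_t SELECT k, v FROM src")

single_count = conn.execute("SELECT COUNT(*) FROM single_t").fetchone()[0]
batch_count = conn.execute("SELECT COUNT(*) FROM batch_t").fetchone()[0]
print(single_count, batch_count)  # 1000 1000
```

An INSERT OVERWRITE TABLE ... SELECT like the one in the question is the set-based form: it runs as one job, not 1000 statements.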
Created 08-01-2018 08:12 PM
Good question. I'm just running the INSERT INTO ... SELECT * FROM ... statement, as I would e.g. in Beeline or the Ambari Hive View (JDBC). Does this run in single-insert or batch mode?