One popular technique for upserting (updating/deleting) data on Hadoop is to combine HBase and Hive. I have seen architecture slides from various companies that stream data from Kafka into HBase and then materialize the HBase tables to Hive or Impala once a day.
We tested this approach on our 6-node cluster and, while it works, the last step (persisting the HBase tables to Hive) is extremely slow. For example, one of my tables had 22 million rows and it took an hour(!) to persist it to Hive using the Hive-HBase storage handler.
I also tried the Hive-over-HBase-snapshots feature; it was about twice as fast but still took a long time.
Is it supposed to be this slow? It is hard to imagine how this would work with billion-row tables...
Hi @Boris Tyukin,
Indeed, it can take a long time depending on the size of the table.
I would recommend writing a small Spark job that bulk-reads the HBase table and then inserts the data into your Hive table. You can do that by writing the data in ORC format directly to HDFS as a new partition. Afterwards, you will need to register the new partition with the Hive table.
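A rough sketch of what that Spark job could look like, as a PySpark script. This assumes the Spark-HBase connector (SHC) is on the classpath; the table names, the catalog mapping, and the HDFS path are hypothetical placeholders, not anything from your cluster:

```python
def add_partition_ddl(table, part_col, part_val, location):
    """Build the HiveQL that registers an externally written ORC partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({part_col}='{part_val}') LOCATION '{location}'"
    )

def main():
    # pyspark is only available on the cluster, so import it here.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hbase-to-hive-orc")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical SHC catalog: maps HBase column family/qualifier pairs
    # to DataFrame columns for an HBase table named "events".
    catalog = """{
      "table": {"namespace": "default", "name": "events"},
      "rowkey": "key",
      "columns": {
        "id":  {"cf": "rowkey", "col": "key", "type": "string"},
        "val": {"cf": "d",      "col": "val", "type": "string"}
      }
    }"""

    # Bulk-read the HBase table as a DataFrame via the connector.
    df = (spark.read
          .options(catalog=catalog)
          .format("org.apache.spark.sql.execution.datasources.hbase")
          .load())

    # Write ORC straight to HDFS as a dated partition directory.
    location = "hdfs:///warehouse/events_orc/ds=2019-01-01"
    df.write.mode("overwrite").orc(location)

    # Register the new partition so Hive can see it.
    spark.sql(add_partition_ddl("events_orc", "ds", "2019-01-01", location))

# On the cluster you would run this via spark-submit with the SHC package,
# e.g. spark-submit --packages <shc-coordinates> hbase_to_hive.py
```

This avoids the row-by-row scanning the storage handler does at query time: Spark reads HBase regions in parallel once, and Hive afterwards only ever touches the ORC files.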
It may also be possible to transfer the data efficiently with Pig or Sqoop.
Thanks for your response @msumbul. I tried Sqoop and also the HBase MR export tool, and both were really slow. I am just curious how other companies deal with this conceptually, because I have seen that it is a very popular design.