Created on 08-23-2017 12:31 PM - edited 09-16-2022 05:08 AM
One of the popular techniques for upserting (updating / deleting) data on Hadoop is to use a combination of HBase and Hive. I have seen architecture slides from various companies that stream data from Kafka to HBase and then materialize the HBase tables to Hive or Impala once a day.
We tested this approach on our 6-node cluster and, while it works, the last piece (persisting HBase tables to Hive) is extremely slow. For example, one of my tables had 22 million rows and it took 1 hour(!) to persist that table to Hive using the Hive-HBase storage handler.
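For reference, the flow I'm testing looks roughly like this (the table, column family and column names below are simplified placeholders, not our real schema):

```sql
-- External Hive table mapped onto the existing HBase table via the storage handler.
-- 'encounter', column family 'd' and the columns below are placeholder names.
CREATE EXTERNAL TABLE encounter_hbase (
  rowkey       STRING,
  patient_id   STRING,
  encounter_ts STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,d:patient_id,d:encounter_ts'
)
TBLPROPERTIES ('hbase.table.name' = 'encounter');

-- The daily materialization step (this is the part that is extremely slow for us).
DROP TABLE IF EXISTS encounter_copy;
CREATE TABLE encounter_copy AS
SELECT * FROM encounter_hbase;
```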
I also checked the Hive-over-HBase-snapshots feature; it was about 2x faster, but still took a long time.
Is it supposed to be that slow? It is hard to imagine how this would work with billion-row tables...
Created 08-24-2017 02:43 AM
Hi Boris,
What are the specs of your worker nodes (memory and CPU), and how much of that is dedicated to YARN?
When you launch your Hive query to generate the output in a Hive table, how many containers can be instantiated at the same time, and how many are needed in total to complete the processing?
How many cells per row?
Is the query a simple "SELECT *" or something more complex?
An alternative would be to give access directly to your HBase table "as is", without making a clone.
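For instance, such a table can be queried in place, roughly like this (hypothetical names; if I remember correctly, the storage handler can push row-key predicates down to HBase):

```sql
-- Query the HBase-backed Hive table directly, without copying it first.
-- Point lookups / small range scans on the mapped row key should stay fast;
-- full scans and large joins will still be slow.
SELECT patient_id, encounter_ts
FROM encounter_hbase
WHERE rowkey = 'patient-0042';
```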
Created 08-24-2017 05:59 AM
Hi Mathieu, thanks for your response. I'm just curious conceptually how other companies are doing it and what real-world numbers look like for materializing tables from HBase to Hive.
Our test cluster has 6 nodes. Each has 88 hyperthreaded cores and 256 GB RAM, and all 6 are workers. When we tested HBase, I think YARN could use about 1 TB of RAM and 400 cores in total; the rest went to HBase. My test tables were also very small: one had 1.7M rows and the other 22M rows. I would load the initial data with Sqoop directly into HBase (which was also slow, by the way, because it used a bunch of reducers) and then start pushing incremental changes to HBase with Sqoop. I would then create an external table in Hive pointing at the HBase table directly or at a snapshot (which was about 2x faster) and run a CREATE anothertable AS SELECT * FROM hbasetable. I did notice it would use only one mapper (!), so I split the table in HBase a few times:
Encounter table, 22.8M rows:

| Test | Time | Notes |
| --- | --- | --- |
| Initial load, Sqoop >> HBase | 5,237 sec | Reducers took a while |
| HBase to Hive, 3 mappers | 11,700 sec (~3.25 hrs) | Took forever! |
| HBase to Hive, 33 mappers | 3,635 sec | Still a lot, but ~3x faster |
| HBase to Hive, 158 mappers | Killed after 1 hr 40 min | |
and these are my tests with a smaller table:
Hive over HBase snapshots check, Organization table, 1.7M rows (VM):

| Test | Time |
| --- | --- |
| No snapshot, count(1) | 53 sec |
| No snapshot, CTAS: create table organization_hbase_copy as select * from organization | 148 sec |
| Snapshot, count(1) | 43 sec |
| Snapshot, CTAS: drop table if exists organization_hbase_copy; create table organization_hbase_copy as select * from organization | 63 sec |
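For reference, the snapshot runs were done roughly like this. I'm quoting the property names from memory (from the Hive-over-HBase-snapshots feature), so they may differ slightly by version:

```sql
-- A snapshot of the HBase table is taken first on the HBase side,
-- then Hive is pointed at the snapshot instead of the live table.
-- Property names quoted from memory; the restore dir is an arbitrary scratch path.
SET hive.hbase.snapshot.name=organization_snap;
SET hive.hbase.snapshot.restoredir=/tmp/hbase_snapshot_restore;

DROP TABLE IF EXISTS organization_hbase_copy;
CREATE TABLE organization_hbase_copy AS
SELECT * FROM organization;
```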
Does that help?
Created 08-24-2017 06:02 AM
And I forgot to mention why I need to materialize them: we use Impala a lot for analytical queries and we like to keep those queries running fast. While I can use Impala directly with HBase tables, our typical queries join 7-10 tables, and that does not work well against HBase.
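To be concrete, the end state I want for Impala is just a plain copy with stats, roughly like this (hypothetical table names; Parquet chosen as an example target format):

```sql
-- Materialize the HBase-backed table into a plain Hive table once a day (run in Hive).
DROP TABLE IF EXISTS encounter_analytics;
CREATE TABLE encounter_analytics STORED AS PARQUET AS
SELECT * FROM encounter_hbase;

-- Then refresh Impala's view of the table and gather stats for the join-heavy queries
-- (run from impala-shell):
-- INVALIDATE METADATA encounter_analytics;
-- COMPUTE STATS encounter_analytics;
```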
Created 08-24-2017 07:21 AM
Do you have access to the Cloudera Manager UI?
If yes, you can check the YARN configuration there and see how much CPU and RAM is allocated per node.
You can also check from the YARN UI how much CPU and RAM in total is allocated to YARN (and how much is currently consumed by running containers).
From what you said, I have some questions:
- 88 hyperthreaded cores! Oh gosh, those are some powerful servers. Just be sure not to overallocate vcores in YARN; that could lead to poor overall performance.
- Do you have master nodes (yes)? But are they on the same nodes as the workers?
- I'm especially curious about that part: "I did notice it would use only one mapper".
  Do you mean that when executing the query "CREATE anothertable AS SELECT * FROM hbasetable", the YARN application launched uses only one mapper to process the query?
  - anothertable is a Hive table (backed by HDFS)
  - hbasetable is a Hive table (backed by HBase)
  - hbasetable is the table with 22M rows
  If yes, that is definitely not normal.
- By the way, which parameters differed between your tests? (The number of mappers shouldn't vary that much between executions, unless you changed something.)
- On the tests with 33 and 158 mappers, how was the CPU behaving on the worker nodes? Were they at 10% CPU usage or at 100%?
  Was there some contention (pending containers)?
  What was the average time of one mapper? (Was the processing time the same for both tests, or was it longer for the test with 158 mappers?)
Created 08-24-2017 10:39 AM