Support Questions

Find answers, ask questions, and share your expertise
Announcements
Check out our newest addition to the community, the Cloudera Data Analytics (CDA) group hub.

HBase and Hive for incremental data loads

Explorer

One of the popular techniques to upsert (update / delete) data on Hadoop is to use a combination of HBase and Hive. I saw a bunch of architecture slides from various companies, streaming data from Kafka to HBase and then materialize HBase tables once a day to Hive or Impala.

 

We tested this approach on our 6 node cluster and while it works, the last piece (persist HBase tables to Hive) is extremely slow. For example, one of my tables had 22 million rows and it took 1 hour(!) to persist that table to Hive using hive-hbase handler. 

 

I also checked Hive over HBase snapshot feature and it was 2 times faster but still took a long time.

 

Is it supposed to be that slow? It is hard to imagine how it is going to work with billion row tables...

5 REPLIES 5

Super Collaborator

Hi Boris,

 

What are the specs of your worker nodes ? (memory and cpu) and how much is dedicated to yarn ?

 

When you launch your hive query for generating the ouput in an Hive table how many containers can be instanciated at the same time and how many in total are needed to complete the processing ?

 

How many cells per row ?

 

The query launched is a simple "SELECT *" or a more complex query ?

 

An alternative would be to give access directly to your HBase table "as is" without making a clone.

 

 

Explorer

Hi Mathieu, thanks for your response. I just curious conceptionally how other companies are doing it and what real world numbers are for materialization of tables from HBase to Hive. 

 

Our test cluster is 6 nodes cluster. Each has 88 hyperthreaded cores, 256Gb RAM. All 6 are workers. When we tested HBase, I think yarn could use 1Tb of RAM total and 400 cores. The rest was Hbase. My test tables were also very small - one was 1.7M rows and another one 22M rows. I would load initial data with sqoop directly to HBase (which was also slower btw because it would use a bunch of reducers) and then I would start pushing incremental changes to Hbase using sqoop. When I would create external table in HIVE pointed to HBase table directly or a snapshot (which was x2 times faster) and do a CREATE anothertable as SELECT * FROM hbasetable

. I did notice it would use only one mapper (!) and I split the table in HBase a few times:

 

Encounter table 22.8M rows

 

 

Initial load sqoop >> hbase

5237 sec

Reducers took a while

Hbase to hive 3 mappers

11,700 sec

3.5 hrs

Took forever!

Hbase to hive 33 mappers

3635 sec

Still a lot, but x3 faster

Hbase to hive 158 mappers

Killed after 1 hour 40 min

 

 

and this is my tests with a smaller table 

 

Check Hive over Hbase snapshots

Organization table 1.7M

VM

 

No snapshot - count(1)

53 sec

No snapshot - CTAS

create table organization_hbase_copy as select * from organization;

148 sec

snapshot - count(1)

43 sec

Snapshot - CTAS

Drop table if exists organization_hbase_copy;

create table organization_hbase_copy as select * from organization;

63 sec

 

 

does it help?

Explorer

and I forgot to mention why I need to materialize them - we use Impala a lot for analytical queries and we like to keep our queries running fast. While I can use Impala directly with HBase tables, our typical queries will have 7-10 tables  joined and this is not working with HBase. 

Super Collaborator

Do you have access to Cloudera manager UI ?

If yes you can consult there the configuration of Yarn and how many CPU and RAM are allocated per nodes.

You can also check from the Yarn UI how many CPU and RAM total is allocated to Yarn (and how much is currenlty "consumed" for running containers).

 

From what you said I have some inquieries :

>

88 hyperthreaded cores ! Oh gosh that is some powerful servers. Be just sure to not overallocate vcores in Yarn. This could lead to poor performance overall.

 

>

Do you have master nodes (yes) ? But are they on the same nodes as the "worker" ?

 

>

I'm especially curious about that part "I did notice it would use only one mapper".

 

Do you mean that when executing the query "CREATE anothertable as SELECT * FROM hbasetable", the yarn application launched use only one mapper for processing the query ?

- anothertable is a hive table (backed on HDFS)

- hbasetable is a hive table (backed on HBase)

- hbasetable is the table with 22m of rows

 

If yes, that is definetely not normal.

 

>

By the way, which parameters differ from your several tests (the number of mapper shouldn't be that different between execution. Unless you change something) ?

 

>

On the tests with 33 and 158 mappers how was behaving the CPU on the worker nodes ? were they at 10% CPU usage or 100% ?

Was there some contention (pending containers) ?

Average time of one mapper ? (was it the same processing time for both tests, or did the processing time was longer for the test with 158 mappers ?)

Explorer
problem is when I did select from hbase to hive it would not even use a lot
of containers. First time I ran it, it only used 1 mapper. We allocate 1
CPU core and 4Gb of RAM to a yarn container. We can run ~240 containers at
once but we i did the split in hbase cli, it would only use 1 mapper.


Do you mean that when executing the query "CREATE anothertable as SELECT *
FROM hbasetable", the yarn application launched use only one mapper for
processing the query ?
exactly

By the way, which parameters differ from your several tests (the number of
mapper shouldn't be that different between execution. Unless you change
something) ?I did split mytable command in hbase shell. First time I did
the split, it started using 2 mappers. I did a split again, it started
using 5 or 6..I kept doing it and running tests every time 🙂

On the tests with 33 and 158 mappers how was behaving the CPU on the worker
nodes ? were they at 10% CPU usage or 100% ? it was barely using CPUs


Was there some contention (pending containers) ? no, this was the only job
running on our cluster barely using a few containers.

Do you have any benchmarks for similar type of job? what numbers in terms
of rows per sec one should I expect? or some examples with larger tables? I
cannot find any numbers on the web, just pictures of this design. I am
going to Strata next month and I will just stop people and ask them this
question 🙂





Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.