
Performance of Joining big Hbase table on rowkey


Expert Contributor

I have read in many places that HBase does not perform well for joins but performs well for random reads and writes. My question is whether it would still perform well for a bulk read by full rowkey, for example reading 30% of the table where the requested rowkeys are random and spread across the whole table rather than concentrated in a few regions.

Consider an HBase table whose regions are evenly distributed across many region servers. If a Hive external table is created over this HBase table and joined with a Hive managed table on the HBase rowkey, would the huge number of rowkey lookups against HBase during the join become a performance bottleneck?

If so, could you please explain why?

Thanks!

1 REPLY

Re: Performance of Joining big Hbase table on rowkey

Mentor

I can't speak to the exact performance hit in this scenario, but with other workloads hitting HBase at the same time it could still be an issue. There is a useful feature that lets you map a Hive schema to an HBase snapshot, which promises much better performance than querying HBase directly. Please take a look at https://community.hortonworks.com/content/kbentry/14806/working-with-hbase-and-hive-wip.html for an example. Essentially, you map a Hive external table to an HBase snapshot, run your analysis, and then remove the snapshot. This bypasses the HBase region servers (RS) altogether and reads the data with MapReduce (MR) instead.
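
To make the snapshot workflow concrete, here is a minimal Python sketch that only assembles the statements you would run, in order: take the snapshot in the HBase shell, point Hive at it, run the join, then delete the snapshot. It does not connect to anything. The helper `snapshot_workflow`, the table names `events`, `hbase_events` and `managed_dim`, and the restore directory are hypothetical placeholders. The two Hive properties are, as far as I know, the ones Hive's HBase storage handler uses for snapshot reads, so please verify them against your Hive version.

```python
def snapshot_workflow(hbase_table, snapshot_name, restore_dir, hive_query):
    """Build the ordered statements for a snapshot-backed Hive query.

    Nothing is executed here; the caller pastes the statements into the
    HBase shell and a Hive session. All argument values are placeholders.
    """
    # Step 1 (HBase shell): take a snapshot of the live table.
    hbase_before = [f"snapshot '{hbase_table}', '{snapshot_name}'"]

    # Step 2 (Hive): tell the HBase storage handler to read the snapshot
    # instead of going through the region servers, then run the query.
    # Assumption: these property names match your Hive release.
    hive_steps = [
        f"SET hive.hbase.snapshot.name={snapshot_name};",
        f"SET hive.hbase.snapshot.restoredir={restore_dir};",
        hive_query.rstrip().rstrip(";") + ";",
    ]

    # Step 3 (HBase shell): remove the snapshot once the analysis is done.
    hbase_after = [f"delete_snapshot '{snapshot_name}'"]

    return {
        "hbase_before": hbase_before,
        "hive": hive_steps,
        "hbase_after": hbase_after,
    }


if __name__ == "__main__":
    # Hypothetical join between an HBase-backed external table and a
    # Hive managed table on the rowkey column.
    query = (
        "SELECT d.*, e.* FROM managed_dim d "
        "JOIN hbase_events e ON d.rowkey = e.rowkey"
    )
    steps = snapshot_workflow("events", "events_snap", "/tmp/hbase_restore", query)
    for phase in ("hbase_before", "hive", "hbase_after"):
        print(f"-- {phase}")
        for statement in steps[phase]:
            print(statement)
```

The external table definition itself stays the same as for a direct HBase mapping; only the session properties change where the data is read from.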