Support Questions
Find answers, ask questions, and share your expertise

HBASE for Joins

Super Collaborator


Is HBase good for joins?



It's always good to de-normalize tables before storing them in HBase, but if a join is required, there are some efficient ways to join HBase tables.
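To make the de-normalization idea concrete, here is a minimal sketch in which plain Python dicts stand in for HBase rows; the table layout, column qualifiers, and names are illustrative assumptions, not HBase API calls. An order and its line items are folded into one wide row, so the data that would otherwise require a join is retrieved with a single keyed read.

```python
# Sketch: de-normalizing an order and its line items into a single
# HBase-style wide row, instead of joining two tables at read time.
# Plain dicts stand in for HBase rows; all names are made up.

order = {"order_id": "o-1001", "customer": "acme"}
line_items = [
    {"sku": "A17", "qty": 2},
    {"sku": "B42", "qty": 1},
]

# One wide row: each line item becomes its own column qualifier,
# e.g. items:0:sku, items:0:qty -- everything comes back in one Get.
row_key = order["order_id"]
row = {"info:customer": order["customer"]}
for i, item in enumerate(line_items):
    row[f"items:{i}:sku"] = item["sku"]
    row[f"items:{i}:qty"] = str(item["qty"])  # HBase stores raw bytes

print(row_key, row)
```

The trade-off is write-time complexity for read-time simplicity: the writer assembles the wide row, and every read is a cheap keyed lookup.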

Super Guru

@Roberto Sancho HBase uses a different data-modeling approach, often called domain-driven design, whereas relational databases use relational theory for data modeling. With domain-driven design, the access path into the domain is the primary consideration. A join is therefore not required, because the answers are modeled into the table design itself.

Think of each table as a Java object that is highly denormalized once serialized.


HBase is a key-value storage layer. We have several advanced, optimizer-driven join strategies in Splice Machine (open source) that are executed in the Spark execution engine. Check it out. Cheers, John

Super Guru

@Roberto Sancho

This is a very open-ended question. Consider the following. I'll first give an overview of how HBase works and then talk about what happens in a join (point 8 below). So bear with me, and I hope I will get my point across.

1. When you store data in HBase, it is sorted by key and stored as key-value pairs (a row key that corresponds to a set of column families/columns). You can have billions of rows, each with thousands of columns (spread across a few column families).

2. This data is distributed across a number of nodes in the HBase cluster. Each node is called a region server, and each region server contains a number of regions.

3. Data is distributed across the cluster using key ranges as boundaries.

4. When you access the data, you should know the key. HBase knows exactly where the key is and will return the data you requested for that key in single-digit to low-double-digit milliseconds.

5. Because the data is stored sorted by key, you can also do what's called a "scan": you specify the key at which you want to start reading and how far you want to read. Sorted data means the keys are physically adjacent on disk, which saves disk seek time by avoiding random reads.
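The sorted-scan behavior in point 5 can be sketched in a few lines, with a sorted Python list of row keys standing in for an HBase region (the key names are made up). Binary search finds the start position, and the result is one contiguous slice, which is what makes range scans cheap.

```python
import bisect

# Sketch: why sorted keys make range scans cheap. A sorted list of
# row keys stands in for an HBase region; bisect finds the start
# position, then we read contiguous keys up to (but excluding) the
# stop key -- no random seeks.

row_keys = sorted(["user01", "user02", "user05", "user07", "user09"])

def scan(start_row, stop_row):
    """Return keys in [start_row, stop_row), like an HBase Scan."""
    lo = bisect.bisect_left(row_keys, start_row)
    hi = bisect.bisect_left(row_keys, stop_row)
    return row_keys[lo:hi]

print(scan("user02", "user07"))  # a contiguous slice, read in order
```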

6. Now imagine you want to read HBase data but you don't know the key. You say, "I want to read wherever a particular column value is 'xyz'." HBase on its own has no way of knowing which rows contain a given column value. Unlike an RDBMS, there are no secondary indexes (you can maintain an index in your own tables, but I'll come to that). So in this case you will be doing what's called a full table scan (potentially over tens or even hundreds of TBs of data).
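The "index in your own tables" idea mentioned in point 6 looks roughly like this; dicts stand in for the two HBase tables, and the table and column names are invented for illustration. The index table inverts a column value back to row keys, turning a value lookup into keyed reads instead of a full scan.

```python
# Sketch: a hand-rolled "index table" to avoid full table scans when
# you only know a column value. The main table maps row key -> data;
# the index table maps column value -> row keys. In real HBase the
# index would be a second table you keep in sync yourself.

main_table = {
    "row1": {"cf:city": "Madrid"},
    "row2": {"cf:city": "Paris"},
    "row3": {"cf:city": "Madrid"},
}

# Build the index on cf:city at write time.
index_city = {}
for rk, cols in main_table.items():
    index_city.setdefault(cols["cf:city"], []).append(rk)

# "Where city = Madrid" becomes a few keyed reads, no table scan.
keys = index_city.get("Madrid", [])
rows = [main_table[k] for k in keys]
print(keys)
```

The cost is that every write must update both tables, and you are responsible for keeping them consistent.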

7. So you have to design HBase tables in a way that you always access them by key. Key design is the single most important consideration in HBase table design.
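One common key-design pattern worth sketching here is the composite row key; the user-id/timestamp layout below is an assumed example, not anything prescribed by HBase. Encoding a reversed, zero-padded timestamp after the user id makes "latest events for a user" a short scan that returns newest rows first.

```python
# Sketch: composite row key design. To read "latest events for user X"
# as one short scan, encode the user id plus a reversed timestamp so
# that newer events sort first lexicographically. Layout is made up.

MAX_TS = 10**13  # larger than any epoch-millis value we expect

def make_key(user_id, ts_millis):
    # Zero-pad so lexicographic order matches numeric order.
    return f"{user_id}#{MAX_TS - ts_millis:013d}"

k_new = make_key("u42", 1_700_000_000_000)
k_old = make_key("u42", 1_600_000_000_000)
print(sorted([k_old, k_new]))  # the newer event sorts first
```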

8. Now imagine you want to join two tables. That's just not going to work if you don't know exactly the keys of the rows containing the data you want to join.
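The flip side of point 8 is the join that does work in HBase: when one side already carries the other side's row keys, the join degenerates into cheap point lookups. A minimal sketch, with dicts standing in for two HBase tables and all names invented:

```python
# Sketch: a keyed join that works well in HBase. Each order row
# already stores the customer's row key, so we iterate orders and do
# point lookups (Gets) into the customers table instead of scanning.

orders = {
    "o1": {"cust": "c1", "total": "30"},
    "o2": {"cust": "c2", "total": "15"},
}
customers = {
    "c1": {"name": "Acme"},
    "c2": {"name": "Globex"},
}

joined = []
for oid, order in orders.items():
    cust = customers.get(order["cust"])  # keyed Get, cheap
    if cust is not None:
        joined.append((oid, cust["name"], order["total"]))

print(joined)
```

This only stays efficient because every lookup is by exact key; the moment either side requires a scan by value, you are back to the full-table-scan problem from point 6.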

But here is something beautiful about HBase: you can have billions of rows with thousands of columns each, on only a handful of data nodes. I have seen a 70-node cluster with a trillion-plus-row table (a tall, narrow table design).

So imagine how much you can denormalize, and how much data you can put together in not only one table but even one row. Depending on your use case, you might end up with a design that answers your queries without any joins. Or you might still need a join, in which case you can create a table that serves as your own custom index.

Here is the last thing I want you to consider: is this for an analytical use case? Let's say you solve the problem of joins by denormalizing into a design that avoids them. But what about the hundreds of other features relational databases offer, like aggregate functions that make GROUP BY easy, analytic windowing functions, and so on?

I hope this gives you an idea to move forward on this.

Expert Contributor

I have a question based on your point 8. What if I join HBase tables on the entire row key? Say the HBase table is huge (~20 million rows) and the join involves almost 20% of the table at random, so it may read data from multiple region servers, not just one specific RS. Could you please let me know whether this kind of join on the HBase row key would be efficient, as long as it does not scan everything on the same RS?

Also, if I create an external table in Hive for the corresponding HBase table, and join another Hive table (whose key is bucketed and not skewed) with the HBase table on the row key (through the Hive external table), would the HBase scan be a bottleneck in this scenario? Please let me know your thoughts.