Why is HBase column-oriented?

Hi,

I was going through some HBase documentation and got confused by its column-oriented design.

Is HBase well suited to record-level reads and updates? I am wondering: if the data is stored by column, then fetching an entire row should take time. As I understand it, column-oriented storage is good for analytical queries, while row-oriented storage is a better fit for record-level updates.

Can anyone clarify how HBase stores data in a column-oriented way and still gives good performance when scanning data?

Re: Why is HBase column-oriented?

HBase is a good fit for record-oriented data access.

HBase follows the BigTable design, which describes a data layout called "locality groups". A locality group is a grouping of columns, across rows, that are frequently accessed together. It is meant to avoid the penalty of filtering out (server-side) data that the client already knows it does not want.

In HBase, each column family is a locality group. Within a column family, all the columns (qualifiers) of a row are stored adjacent to one another in a single file. Thus, fetching a row is not a concern when you only request columns from a single family; this is the optimal case.
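To make the "stored adjacent" point concrete, here is a minimal Python sketch. It is not HBase code: the sorted list `PROFILE_FAMILY` and the helper `get_row` are hypothetical stand-ins for one column family's sorted file, assuming cells are ordered by (row key, qualifier). Fetching a row is then one seek plus a short contiguous read.

```python
from bisect import bisect_left

# Hypothetical stand-in for one column family's store file:
# cells as (row, qualifier, value) tuples, sorted by (row, qualifier).
PROFILE_FAMILY = sorted([
    ("user1", "email", "a@example.com"),
    ("user1", "name", "Alice"),
    ("user2", "email", "b@example.com"),
    ("user2", "name", "Bob"),
    ("user3", "name", "Carol"),
])


def get_row(store, row):
    """Return {qualifier: value} for one row by reading a contiguous slice."""
    # (row,) sorts before every cell of that row, so this finds the
    # first cell of the row (or where it would be if the row is absent).
    start = bisect_left(store, (row,))
    result = {}
    for r, qualifier, value in store[start:]:
        if r != row:
            break  # past the row's cells; nothing else needs to be read
        result[qualifier] = value
    return result


print(get_row(PROFILE_FAMILY, "user2"))
```

All of a row's qualifiers sit next to each other, so no unrelated rows are touched after the initial seek.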

There is a (very) minor concern when you request all the columns of a row that spans multiple column families. In this case, the data for each family is stored in separate files. Because each file is sorted by row key, combining them is a simple merge of sorted data, which is very efficient. So, as long as your schema does not put each column in its own family (which would be considered a bad schema decision), the cost is relatively low. With one column per family, every column queried would require opening another file, which increases HBase memory use, HDFS utilization, and so on. Most use cases have only a few column families (e.g. 1-5), and many use a single family.
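The merge of per-family files can be sketched the same way. Again, this is an illustration in plain Python rather than HBase's implementation: `PROFILE`, `STATS`, and `scan_rows` are hypothetical names, and each list stands in for one family's file, already sorted by row key.

```python
from heapq import merge
from itertools import groupby

# Hypothetical per-family store files; each is already sorted.
# Cells are (row, family, qualifier, value) tuples.
PROFILE = [
    ("user1", "profile", "name", "Alice"),
    ("user2", "profile", "name", "Bob"),
]
STATS = [
    ("user1", "stats", "logins", "7"),
    ("user3", "stats", "logins", "2"),
]


def scan_rows(*families):
    """Yield (row, {"family:qualifier": value}) by merging sorted inputs."""
    # heapq.merge streams the inputs in sorted order without re-sorting,
    # so the work grows with the number of families, not with a full sort.
    merged = merge(*families)
    for row, cells in groupby(merged, key=lambda cell: cell[0]):
        yield row, {f"{fam}:{qual}": val for _, fam, qual, val in cells}


for row, columns in scan_rows(PROFILE, STATS):
    print(row, columns)
```

Each extra family adds one more input stream to the merge, which is why a handful of families is cheap while one family per column is not.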
