
Performance impact of columnfamily and Version in Hbase

In HBase, I have a table that contains 30 columns in a single column family:

create 'my_table', { NAME => 'my_family', VERSIONS => 5 }

I want to increase the number of versions to 10,000:

create 'my_table', { NAME => 'my_family', VERSIONS => 10000 }

When I change VERSIONS to 10K, the change applies to all columns in the family, but my requirement is to change it for only 2 columns.

What will the performance impact be in each of these two cases?

  1. Create two different column families and set the version count for each accordingly
  2. Change the version count for all columns

Re: Performance impact of columnfamily and Version in Hbase

It would be better to create two column families, because preserving unnecessary versions for the other 28 columns will adversely affect both performance and storage.
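A minimal HBase shell sketch of that two-family layout might look like the following. The family names 'hist' and 'main' are hypothetical, and the sketch assumes the table is being created fresh; columns already written to 'my_family' do not move to another family on their own, so existing data would likely need to be rewritten.

```ruby
# Hypothetical family names: 'hist' holds the 2 high-version columns,
# 'main' holds the other 28 columns with the original version count.
create 'my_table', { NAME => 'hist', VERSIONS => 10000 }, { NAME => 'main', VERSIONS => 5 }

# For comparison, option 2 from the question on an existing table:
# raise the version count for the whole single family.
alter 'my_table', { NAME => 'my_family', VERSIONS => 10000 }
```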

Re: Performance impact of columnfamily and Version in Hbase

I am confused by these two statements.

Could you help explain whether and how each of them would have an impact?

If you query both column families together, then specifying more column families leads to more flushes and more I/O operations.

If there are two column families A and B, and the cardinality of A is 1 million while that of B is 1 billion, the data of A is spread across many regions and region servers. This makes mass scans of column family A less efficient.

Re: Performance impact of columnfamily and Version in Hbase

If you need 10k versions for only two columns and you are willing to let HBase prune excess versions for the remaining 28 columns, then I think it's better to go with two column families.

This way, your store files will not grow as much, because unnecessary versions of the 28 columns are not stored. This in turn should mean fewer region splits and less compaction work, so I/O performance will be better with smaller and fewer store files.
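To put rough numbers on that, here is a small Ruby sketch that computes only the upper bound on retained cells per row, assuming every column is overwritten at least VERSIONS times. The family names and the helper `max_cells_per_row` are hypothetical, and real store file sizes also depend on value sizes, compression, and how often each column is actually updated.

```ruby
# Upper bound on cells HBase retains per row after major compaction:
# each column keeps at most VERSIONS cells.
def max_cells_per_row(families)
  families.sum { |f| f[:columns] * f[:versions] }
end

# Option 2: one family, all 30 columns at 10,000 versions.
single_family = [{ name: 'my_family', columns: 30, versions: 10_000 }]

# Option 1: two families (hypothetical names).
two_families = [
  { name: 'hist', columns: 2,  versions: 10_000 },
  { name: 'main', columns: 28, versions: 5 }
]

single_max = max_cells_per_row(single_family)
split_max  = max_cells_per_row(two_families)

puts "single family, max cells per row: #{single_max}"
puts "two families,  max cells per row: #{split_max}"
```

Under these assumptions the single-family layout can retain up to 300,000 cells per row, against 20,140 for the split layout.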

Your question: if you query both column families together, then specifying more column families leads to more flushes and more I/O operations.

Sorry, I don't understand why there would be a flush when we query a column family. Do you mean: if we write to both column families together, will there be any impact? No, there will not be any observable impact from writing to two column families together.

Your question: if there are two column families A and B, and the cardinality of A is 1 million while that of B is 1 billion, the data of A is spread across many regions and region servers. This makes mass scans of column family A less efficient.

Yes, regions are distributed by row key, so even if A has only 1 million rows, if those rows are well distributed across the row key space then you may need to scan all those regions. I don't think that will have much impact, but it can only be avoided by using a separate table for these two high-version columns.

Re: Performance impact of columnfamily and Version in Hbase

Is keeping 10k versions really a necessity? Can't this version/timestamp be moved into the row key in your data model?
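One way to read this suggestion is to store each version as its own row, with the timestamp appended to the row key, so that VERSIONS can stay small. The Ruby sketch below only illustrates the key construction; the helper `versioned_row_key`, the '#' separator, and the reversed-timestamp choice (so that newer rows sort first) are assumptions for illustration, not something stated in the thread.

```ruby
# Largest signed 64-bit value, matching Java's Long.MAX_VALUE.
MAX_LONG = 2**63 - 1

# Build a row key of the form "<entity>#<reversed timestamp>".
# Subtracting from MAX_LONG makes newer timestamps sort first, and
# zero-padding to 19 digits keeps lexicographic order equal to numeric order.
def versioned_row_key(entity_id, timestamp_ms)
  reversed = MAX_LONG - timestamp_ms
  format('%s#%019d', entity_id, reversed)
end

older = versioned_row_key('entity42', 1_700_000_000_000)
newer = versioned_row_key('entity42', 1_700_000_000_001)

puts older
puts newer
```

With keys like these, a prefix scan on 'entity42#' would return the most recent entries first, at the cost of spreading one logical row across many physical rows.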
