Support Questions

FrozenWave · ‎07-31-2016

Hi all, I have the following design question for my new table in HBase.

Scenario:

-------------

- Table will contain Customers Information
- Table would be refreshed every week by a procedure, inserting new info (see below)
- Row Key would be "Customer ID" (fixed)
- There would be fixed contents columns, e.g. "Name", "Surname"
- There would be variable contents columns, e.g. "Credit", "No. of Services subscribed", "Total Time used Service X"

The question:

------------------

- Should I take advantage of Column Versioning, e.g. every week putting in a new version for Column (e.g) "Total Time used Service X" ? So that the Table would have a fixed number of Columns, some of them with versions and others fixed?
- Or is it a better approach to NOT use Column Versioning, and for every new week of Data coming in just add a new Column named (e.g.) "Total Time used Service X - WEEK YY" ? In this case I'd put in the Week Number in the Column Name to be able to look up for it in later analysis

Please keep in mind that:

----------------------------------

- The main use will be to process the "Variable Information" columns later using a Spark procedure, therefore it is of CRITICAL IMPORTANCE the ability to process each and every "Time Series" easily, on the fly, without convoluted workarounds to manage e.g. Column Name and then loop through Columns in weird ways (this is why at the moment I'm thinking the "Column Versioning" solution would be the best one, but my knowledge of HBase is just basic and I'd like to hear other voices too before making a mistake)

- I'm proposing that the Row Key would be FIXED, but I'm open to other suggestions (e.g. multiple rows with variable Key for the same Customer Entity) if this would be the best approach in the described scenario. I just didn't want to mess up things too much describing my problem

Any insight and/or link to examples for the particular case will be very much appreciated!

Thanks

Harsh J · ‎08-14-2016

I'd recommend not using versions for anything but. If the data you're looking to store via versions does not naturally age out (such as via TTL or via version limits) then its better to store as defined columns instead.

Since your reads are going to be specific, going wider per row with growing # of columns shouldn't be a problem.

Of course the key design could be thought about - but that depends on your primary read case. Scans allow you to grab specific time range slices easily so a TS-using key may be a good option too, but you'll need to think separately about serving profile information.

View solution in original post

Harsh J · ‎08-14-2016

I'd recommend not using versions for anything but. If the data you're looking to store via versions does not naturally age out (such as via TTL or via version limits) then its better to store as defined columns instead.

Since your reads are going to be specific, going wider per row with growing # of columns shouldn't be a problem.

Of course the key design could be thought about - but that depends on your primary read case. Scans allow you to grab specific time range slices easily so a TS-using key may be a good option too, but you'll need to think separately about serving profile information.

FrozenWave · ‎08-15-2016

Thank you. Useful insight and crystal clear argumentation, as usual from you.

I have to say in the meanwhile I had the chance to study a bit more, and in the end I came to a conclusion which matches your considerations, therefore I'm glad that apparently I moved in the right direction.

In a matter of fact I've seen this Open Source project here http://opentsdb.net , and I've seen that generally speaking the approach they use is the last that you explained.

To provide a practical example, in my case:

- A new record every week for the same Customer Entity

- Therefore, column Versioning is NOT used at all! (like you suggested)

- "Speaking" record key e.g. "<CUST_ID> + <YYYY-MM-DD>"

- This sort of Key is not monotonically increasing, because the "CUST_ID" part is "stable", so this approach should be good also on a "Table Splitting" perspective (when the Table grows, it will split up "evenly" and all the Splits will take care of a part of the future inserts, balancing the Machines Load evenly)

- Same set of columns for each record containing the new sampled value for that field for that week e.g. "<Total progressive time used Service X>"

This is the approach I used in the end, which has nothing to do with my original idea of using Versions but perfectly matching the last approach you described in your answer.

Regarding the Fixed Values (e.g. "Name", "Surname") I've decided to replicate them every week too, as if they were Time Series themselves... I know, waste of storage. Planning on modifying this structure soon and move the Fixed Values in another Table (Hive or HBase, don't know yet) and pick up the information I'd eventually need at the moment (for instance, during Data Processing, I'll join in the relevant Anagraphic Data in the relevant Dataframes via Join).

I just wanted to write a few more lines about the issue for the posterity. I hope this post will be useful to people 🙂

Thanks again!

Cloudera Community

Support Questions

HBase Table Design - Multiple "Time Series" Columns for same record ID