07-31-2016 08:06 AM
Hi all, I have the following design question for my new table in HBase.
- Table will contain Customers Information
- Table would be refreshed every week by a procedure, inserting new info (see below)
- Row Key would be "Customer ID" (fixed)
- There would be fixed contents columns, e.g. "Name", "Surname"
- There would be variable contents columns, e.g. "Credit", "No. of Services subscribed", "Total Time used Service X"
- Should I take advantage of Column Versioning, e.g. every week putting in a new version for Column (e.g) "Total Time used Service X" ? So that the Table would have a fixed number of Columns, some of them with versions and others fixed?
- Or is it a better approach to NOT use Column Versioning, and for every new week of Data coming in just add a new Column named (e.g.) "Total Time used Service X - WEEK YY" ? In this case I'd put in the Week Number in the Column Name to be able to look up for it in later analysis
Please keep in mind that:
- The main use will be to process the "Variable Information" columns later using a Spark procedure, therefore it is of CRITICAL IMPORTANCE the ability to process each and every "Time Series" easily, on the fly, without convoluted workarounds to manage e.g. Column Name and then loop through Columns in weird ways (this is why at the moment I'm thinking the "Column Versioning" solution would be the best one, but my knowledge of HBase is just basic and I'd like to hear other voices too before making a mistake)
- I'm proposing that the Row Key would be FIXED, but I'm open to other suggestions (e.g. multiple rows with variable Key for the same Customer Entity) if this would be the best approach in the described scenario. I just didn't want to mess up things too much describing my problem
Any insight and/or link to examples for the particular case will be very much appreciated!
08-14-2016 11:47 AM
08-15-2016 05:15 AM - edited 08-15-2016 05:22 AM
Thank you. Useful insight and crystal clear argumentation, as usual from you.
I have to say in the meanwhile I had the chance to study a bit more, and in the end I came to a conclusion which matches your considerations, therefore I'm glad that apparently I moved in the right direction.
In a matter of fact I've seen this Open Source project here http://opentsdb.net , and I've seen that generally speaking the approach they use is the last that you explained.
To provide a practical example, in my case:
- A new record every week for the same Customer Entity
- Therefore, column Versioning is NOT used at all! (like you suggested)
- "Speaking" record key e.g. "<CUST_ID> + <YYYY-MM-DD>"
- This sort of Key is not monotonically increasing, because the "CUST_ID" part is "stable", so this approach should be good also on a "Table Splitting" perspective (when the Table grows, it will split up "evenly" and all the Splits will take care of a part of the future inserts, balancing the Machines Load evenly)
- Same set of columns for each record containing the new sampled value for that field for that week e.g. "<Total progressive time used Service X>"
This is the approach I used in the end, which has nothing to do with my original idea of using Versions but perfectly matching the last approach you described in your answer.
Regarding the Fixed Values (e.g. "Name", "Surname") I've decided to replicate them every week too, as if they were Time Series themselves... I know, waste of storage. Planning on modifying this structure soon and move the Fixed Values in another Table (Hive or HBase, don't know yet) and pick up the information I'd eventually need at the moment (for instance, during Data Processing, I'll join in the relevant Anagraphic Data in the relevant Dataframes via Join).
I just wanted to write a few more lines about the issue for the posterity. I hope this post will be useful to people :)