Support Questions
Find answers, ask questions, and share your expertise

Best way to store data in Hive/HBase that needs to be deleted while reading

New Contributor

I have the following scenario. Let's say that an etailer like Amazon or Ebay is storing user's orders for two years. My primary analytics jobs reads the data based on dates so I have a partition like so "20201229" where I store all the orders going back two years.

Now I have a requirement that when a user closes his account I need to delete all his/her orders across all days. The deletion might happens while I am running analytical jobs. So my question is which one is the best storage option for this? I only create a new partition per day where I insert all the data for the previous day. Only if there is an order that was received late I will update a previous date partition older than yesterday.

I am thinking about using ACID ORC transactions in Hive to allow for updates. However I am concerned about the performance impact to my read jobs. Also besides the partitioning would it be any bucketing techniques that would help here?

Would there be any technique like partitioning by a range of user ids that might be better? Assuming that I could come up with a range of user ids that is well distributed in that case I will create a hybrid where I have a main partition by date but in this case is every 6 months and below that partition I have another one by user id range assuming I can came up with a good strategy for that.

The other option might be HBase which supports ACID but it is much slow for analytics. I am not sure if this is an option. I would hope somebody can point if there is an advantage here. The other option is a No-Sql database that allows deletes/updates/inserts while reading. I am leading towards this solution unless I hear something better.