I have the following scenario. Suppose an e-tailer like Amazon or eBay stores users' orders for two years. My primary analytics jobs read the data by date, so I partition by day (e.g. "20201229") and keep partitions going back two years. I create one new partition per day, into which I insert all of the previous day's orders; I only update an older partition when an order arrives late. Now I have a requirement that when a user closes their account, I must delete all of their orders across all days. The deletion might happen while analytical jobs are running. So my question is: which storage option is best for this?

I am thinking about using Hive ACID transactions on ORC to allow updates, but I am concerned about the performance impact on my read jobs. Besides partitioning, are there any bucketing techniques that would help here? Would partitioning by a range of user IDs be better? Assuming I could come up with a range of user IDs that is well distributed, I could create a hybrid layout: a top-level partition by date (but every 6 months instead of daily), and below it a sub-partition by user ID range, assuming I can come up with a good strategy for that.

Another option might be HBase, which supports ACID but is much slower for analytics; I am not sure this is an option, and I would hope somebody can point out if there is an advantage here. The other option is a NoSQL database that allows deletes/updates/inserts while reading. I am leaning toward this solution unless I hear something better.
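For reference, here is a minimal sketch of the ACID layout I have in mind (table and column names are made up for illustration; Hive ACID requires a transactional ORC table, and bucketing by user_id should cluster each user's rows so a delete touches fewer files):

```sql
-- Hypothetical orders table: partitioned by day, bucketed by user_id,
-- stored as ORC with transactions enabled so DELETE is supported.
CREATE TABLE orders (
  order_id STRING,
  user_id  STRING,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (order_dt STRING COMMENT 'yyyymmdd')
CLUSTERED BY (user_id) INTO 64 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- On account closure: delete one user's orders across all partitions.
-- This writes delete deltas; readers merge them until compaction runs.
DELETE FROM orders WHERE user_id = 'u-12345';
```

The bucket count (64) is a guess; it would need tuning against actual data volume.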
I have a stream of log data coming in which I am saving to Hive. It is data produced from log files that track user activity, and I keep these records for two years. A user can request that their logs be deleted, in which case I need to go back two years and delete their records. My access pattern is by day, so my queries ask things like: over the last 6 months, what was the most accessed resource, and from which provider?
Each day I store 200 million records (there are about 30 more columns than shown below). Given the layout below, what kind of optimizations can I make to the table to be able to delete records, if any? Is there other storage in Hadoop better suited for this than Hive, such as HBase? My jobs run on Spark and can take an hour to process the data. I don't think I need ACID transactions, because I would probably run the deletes on weekends when no job is accessing the tables, or after all jobs finish.
CREATE TABLE web_log (
  user_id      STRING,
  ip_address   STRING,
  url_resource STRING,
  url_referrer STRING
)
PARTITIONED BY (access_dt STRING COMMENT 'yyyymmdd', internet_provider STRING)
STORED AS ORC;
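Since I would run the deletes offline anyway, one approach I am considering is rewriting partitions without ACID: for each partition that contains the user's rows, overwrite it with everything except that user. A sketch, assuming dynamic partitioning is enabled (the user ID value is made up):

```sql
-- Allow INSERT OVERWRITE to target whichever partitions the data lands in.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Rewrite partitions, dropping the requested user's rows.
-- Partition columns must come last in the SELECT list.
-- In practice I would first identify the affected partitions and add
-- a predicate on access_dt to avoid rewriting two years of data.
INSERT OVERWRITE TABLE web_log PARTITION (access_dt, internet_provider)
SELECT user_id, ip_address, url_resource, url_referrer,
       access_dt, internet_provider
FROM web_log
WHERE user_id <> 'user-to-delete';
```

The downside is that a weekend batch rewrites every touched partition in full, which is expensive at 200 million rows per day.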