11-30-2018 11:55 AM
1. Should we just write each log type to its own Hive table, then periodically join LogA/B/C/D into a new 'master' table partitioned by LogA timestamp?
If we do this, my primary concern is data latency. New data wouldn't be available in the master table until the joins are processed on some interval, right?
2. Are there other options possible? Any recommendations?
04-09-2019 08:02 AM
I am not sure I fully understand the use case, but if your primary concern is latency, HBase is typically worth a look.
I would imagine the UUID as the row key, with the different logs stored in value columns. You would need a solution for multiple values per log type; perhaps you could append to the value if one is already there.
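To make the layout concrete, here is a minimal sketch in plain Python. It does not use any HBase client; the dictionary is only a stand-in for a table keyed by UUID, and the names `table`, `put_log` and `get_row` are made up for illustration. Multiple values for the same log type are handled by appending to a list.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for an HBase table:
# row key (UUID) -> column (log type) -> list of log lines.
table = defaultdict(lambda: defaultdict(list))

def put_log(uuid, log_type, line):
    """Append a log line to the column for its log type under the UUID row."""
    table[uuid][log_type].append(line)

def get_row(uuid):
    """Return all log lines for a UUID, keyed by log type (empty dict if unknown)."""
    return {col: list(vals) for col, vals in table.get(uuid, {}).items()}

put_log("u-1", "LogA", "2018-11-30T11:55:00 start")
put_log("u-1", "LogB", "payload received")
put_log("u-1", "LogB", "payload retried")  # second value for the same column
print(get_row("u-1"))
```

A single lookup by UUID then returns everything known about that UUID, without any join.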
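That being said, your Hive setup does not sound bad either. You could consider one table per log file type, plus a view that joins everything together. Since a plain view is evaluated at query time, new rows should be visible without waiting for a periodic join job, at the cost of doing the join on every query.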
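A rough sketch of the table-per-log-type plus view idea, using Python's built-in sqlite3 module as a stand-in because it runs anywhere. This is not HiveQL and the table and column names (`log_a`, `log_b`, `master`) are assumptions; the point is only that the view picks up new rows as soon as they are written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log_a (uuid TEXT, ts TEXT, msg TEXT);
CREATE TABLE log_b (uuid TEXT, msg TEXT);

-- The 'master' is a view, so the join runs at query time.
CREATE VIEW master AS
SELECT a.uuid, a.ts, a.msg AS a_msg, b.msg AS b_msg
FROM log_a a LEFT JOIN log_b b ON a.uuid = b.uuid;
""")

conn.execute("INSERT INTO log_a VALUES ('u-1', '2018-11-30 11:55:00', 'start')")
rows_before = conn.execute("SELECT * FROM master").fetchall()

# A LogB row arrives later; no batch job is needed for it to show up.
conn.execute("INSERT INTO log_b VALUES ('u-1', 'payload received')")
rows_after = conn.execute("SELECT * FROM master").fetchall()

print(rows_before)
print(rows_after)
```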
Especially if you only append log files and never update or delete them, you could optimize further: use a view to show all UUIDs that were updated today, and a precomputed base table to serve all other UUIDs.
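One way this split could look, again sketched with Python's sqlite3 rather than Hive. All names here (`master_base`, `touched_today`, the `day` column) are hypothetical, and the "current day" is hard-coded so the example is reproducible. UUIDs that received any log line today are joined live; everything else comes from the precomputed table.

```python
import sqlite3

TODAY = "2019-04-09"  # assumed current day, fixed for the example

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
CREATE TABLE log_a (uuid TEXT, day TEXT, msg TEXT);
CREATE TABLE log_b (uuid TEXT, day TEXT, msg TEXT);
-- Filled by a periodic (e.g. nightly) join job.
CREATE TABLE master_base (uuid TEXT, a_msg TEXT, b_msg TEXT);

-- UUIDs with at least one log line written today.
CREATE VIEW touched_today AS
SELECT uuid FROM log_a WHERE day = '{TODAY}'
UNION
SELECT uuid FROM log_b WHERE day = '{TODAY}';

-- Precomputed rows for untouched UUIDs, live join for touched ones.
CREATE VIEW master AS
SELECT uuid, a_msg, b_msg FROM master_base
WHERE uuid NOT IN (SELECT uuid FROM touched_today)
UNION ALL
SELECT a.uuid, a.msg, b.msg
FROM log_a a LEFT JOIN log_b b ON a.uuid = b.uuid
WHERE a.uuid IN (SELECT uuid FROM touched_today);
""")

# Yesterday's raw logs and the base table the periodic job built from them.
conn.executemany("INSERT INTO log_a VALUES (?, ?, ?)", [
    ("u-1", "2019-04-08", "start"),
    ("u-2", "2019-04-08", "start"),
])
conn.execute("INSERT INTO log_b VALUES ('u-1', '2019-04-08', 'done')")
conn.executemany("INSERT INTO master_base VALUES (?, ?, ?)", [
    ("u-1", "start", "done"),
    ("u-2", "start", None),
])

# A LogB line for u-2 arrives today, after the last periodic job.
conn.execute("INSERT INTO log_b VALUES ('u-2', ?, 'done late')", (TODAY,))

rows = conn.execute("SELECT * FROM master ORDER BY uuid").fetchall()
print(rows)
```

The live join only touches today's UUIDs, so it stays small as long as most UUIDs are finished before the periodic job runs.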