New Contributor
Posts: 3
Registered: ‎11-29-2018

Strategy for HDFS partitioning from disjoint logs at ingest time

Hi All,

 

  • We have four types of logs (LogA, LogB, LogC and LogD), each containing different information.
  • LogA contains the timestamp field that we want to partition on.
  • LogA through LogD are all correlated by a UUID within the file.
  • The logs for the same UUID can arrive in any order, sometimes days or weeks apart. For example, for UUID "123", LogB and LogC might arrive on Monday, but LogA and LogD might not arrive until Friday.
  • Logically, once we have all four logs, they could be joined on UUID into a single flat table, which would then be partitioned on LogA's timestamp.
  • Slightly complicating the situation, there can be multiple LogA records for any given UUID, with different timestamps (days to weeks apart). This is a valid situation, and every LogA for a given UUID must be stored.
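
To make the intended result concrete, here is a minimal sketch in plain Python (not Hive) of the join described above. The field names (`uuid`, `ts`), the `a_`/`b_`/`c_`/`d_` column prefixes, the daily partition granularity, and the assumption of at most one LogB, LogC and LogD record per UUID are all illustrative guesses, not our real schema.

```python
from datetime import datetime


def join_logs(log_a, log_b, log_c, log_d):
    """Join the four log types on UUID, emitting one flat row per LogA record."""
    # Assumption: at most one LogB/LogC/LogD record per UUID.
    b_by_uuid = {r["uuid"]: r for r in log_b}
    c_by_uuid = {r["uuid"]: r for r in log_c}
    d_by_uuid = {r["uuid"]: r for r in log_d}

    rows = []
    for a in log_a:
        uuid = a["uuid"]
        # Skip UUIDs that are still missing one of the other three logs.
        if uuid not in b_by_uuid or uuid not in c_by_uuid or uuid not in d_by_uuid:
            continue
        row = {"uuid": uuid}
        sources = (("a", a), ("b", b_by_uuid[uuid]),
                   ("c", c_by_uuid[uuid]), ("d", d_by_uuid[uuid]))
        for prefix, record in sources:
            for key, value in record.items():
                if key != "uuid":
                    row[f"{prefix}_{key}"] = value
        # Partition key comes from LogA's timestamp (daily granularity assumed).
        row["partition"] = a["ts"].strftime("%Y-%m-%d")
        rows.append(row)
    return rows
```

With two LogA records for UUID "123" and one each of LogB, LogC and LogD, this yields two flat rows that can land in two different partitions. A UUID that is still missing any of the other logs yields nothing yet.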

 

Questions:

1. Should we just write each log type to its own Hive table, then run periodic joins across LogA/B/C/D into a new 'master' table partitioned by LogA's timestamp?

 

If we do this, my primary concern is data latency.  New data wouldn't be available in the master table until the next scheduled join has run, right?
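
To illustrate the latency I mean, here is a rough Python sketch that estimates when a UUID would first show up in the master table. The function name `visible_at`, the fixed batch schedule, and the 6-hour interval are hypothetical, and the time the join itself takes to run is ignored.

```python
import math
from datetime import datetime, timedelta


def visible_at(arrivals, first_run, interval):
    """Return the time of the first batch run at or after the last log arrival."""
    # A UUID is only joinable once the last of its four logs has arrived.
    complete_at = max(arrivals.values())
    if complete_at <= first_run:
        return first_run
    runs_needed = math.ceil((complete_at - first_run) / interval)
    return first_run + runs_needed * interval


# Hypothetical arrival times for UUID "123": B and C on Monday, A and D on Friday.
arrivals = {
    "LogB": datetime(2018, 12, 3, 9, 0),
    "LogC": datetime(2018, 12, 3, 9, 0),
    "LogA": datetime(2018, 12, 7, 13, 30),
    "LogD": datetime(2018, 12, 7, 14, 0),
}
ready = visible_at(arrivals, datetime(2018, 12, 3, 0, 0), timedelta(hours=6))
```

In this made-up case the last log arrives Friday at 14:00, but the row would only become visible at the Friday 18:00 run, so the added delay is up to one full interval on top of the wait for the last log.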

 

2. Are there other options possible?  Any recommendations?

 

 

Many Thanks!
