
Strategy for HDFS partitioning from disjoint logs at ingest time


New Contributor

Hi All,

 

  • We have 4 different types of logs that contain different types of information: LogA, LogB, LogC, and LogD.
  • LogA contains the timestamp field that we want to partition on.
  • LogA-D are all correlated by a UUID within the file.
  • These logs arrive in any order, sometimes days or weeks apart for the same UUID. In other words, for UUID "123", LogB and LogC might arrive on Monday, but LogA and LogD might not arrive until Friday.
  • Logically, once we have all 4 logs, they could be joined on UUID into a single flat table, then partitioned on LogA's timestamp (see the sketch after this list).
  • Slightly complicating the situation, there can be multiple LogA entries for any given UUID, with different timestamps (days to weeks apart). This is a valid situation, and each LogA for a given UUID must be stored.
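
To make that concrete, the flat target we have in mind looks roughly like this (table and column names below are placeholders, not our real schema):

    -- Hypothetical flat table joining LogA-D on UUID,
    -- partitioned by LogA's event date
    CREATE TABLE master_log (
      uuid         STRING,
      loga_payload STRING,   -- stand-in for LogA's real columns
      logb_payload STRING,   -- stand-in for LogB's real columns
      logc_payload STRING,
      logd_payload STRING
    )
    PARTITIONED BY (loga_date STRING)
    STORED AS ORC;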

 

Questions:

1. Should we just write each log type to its own Hive table, then do periodic joins between LogA/B/C/D into a new 'master' table partitioned by LogA timestamp?

 

If we do this, my primary concern is data latency.  New data wouldn't be available in the master table until the joins are processed on some interval, right?
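
To illustrate what I mean by periodic joins, this is roughly the kind of batch job I'm picturing (names are placeholders, and it assumes dynamic partitioning is enabled):

    -- One staging table per log type, e.g.:
    CREATE TABLE loga_raw (uuid STRING, event_ts TIMESTAMP, payload STRING) STORED AS ORC;
    CREATE TABLE logb_raw (uuid STRING, payload STRING) STORED AS ORC;
    -- ...and likewise for LogC and LogD

    -- Periodic batch join into the partitioned master table
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT OVERWRITE TABLE master_log PARTITION (loga_date)
    SELECT a.uuid, a.payload, b.payload, c.payload, d.payload,
           to_date(a.event_ts) AS loga_date
    FROM loga_raw a
    JOIN logb_raw b ON a.uuid = b.uuid
    JOIN logc_raw c ON a.uuid = c.uuid
    JOIN logd_raw d ON a.uuid = d.uuid;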

 

2. Are there other possible options? Any recommendations?

 

 

Many Thanks!

1 REPLY

Re: Strategy for HDFS partitioning from disjoint logs at ingest time

Rising Star

 

I am not sure if I fully understand the use case, but if your primary concern is latency, then HBase is typically worth a look.

 

I would imagine the UUID to be the key, and the different logs could be stored in value columns. You would need to find a solution for the multiple values; perhaps you could append to the existing value if one is already there.
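
To make that layout concrete (purely a sketch of the key/column idea; exposing it through Hive's HBase storage handler, as below, is my own addition, and all names are made up):

    -- Hypothetical HBase-backed table: row key = UUID,
    -- one column per log type holding that log's payload
    CREATE EXTERNAL TABLE logs_by_uuid (
      uuid STRING,
      loga STRING,
      logb STRING,
      logc STRING,
      logd STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      "hbase.columns.mapping" = ":key,logs:a,logs:b,logs:c,logs:d"
    )
    TBLPROPERTIES ("hbase.table.name" = "logs_by_uuid");

Handling the repeated LogA values (appending, or keeping multiple cell versions in HBase) would still be up to the writer; this mapping only shows the basic key/column shape.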

 

----

 

That being said, your Hive setup also does not sound bad. You could consider one table per log file type, and a view to join everything together.
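
As a sketch of that (reusing the hypothetical loga_raw/logb_raw/... tables from the sketches above, which are just placeholders):

    -- A view that stitches the per-type tables together on UUID
    CREATE VIEW master_log_v AS
    SELECT a.uuid,
           a.payload AS loga,
           b.payload AS logb,
           c.payload AS logc,
           d.payload AS logd,
           to_date(a.event_ts) AS loga_date
    FROM loga_raw a
    JOIN logb_raw b ON a.uuid = b.uuid
    JOIN logc_raw c ON a.uuid = c.uuid
    JOIN logd_raw d ON a.uuid = d.uuid;

The view gives you the latest data on every query, at the cost of re-running the join each time.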

 

Especially if you only append log files and never update or delete them, you could optimize further by having a view that shows all UUIDs updated today, plus a precomputed base table from which you serve all other UUIDs.
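
A rough sketch of that split (it assumes the staging tables carry an ingest_date column and that master_log_base has the same columns as the view; both are assumptions on my part):

    -- Hypothetical set of UUIDs touched by today's ingest
    CREATE VIEW fresh_uuids AS
    SELECT uuid FROM loga_raw WHERE ingest_date = current_date()
    UNION ALL
    SELECT uuid FROM logb_raw WHERE ingest_date = current_date();
    -- ...extend with LogC and LogD the same way

    -- Serve fresh UUIDs from the live join, everything else from the precomputed base
    CREATE VIEW current_log_v AS
    SELECT * FROM master_log_v
    WHERE uuid IN (SELECT uuid FROM fresh_uuids)
    UNION ALL
    SELECT * FROM master_log_base
    WHERE uuid NOT IN (SELECT uuid FROM fresh_uuids);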