Is there a way to detect (and reject) duplicate keys in sequence files?

Our cluster ingests messages (you may think of them as email messages) on a daily basis and stores them in sequence files.

 

In the absence of an obvious key we generate an MD5 hash of each message and use that as the identity.
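
To make the keying concrete, it amounts to something like the following Python sketch. This is illustrative only and not our actual ingest code; `message_key` is a hypothetical name, and it assumes the hash is taken over the raw message bytes.

```python
import hashlib


def message_key(raw_message: bytes) -> str:
    """Return the MD5 hex digest of the raw message bytes.

    Assumption: the digest of the full raw message is used as its identity.
    """
    return hashlib.md5(raw_message).hexdigest()
```

Two byte-identical messages always map to the same key, which is what makes the hash usable for duplicate detection.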

 

Furthermore, we would like to detect whether each message is already on the cluster "somewhere" (i.e., in any sequence file on the cluster).

 

One solution we started with is maintaining a single file in HDFS that lists all keys already ingested, and rejecting the import of any message whose MD5 hash is already in that list. But loading that file, which contains many millions of entries and has grown to 2 GB in size, has proven unwieldy.
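
Schematically, that check looks like the Python sketch below. It is a simplification under stated assumptions: a local text file with one hex digest per line stands in for the HDFS key file, and `load_seen_keys` and `ingest` are hypothetical names rather than our real code.

```python
import hashlib


def load_seen_keys(path):
    """Read every previously ingested key into an in-memory set.

    Assumption: the key file holds one MD5 hex digest per line.
    """
    seen = set()
    with open(path, "r", encoding="ascii") as key_file:
        for line in key_file:
            key = line.strip()
            if key:
                seen.add(key)
    return seen


def ingest(messages, seen):
    """Split raw messages into accepted (new) and rejected (duplicate).

    `seen` is updated in place, so duplicates within one batch are
    rejected as well.
    """
    accepted = []
    rejected = []
    for raw in messages:
        key = hashlib.md5(raw).hexdigest()
        if key in seen:
            rejected.append(key)
        else:
            seen.add(key)
            accepted.append((key, raw))
    return accepted, rejected
```

The entire key file has to be read into memory before the first message can be checked, and that is the step that no longer scales for us.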

 

So we are looking for another solution. Perhaps storing all MD5 values in HBase? In a SQL table? Some other mechanism that we haven't thought of?

 

Basically I am taking this forum's temperature.  Has anyone out there encountered such a problem and (more importantly) come up with a working, if not elegant, solution?