Our cluster ingests messages (you may think of them as email messages) on a daily basis and stores them in sequence files.
In the absence of an obvious key we generate an MD5 hash of each message and use that as the identity.
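For concreteness, the key generation looks roughly like the sketch below. It is in Python for brevity; the function name `message_key` is illustrative only, and it assumes the raw message bytes are hashed as-is with no normalization.

```python
import hashlib


def message_key(raw_message: bytes) -> str:
    """Return the hex MD5 digest of the raw message bytes.

    Assumption: the message is hashed exactly as received, so two
    messages get the same key only if they are byte-for-byte identical.
    """
    return hashlib.md5(raw_message).hexdigest()
```

Two byte-identical messages always produce the same 32-character key, which is what makes the hash usable as an identity for duplicate detection.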
Furthermore, we would like to detect whether each message is already on the cluster "somewhere" (i.e., in any sequence file on the cluster).
One solution we started with is maintaining a single file in HDFS that lists all keys already ingested, and rejecting any message whose MD5 hash is already in that list. But loading that file, which contains many millions of entries and has grown to 2 GB in size, has proven unwieldy.
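The key-file check amounts to something like the following sketch. It reads a local text file with one hex digest per line as a stand-in for the HDFS file; that file layout and the names `load_known_keys` and `filter_new_messages` are assumptions made for illustration, not the actual job code.

```python
import hashlib
from typing import Iterable, Iterator, Set


def load_known_keys(path: str) -> Set[str]:
    """Load every previously ingested key into memory.

    Assumption: the key file is plain text, one hex MD5 digest per line.
    This is the step that gets expensive as the file grows.
    """
    with open(path, "r", encoding="ascii") as key_file:
        return {line.strip() for line in key_file if line.strip()}


def filter_new_messages(
    messages: Iterable[bytes], known_keys: Set[str]
) -> Iterator[bytes]:
    """Yield only messages whose MD5 key has not been seen before."""
    for raw in messages:
        key = hashlib.md5(raw).hexdigest()
        if key in known_keys:
            continue  # duplicate: reject the import
        known_keys.add(key)  # also catches duplicates within this batch
        yield raw
```

The membership test itself is cheap; the cost is in `load_known_keys`, which has to read the entire key list before a single message can be checked.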
So we are looking for another solution. Perhaps keeping all MD5 values in HBase? In a SQL table? Some other mechanism that we haven't thought of?
Basically I am taking this forum's temperature. Has anyone out there encountered such a problem and (more importantly) come up with a working, if not elegant, solution?