Should file hashes and metadata be processed before or after putting the files in Hadoop?

In HDFS, I have a large number of folders, each containing the full file directory of a computer (i.e., C:\).

I want to obtain and process the hash, file metadata, and full filename of every file in those subdirectories. Should I preprocess the files with a script to extract this data, or can this be done in Hadoop with MapReduce? Wouldn't the MapReduce approach be much faster? What's the best way to do this?
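
For reference, the "preprocess with a script" option could be sketched roughly as follows. This is only an illustration under assumptions: it reads from a local (or locally mounted) filesystem rather than from HDFS directly, it uses SHA-256 because the post does not name a hash algorithm, and the function and field names (`sha256_of_file`, `file_record`, `walk_records`, `size_bytes`, `modified_epoch`) are hypothetical.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_record(path):
    """Collect the full filename, hash, and basic metadata for one file."""
    info = os.stat(path)
    return {
        "path": os.path.abspath(path),      # full filename
        "sha256": sha256_of_file(path),     # assumed hash algorithm
        "size_bytes": info.st_size,
        "modified_epoch": info.st_mtime,    # seconds since the Unix epoch
    }


def walk_records(root):
    """Yield one record per regular file found anywhere under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield file_record(os.path.join(dirpath, name))
```

Calling `list(walk_records("/some/mounted/image"))` (the path is a placeholder) would produce one dictionary per file. The same per-file logic is what a map task would have to run if the work were done inside Hadoop instead.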

Eventually, I want to compare this list of hashes against a blacklist.
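
The comparison step itself is cheap once the hashes exist. A minimal sketch, assuming the blacklist is a plain-text list with one hex digest per line and that each file is represented as a `(path, digest)` pair; both formats and the names `load_blacklist` and `find_blacklisted` are assumptions made for illustration:

```python
def load_blacklist(lines):
    """Normalise one-digest-per-line text into a set for fast membership tests."""
    return {line.strip().lower() for line in lines if line.strip()}


def find_blacklisted(records, blacklist):
    """Yield the (path, digest) pairs whose digest appears in the blacklist."""
    for path, digest in records:
        if digest.lower() in blacklist:
            yield path, digest


# Hypothetical data, for illustration only.
blacklist = load_blacklist(["ABC123\n", "\n", "def456\n"])
records = [("C:\\a.exe", "abc123"), ("C:\\b.txt", "000000")]
hits = list(find_blacklisted(records, blacklist))
```

Because the blacklist is held as a set, each lookup takes constant time on average, so the cost of this step is dominated by reading the records rather than by the matching.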