In HDFS, I have a large number of folders, each containing the full file directory of a computer (i.e., C:\).
I want to obtain and process the hash, file metadata, and full filename of every file in those subdirectories. Should I preprocess the files with a script to extract this data, or can I do it in Hadoop with MapReduce? Wouldn't that be much faster? What's the best way to do this?
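For concreteness, this is roughly the kind of mapper I have in mind: a rough sketch, assuming the job input is a text file listing one HDFS file path per line (the class name HashMapper and the choice of MD5 are just placeholders, not anything from the Hadoop API):

```java
import java.io.InputStream;
import java.security.MessageDigest;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HashMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        Path file = new Path(value.toString().trim());
        FileSystem fs = FileSystem.get(context.getConfiguration());

        // Stream the file through MD5 so large files never sit in memory.
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new java.io.IOException(e);
        }
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = fs.open(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md5.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }

        // Emit hash -> "full path <tab> size <tab> mtime" so a reducer (or a
        // later join against the blacklist) can group identical files.
        FileStatus status = fs.getFileStatus(file);
        context.write(new Text(hex.toString()),
                new Text(file.toString() + "\t" + status.getLen()
                        + "\t" + status.getModificationTime()));
    }
}
```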
Eventually, I want to compare this list of hashes against a blacklist.
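If the blacklist is small enough to fit in memory, I imagine the comparison could happen in a reducer, something like this sketch (BlacklistReducer and blacklist.txt are hypothetical names; it assumes the driver shipped the blacklist with job.addCacheFile(...) so it shows up in the task's working directory):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BlacklistReducer extends Reducer<Text, Text, Text, Text> {

    private final Set<String> blacklist = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumption: "blacklist.txt" was distributed via the cache and
        // contains one lowercase hex hash per line.
        try (BufferedReader reader =
                new BufferedReader(new FileReader("blacklist.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                blacklist.add(line.trim().toLowerCase());
            }
        }
    }

    @Override
    protected void reduce(Text hash, Iterable<Text> files, Context context)
            throws IOException, InterruptedException {
        if (blacklist.contains(hash.toString())) {
            // Emit every file whose hash appears on the blacklist.
            for (Text file : files) {
                context.write(hash, file);
            }
        }
    }
}
```

Is that the right approach, or is there a better-established pattern for this kind of hash-and-compare job?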