Created 02-23-2015 09:44 PM
Hello gurus,
Can anyone point me to a good resource documenting best practices for storing large numbers (10-100 million per day) of small (~40 KB) XML files in HDFS? We are thinking of using SequenceFiles with Snappy block compression to store the raw data files, which will then be processed in a subsequent step.
Does anyone have experience with this approach? Any best practices or gotchas to watch out for?
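For concreteness, here is a minimal sketch of the kind of writer we have in mind, using the Hadoop SequenceFile API with block compression and SnappyCodec. Class name, key/value layout, and the output path are just illustrative, and SnappyCodec requires the native Snappy libraries on the cluster:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class XmlToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path out = new Path("/user/ingest/xml/batch-0001.seq");

        // Key = original file name, value = raw XML bytes.
        // BLOCK compression compresses many records together,
        // which matters for small records like these.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new SnappyCodec()))) {
            for (String localFile : args) {
                byte[] xml = Files.readAllBytes(Paths.get(localFile));
                writer.append(new Text(localFile), new BytesWritable(xml));
            }
        }
    }
}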
Thanks,
Martin
Created 03-13-2015 09:55 AM
Direct storage in a NoSQL database seems like a better fit than HDFS, which is optimized for large files: with tens of millions of small files per day, the NameNode, which keeps the metadata for every file and block in memory, quickly becomes the bottleneck. Of course, this also depends on how the files will be processed downstream.
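If you go that route, HBase would be the obvious candidate within the Hadoop ecosystem, since 40 KB values fit comfortably in a single cell. Purely as an illustration (the table and column names are made up, and I am assuming the HBase 1.x client API):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class XmlToHBase {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("xml_docs"))) {
            for (String localFile : args) {
                byte[] xml = Files.readAllBytes(Paths.get(localFile));
                // Row key = file name; one column holds the raw document.
                Put put = new Put(Bytes.toBytes(localFile));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("xml"), xml);
                table.put(put);
            }
        }
    }
}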
Cheers,
Miles
Created 08-27-2015 05:57 AM
@Martin: Which solution did you choose?
Best, Thomas
Created 09-02-2015 05:13 AM
Hi Thomas,
We configured Flume to batch 1,000 XML files at a time and write each batch out as a single SequenceFile.
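For anyone who finds this thread later, our agent definition looks roughly like the following (names and paths changed; Flume 1.x spooling-directory source and HDFS sink, rolling a new file every 1,000 events):

agent.sources = src
agent.channels = ch
agent.sinks = sink

# Watch a local directory for incoming XML files. The BlobDeserializer
# (shipped with the morphline-solr-sink module) turns each whole file
# into a single Flume event instead of one event per line.
agent.sources.src.type = spooldir
agent.sources.src.spoolDir = /data/incoming-xml
agent.sources.src.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent.sources.src.deserializer.maxBlobLength = 100000
agent.sources.src.channels = ch

agent.channels.ch.type = file

# Write compressed SequenceFiles to HDFS; rollSize/rollInterval are
# disabled so only the event count triggers a roll.
agent.sinks.sink.type = hdfs
agent.sinks.sink.channel = ch
agent.sinks.sink.hdfs.path = /user/ingest/xml/%Y-%m-%d
agent.sinks.sink.hdfs.fileType = SequenceFile
agent.sinks.sink.hdfs.codeC = snappy
agent.sinks.sink.hdfs.writeFormat = Writable
agent.sinks.sink.hdfs.rollCount = 1000
agent.sinks.sink.hdfs.rollSize = 0
agent.sinks.sink.hdfs.rollInterval = 0
agent.sinks.sink.hdfs.useLocalTimeStamp = true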
Do let me know how you decide to proceed with your protobuf data; I think it is a very similar requirement.
Regards,
Martin