Can anyone point me to a good resource documenting the "best practice" for storing lots (10-100M per day) of small (~40 KB) XML files in HDFS? We are thinking of using SequenceFile with Snappy block compression to store the raw data files (which will then be processed in a subsequent step).
Does anyone have experience with this approach, or best practices and gotchas to watch out for?
Direct storage in a NoSQL DB seems a better fit than raw HDFS, which is optimized for a small number of large files (every file, however small, costs NameNode memory). Of course, this also depends on how the data will be processed downstream.
We configured Flume to batch together 1,000 XML files and store them as a SequenceFile.
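For anyone doing the batching by hand rather than through Flume, the write path looks roughly like this. A minimal sketch against the Hadoop 2.x Java API: the output path and the key/value choice (file name as key, raw bytes as value) are my assumptions, and it presumes the native Snappy libraries are available on the client.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class XmlBatcher {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // One SequenceFile per batch of small XML files: key = original
        // file name, value = raw bytes. BLOCK compression lets Snappy
        // compress many records together, which is what makes it pay off
        // for ~40 KB files (RECORD compression would compress each tiny
        // file on its own).
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("hdfs:///data/raw/batch-00001.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK,
                                                new SnappyCodec()))) {
            for (String name : args) { // e.g. the files in one batch
                byte[] xml = Files.readAllBytes(Paths.get(name));
                writer.append(new Text(name), new BytesWritable(xml));
            }
        }
    }
}
```

Keeping the original file name as the key means the downstream job can still recover per-file identity after the batching step.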
Do let me know how you decide to proceed with your protobuf data; I think it is a very similar requirement.