Best practice for storing lots of small XML files

Rising Star

Hello gurus,

 

Can anyone point me to a good resource documenting the "best practice" for storing lots (10-100M per day) of small (~40 KB) XML files in HDFS? We are thinking of using SequenceFile with Snappy block compression to store the raw data files (which will then be processed in a subsequent step).
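For a sense of scale, here is a rough back-of-the-envelope sketch of what those numbers imply. It assumes decimal units (1 KB = 1,000 bytes) and ignores compression; the container size of 1,000 files is only an illustrative parameter, and the function names are made up for this example.

```python
KB = 1000  # assumption: decimal kilobytes


def daily_volume_bytes(files_per_day, avg_file_bytes=40 * KB):
    """Raw (uncompressed) bytes arriving per day."""
    return files_per_day * avg_file_bytes


def container_count(files_per_day, files_per_container):
    """Number of container files needed per day (ceiling division)."""
    return -(-files_per_day // files_per_container)


if __name__ == "__main__":
    for n in (10_000_000, 100_000_000):
        tb = daily_volume_bytes(n) / 1e12
        containers = container_count(n, 1000)
        print(f"{n:,} files/day: {tb:.1f} TB/day raw, "
              f"{containers:,} containers of 1,000 files")
```

So the raw volume is roughly 0.4 to 4 TB per day, and packing 1,000 files per container cuts the number of HDFS files per day from 10-100 million to 10,000-100,000.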

 

Does anyone have experience with this approach, or know of best practices and gotchas to watch out for?

 

Thanks,

Martin

3 REPLIES

Re: Best practice for storing lots of small XML files

Contributor

Storing the files directly in a NoSQL database seems a better fit than HDFS, which is optimized for large files. Of course, this also depends on how they will be processed downstream.

 

Cheers,

Miles

 

Re: Best practice for storing lots of small XML files

Explorer

@Martin: Which solution did you choose?

 

Best, Thomas

Re: Best practice for storing lots of small XML files

Rising Star

Hi Thomas,

We configured Flume to batch together 1,000 XML files and store them as a SequenceFile.
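The grouping idea itself can be sketched in a few lines. This is not Flume's implementation and does not use the SequenceFile API; it only illustrates batching a stream of (file name, content) pairs into groups of 1,000, where each group would become one container file and each pair one key/value record. The `batches` helper and the sample records are hypothetical.

```python
from itertools import islice


def batches(records, size=1000):
    """Yield lists of up to `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Hypothetical input: 2,500 tiny XML documents keyed by file name.
    records = ((f"doc-{i}.xml", b"<doc/>") for i in range(2500))
    for n, batch in enumerate(batches(records)):
        # In a real pipeline each batch would be written out as one container file.
        print(f"batch {n}: {len(batch)} records, first key {batch[0][0]}")
```

The last batch may be smaller than 1,000, so in practice you would also want a time-based flush so that a slow trickle of files does not sit unwritten.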

Do let me know how you decide to proceed with your protobuf data. I think it is a very similar requirement.

 

Regards,

Martin

 
