I have a question related to processing small files. I have a situation where relatively small files (a couple of Megs) come in every hour or so. Data validation and cleansing are the operations that happen on these before they are finally fed into a data warehouse.
The data cleansing and validation are perfect candidates for MR; however, the Hadoop "small file" problem is pretty famous. I have read several good blog posts on this, including a very early one from Tom White. The potential options are:
1) Just combine the files before processing - This is OK, but I need the filename information because that metadata is key and is used as an identifier.
2) Use SequenceFiles - I can build the input as SequenceFiles from the data and thereby keep the filename information by storing it in the key.
3) Use CombineFileInputFormat - I am still digesting the implementation of this, but it seems like a viable option.
I see options 2 and 3 as the ones I should dig deeper into and prototype. In the meantime: 1) does anybody know of any generic pros/cons of SequenceFiles vs CombineFileInputFormat, and 2) are there other options?
Thanks and regards
You most likely will do a combination of what you listed. If you do 1 and 2 first you won't need to do 3. I would recommend combining the files prior to writing to HDFS if at all possible. Have you looked at Flume yet?
As for SequenceFiles vs CombineFileInputFormat: with CombineFileInputFormat you are only working around part of the small files problem, not getting rid of it. What I mean by that is you still have lots of small files, which will impact your NameNode and DataNode memory footprints. Each of those files, when processed via MapReduce, is a very small unit of work. You are combining the units of work by using the special InputFormat, but you are not solving the real problem. If you fix the real problem by merging your data ahead of time, you won't have to worry about combining small tasks into larger ones.
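To make "merging your data ahead of time" a bit more concrete, here is a minimal sketch of one way to merge small text files before they are written to HDFS while still keeping the filename. It assumes the inputs are line-oriented UTF-8 text and that a tab never appears in a filename; the class name `SmallFileMerger` and the "filename, tab, line" layout are my own illustration, not something from the posts above.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: merges small text files into one file, tagging each line with its source filename. */
final class SmallFileMerger {

    /**
     * Writes "filename<TAB>line" for every line of every regular file in inputDir.
     * mergedFile should live outside inputDir so it is not picked up as an input.
     * Returns the number of lines written.
     */
    static long merge(Path inputDir, Path mergedFile) throws IOException {
        List<Path> inputs = new ArrayList<>();
        try (DirectoryStream<Path> dir = Files.newDirectoryStream(inputDir)) {
            for (Path p : dir) {
                if (Files.isRegularFile(p)) {
                    inputs.add(p);
                }
            }
        }
        Collections.sort(inputs); // deterministic output order

        long lines = 0;
        try (BufferedWriter out = Files.newBufferedWriter(mergedFile, StandardCharsets.UTF_8)) {
            for (Path in : inputs) {
                String name = in.getFileName().toString();
                // Files are only a couple of MB each, so reading one fully into memory is acceptable here.
                for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
                    out.write(name);
                    out.write('\t');
                    out.write(line);
                    out.write('\n');
                    lines++;
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        long n = merge(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Merged " + n + " lines");
    }
}
```

A mapper could then split each record on the first tab to recover the filename, so the identifier survives the merge. For binary inputs this layout does not work, and a container format with the filename as the key is the better fit.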
Another option would be to wrap the small files in a simple Avro schema, with the filename as a string and the file content as a byte array. Probably the key advantage there is Avro's support outside Java / the Hadoop ecosystem. It also lets you add additional metadata alongside the file content later if you want to.
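As a rough sketch of that Avro wrapper, something like the following could work. It assumes the Apache Avro Java library (1.7.5 or later, for `SchemaBuilder`) is on the classpath; the record name `SmallFile`, the namespace, and the class `SmallFilesToAvro` are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

/** Hypothetical helper: wraps each small file in an Avro record of (filename, content). */
final class SmallFilesToAvro {

    // Schema as described above: filename as a string, file content as bytes.
    static final Schema SCHEMA = SchemaBuilder.record("SmallFile")
            .namespace("example.smallfiles") // assumed namespace
            .fields()
            .requiredString("filename")
            .requiredBytes("content")
            .endRecord();

    /** Packs every regular file in inputDir into avroFile (which should live outside inputDir). */
    static int pack(Path inputDir, File avroFile) throws IOException {
        int count = 0;
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA));
             DirectoryStream<Path> dir = Files.newDirectoryStream(inputDir)) {
            writer.create(SCHEMA, avroFile);
            for (Path p : dir) {
                if (!Files.isRegularFile(p)) {
                    continue;
                }
                GenericRecord record = new GenericData.Record(SCHEMA);
                record.put("filename", p.getFileName().toString());
                record.put("content", ByteBuffer.wrap(Files.readAllBytes(p)));
                writer.append(record);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        int n = pack(Paths.get(args[0]), new File(args[1]));
        System.out.println("Packed " + n + " files");
    }
}
```

Adding more metadata later would mean adding fields to the schema, ideally with defaults so that older files stay readable.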
If you do decide to go the SequenceFile route, Endgame's BinaryPig work has a nice example of loading a directory of small files into a SequenceFile in HDFS. Link: https://github.com/endgameinc/binarypig/blob/master/binarypig/src/main/java/com/endgame/binarypig/ut...
You'd need to switch the MD5 sum key for the filename, but that shouldn't be too hard. :)
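For reference, a stripped-down version of the same idea with the filename as the key might look like the sketch below. This is not the BinaryPig code; it is a minimal illustration that assumes the Hadoop client libraries (2.x or later) are on the classpath, and the class name `SmallFilesToSequenceFile` is hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Hypothetical helper: packs a local directory of small files into one SequenceFile. */
final class SmallFilesToSequenceFile {

    /**
     * Appends one record per regular file in localDir: key = filename, value = raw bytes.
     * The target path is resolved against the filesystem configured in conf
     * (HDFS on a cluster, the local filesystem with a default Configuration).
     */
    static int pack(java.nio.file.Path localDir, Path target, Configuration conf) throws IOException {
        int count = 0;
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                     SequenceFile.Writer.file(target),
                     SequenceFile.Writer.keyClass(Text.class),
                     SequenceFile.Writer.valueClass(BytesWritable.class));
             DirectoryStream<java.nio.file.Path> dir = Files.newDirectoryStream(localDir)) {
            for (java.nio.file.Path p : dir) {
                if (!Files.isRegularFile(p)) {
                    continue;
                }
                byte[] content = Files.readAllBytes(p);
                writer.append(new Text(p.getFileName().toString()), new BytesWritable(content));
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        int n = pack(Paths.get(args[0]), new Path(args[1]), new Configuration());
        System.out.println("Packed " + n + " files");
    }
}
```

A job reading this with SequenceFileInputFormat would get the filename as the map key and the file bytes as the value, which keeps the identifier the original question needs.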