Processing small files

New Contributor

I have a question about processing small files. I have a situation where relatively small files (a couple of megabytes each) arrive every hour or so. They go through data validation and cleansing before they are finally fed into a data warehouse.

The data cleansing and validation are perfect candidates for MR; however, the Hadoop "small file" problem is well known. I have read several good blog posts on this, including a very early one from Tom White. The potential options are:

1) Just combine the files before processing - This is OK, but I need the filename information because that metadata is key and is used as an identifier.

2) Use SequenceFiles - I can build the input as SequenceFiles from the data and thereby keep the filename information in the key (see the sketch just after this list).

3) I could use CombineFileInputFormat - I am still digesting the implementation of this, but it seems like a viable option.
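For option 2, here is a minimal sketch of packing a directory of small files into one SequenceFile keyed by filename (assuming the Hadoop 2 writer API, that each file fits comfortably in memory, and placeholder local/HDFS paths; not production code):

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder output path: one container file instead of many small ones.
            Path out = new Path("/data/packed/hourly.seq");

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                // One record per small file: key = filename, value = raw bytes,
                // so the identifying metadata survives the merge.
                for (File f : new File("/local/incoming").listFiles()) { // placeholder dir
                    writer.append(new Text(f.getName()),
                                  new BytesWritable(Files.readAllBytes(f.toPath())));
                }
            }
        }
    }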

 

I see options 2 and 3 as what I should dig deeper into and prototype. In the meantime: 1) does anybody know of any general pros/cons of SequenceFiles vs. CombineFileInputFormat, and 2) are there other options?

 

The blog posts I have read:

http://bit.ly/1nFEw6y

http://bit.ly/Pofv45

http://bit.ly/1ivAM4k

 

Thanks and regards

Raj


Re: Processing small files

Cloudera Employee

Raj,

You will most likely end up doing a combination of what you listed. If you do 1 and 2 first, you won't need to do 3. I would recommend combining the files prior to writing to HDFS if at all possible. Have you looked at Flume yet?

https://flume.apache.org/FlumeUserGuide.html

http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
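For the spooling-directory route, a minimal sketch of an agent configuration (the agent name "a1", the directories, and the roll size are placeholder assumptions; fileHeader keeps the original filename on each event):

    # Agent "a1": spooling-directory source -> file channel -> HDFS sink.
    a1.sources = src1
    a1.channels = ch1
    a1.sinks = snk1

    # Watch a local directory for completed files (placeholder path).
    a1.sources.src1.type = spooldir
    a1.sources.src1.spoolDir = /var/incoming
    # Attach the absolute source-file path to each event as a header.
    a1.sources.src1.fileHeader = true
    a1.sources.src1.channels = ch1

    a1.channels.ch1.type = file

    # Roll HDFS output at ~128 MB rather than one file per input.
    a1.sinks.snk1.type = hdfs
    a1.sinks.snk1.hdfs.path = /data/incoming
    a1.sinks.snk1.hdfs.fileType = DataStream
    a1.sinks.snk1.hdfs.rollSize = 134217728
    a1.sinks.snk1.hdfs.rollCount = 0
    a1.sinks.snk1.hdfs.rollInterval = 0
    a1.sinks.snk1.channel = ch1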

 

On the pros and cons of SequenceFiles vs. CombineFileInputFormat: with CombineFileInputFormat you are working around part of the small-files problem, not getting rid of it. You still have lots of small files, which will impact your NameNode and DataNode memory footprints, and lots of files that are very small units of work when processed via MapReduce. You are combining the units of work by using the special InputFormat, but you are not solving the real problem. If you fix the real problem by merging your data ahead of time, you won't have to worry about combining small tasks into larger ones.
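To make the comparison concrete, wiring CombineFileInputFormat into a job is only a few lines (a sketch assuming a Hadoop release that ships CombineTextInputFormat for plain-text inputs; the job name, input path, and the 128 MB cap are placeholder choices):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class CombineJobSetup {
        public static Job configure(Configuration conf) throws Exception {
            Job job = Job.getInstance(conf, "small-file-cleansing"); // placeholder name
            // Pack many small files into each split, capped at ~128 MB, so one map
            // task covers many files instead of one tiny task per file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            CombineTextInputFormat.addInputPath(job, new Path("/data/incoming")); // placeholder
            return job;
        }
    }

Note that this only changes the task granularity; the many small files still sit in HDFS, which is exactly the memory-footprint issue above.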

 

Woody

Re: Processing small files

Explorer

Another option would be to wrap the small files in a simple Avro schema, with the filename as a string and the file content as a byte array. Probably the key advantage there is Avro's support outside Java and the Hadoop ecosystem. It also lets you add extra metadata alongside the file content later if you want.
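A rough sketch of that wrapping with the Avro Java API (the schema, field names, and paths here are placeholder assumptions):

    import java.io.File;
    import java.nio.ByteBuffer;
    import java.nio.file.Files;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroFilePacker {
        // Record schema: filename plus raw content. Extra metadata fields can be
        // added here later without breaking existing readers.
        private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"SmallFile\",\"fields\":["
            + "{\"name\":\"filename\",\"type\":\"string\"},"
            + "{\"name\":\"content\",\"type\":\"bytes\"}]}");

        public static void main(String[] args) throws Exception {
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
                writer.create(SCHEMA, new File("/tmp/smallfiles.avro")); // placeholder output

                for (File f : new File("/local/incoming").listFiles()) { // placeholder dir
                    GenericRecord rec = new GenericData.Record(SCHEMA);
                    rec.put("filename", f.getName());
                    rec.put("content", ByteBuffer.wrap(Files.readAllBytes(f.toPath())));
                    writer.append(rec);
                }
            }
        }
    }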

 

If you do decide to go the SequenceFile route, Endgame's BinaryPig work has a nice example of loading a directory of small files into a SequenceFile in HDFS. Link: https://github.com/endgameinc/binarypig/blob/master/binarypig/src/main/java/com/endgame/binarypig/ut...

 

You'd need to swap the MD5-sum key for the filename, but that shouldn't be too hard. :)
