10-17-2013 01:40 AM
Hello Hadoop experts, I have a new problem and a new question for you.
I have a lot of small files (approximately 70,000), totaling 40 GB.
I've developed a map-only Java program to analyze these files. There are no reducers and no output, only counters.
I have a single-node setup of CDH 4.4.0.
The machine has a 150 GB hard disk, and the whole Hadoop environment uses it (logs, libs, HDFS data files, and so on).
So after I put all the files into HDFS, I was left with a little less than 100 GB of free space.
I successfully started my job, and 10 hours later Hadoop fell over because it did not have enough space to write its logs.
I looked at my disk and found that the folder mapred/local/taskTracker/hdfs/jobcache/job_xxxx_xxx occupied all the available space.
Hadoop had processed approximately 10,000 files, so that folder contained approximately 10,000 subfolders, each holding a single job.xml of about 8 MB. So 8 MB * 10,000 ≈ 78 GB.
And here is my question: how can I process all 70,000 files?
(I would need approximately 550 GB of free space to process 40 GB of small files!)
Is it possible to configure Hadoop to clean up after every map?
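A quick back-of-envelope check of the numbers above (assuming each job.xml is roughly 8 MiB, as observed):

```python
# Rough projection of jobcache disk usage, assuming ~8 MiB per job.xml.
JOB_XML_MIB = 8
processed_files = 10_000   # files handled before the disk filled up
total_files = 70_000       # files in the full data set

used_gib = JOB_XML_MIB * processed_files / 1024
needed_gib = JOB_XML_MIB * total_files / 1024

print(f"used so far: {used_gib:.1f} GiB")    # ~78 GiB, matching the observation
print(f"full run:    {needed_gib:.1f} GiB")  # ~547 GiB, i.e. the ~550 GB estimate
```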
10-25-2013 07:02 AM
The jobcache by default maxes out at 10,000 directories, so you should not go above the ~80 GB mark there. However, this is configurable, and it seems like in your case maybe 5,000 directories, or even 1,000, may be enough. You can set mapreduce.tasktracker.local.cache.numberdirectories to a lower value and see if that helps.
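For reference, the mapred-site.xml fragment for that property would look something like this (2000 here is just an example value):

```xml
<property>
  <name>mapreduce.tasktracker.local.cache.numberdirectories</name>
  <value>2000</value>
</property>
```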
Hope this helps.
10-26-2013 12:29 AM
I went to Services > MapReduce > Configuration (View and Edit) and pasted mapreduce.tasktracker.local.cache.numberdirectories into the search box.
I found nothing.
I also searched for just "cache" and just "local", and again found nothing.
10-27-2013 08:36 AM
Within CM, there are several tunable properties, across all the various modules, that are not common enough to have their own options in CM. To handle those, you can add them as a safety valve. To do this:
- Go to Services->MapReduce->Configuration(View and Edit).
- Then expand Service-Wide and click on Advanced.
- There you should see "MapReduce Service Configuration Safety Valve for mapred-site.xml". Paste the following in there based on the value you want to set for the number of cache directories:
- Then save the config and restart the MapReduce service.
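The snippet to paste would look something like the following (the value of 2000 is only an example; pick whatever cap suits your disk):

```xml
<property>
  <name>mapreduce.tasktracker.local.cache.numberdirectories</name>
  <value>2000</value>
</property>
```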
This is true for all the various modules. If you don't find a value when you search, it's probably not directly exposed in CM, but every module has an Advanced section with a "Safety Valve", so you can put your properties there when necessary.
Hope this helps.
10-28-2013 08:36 AM
I'm sorry, I was wrong. It did not help.
I set the MapReduce Service Configuration Safety Valve for mapred-site.xml so that mapreduce.tasktracker.local.cache.numberdirectories is 2000.
And now I see that I have run out of disk space again.
In the folder mapred/local/taskTracker/hdfs/jobcache/job_201310281050_0001 there are more than 6,000 files. (On the other host there are more than 5,000.)
P.S. I checked job.xml, and the value mapreduce.tasktracker.local.cache.numberdirectories is there, set to 2000.
10-28-2013 09:23 AM
Could it have something to do with old cache files, from before you made the change, not being cleaned out? I think there is a mechanism for retiring these old files and moving them off or deleting them, but I'm not positive whether it applies to the actual jobcache files.
Maybe this blog post contains a clue?
10-28-2013 10:44 AM
No, I stopped all jobs before changing the configuration of the MapReduce service, and I restarted the whole cluster.
I also checked the folder mapred/local/taskTracker/hdfs/jobcache, and I am sure it was empty.
Thank you for the link, but I found nothing about the jobcache folder.
Also, after a job fails or completes, its folder is deleted from the jobcache folder along with all the attempt_xxxx folders.