How to see Mapreduce Spill Disk Activity

Explorer

Hi,

I have just started learning Hadoop, and I have no idea how to check whether a MapReduce job is spilling to disk. If it is, correct me if I am wrong, but we have to increase the io.sort size. Please help me out with this.

Also, what other parameters need to be checked in the mapred-site.xml and hadoop-env.sh files if there is too much spill?

1 ACCEPTED SOLUTION

Champion
The best indicators are the job counters. Take a look at "FILE: Number of bytes written" and "Spilled Records", especially in relation to "Map output records". If the spilled records are a large portion of the map output records, you are spilling a lot.
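
To make that comparison concrete, here is a minimal Python sketch that divides Spilled Records by Map output records. The counter values are made up for illustration and the helper name `spill_ratio` is hypothetical; in practice you would copy the real numbers from the job's counter page.

```python
def spill_ratio(spilled_records: int, map_output_records: int) -> float:
    """Return Spilled Records divided by Map output records."""
    if map_output_records <= 0:
        raise ValueError("map_output_records must be positive")
    return spilled_records / map_output_records


# Hypothetical counter values, copied by hand from a job's counter page.
counters = {
    "Map output records": 1_000_000,
    "Spilled Records": 2_500_000,
}

ratio = spill_ratio(counters["Spilled Records"], counters["Map output records"])
print(f"Spilled Records / Map output records = {ratio:.2f}")
```

The higher this ratio climbs, the more of the map output is being written to disk; comparing it before and after a tuning change is more useful than judging a single value in isolation.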

The first setting below determines how much memory to use for the map-side sort buffer. The second, the spill percentage, is the fraction of that buffer at which it starts spilling to disk. You can tweak both to reduce the amount spilled. The io.sort.mb buffer is a portion of the Map heap, so there isn't a clear-cut "it should be X". You can experiment and test it for your job to see how much you can give it without slowing down your Mappers. You could also increase the mapper memory as you increase io.sort.mb.

mapreduce.task.io.sort.mb
mapreduce.map.sort.spill.percent
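
As a rough illustration of how these two settings interact, the Python sketch below multiplies the sort buffer size by the spill percentage to estimate how much map output can accumulate before a spill starts. The function name `spill_threshold_mb` is hypothetical, and the first call assumes the stock defaults of 100 MB and 0.80; check your own cluster's values before relying on them.

```python
def spill_threshold_mb(sort_mb: int, spill_percent: float) -> float:
    """Approximate buffer fill level (in MB) at which a spill begins.

    sort_mb       -- value of mapreduce.task.io.sort.mb
    spill_percent -- value of mapreduce.map.sort.spill.percent (0 < p <= 1)
    """
    if sort_mb <= 0:
        raise ValueError("sort_mb must be positive")
    if not 0.0 < spill_percent <= 1.0:
        raise ValueError("spill_percent must be in the range (0, 1]")
    return sort_mb * spill_percent


# Assumed default values: 100 MB buffer, spill at 80% full.
print(spill_threshold_mb(100, 0.80))

# A hypothetical tuned job: larger buffer, spill later.
print(spill_threshold_mb(512, 0.90))
```

This is only the arithmetic behind the two knobs. The buffer also holds record metadata, so the amount of actual map output that fits before a spill will be somewhat lower than the figure printed here.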


12 REPLIES

Explorer

@mbigelow My English is not that good, so I assume from your answer that I can set more than 8 GB in yarn.scheduler.maximum-allocation-mb. Please correct me if I am wrong.

Champion
@matt123 You got it.

Explorer

@mbigelow Can't thank you enough, mate.