Created 05-06-2016 11:26 AM
I have a 4-node cluster and I am running a MapReduce job on it. The input is a 1.53 GB JSON file. Each Mapper task reads a JSON record and manipulates the text. I observed the following after executing the job:
1) There are 15 Mapper tasks, which is correct. (no issues here)
2) Only 1% of the job was processed in 50 minutes, which is very slow.
3) Only 4 mapper tasks are shown as running.
4) Two mappers are running on Machine1 and the other two mappers on Machine2.
5) Mapper task 1 on Machine1 shows a total of 21,627,027 bytes read, and this figure keeps increasing every few seconds.
Following is what I need to understand:
1) Why do only two nodes have all the Mapper tasks running? Why are the other nodes not running any mappers?
2) If there is one mapper per 128 MB file block, why is the mapper task on Machine1 showing 21,627,027 bytes (about 21 MB) of data? (Edited: I had originally written 21120 MB, which was a calculation mistake. The correct figure is about 21 MB.)
Created 05-06-2016 01:19 PM
Q1) Why do only two nodes have all the Mapper tasks running? Why are the other nodes not running any mappers?
A: Mappers and reducers run only in YARN containers on nodes running a NodeManager. Click on your YARN service in Ambari and, in the Summary tab, check how many NMs you have. Most likely you have only 2 NMs, on Machine1 and Machine2. Now, why only 2 mappers per machine? That depends on your YARN and MapReduce settings. If no other jobs are running, it means each node can run only 2 mappers at a time. To confirm, check yarn.nodemanager.resource.memory-mb in YARN and mapreduce.map.memory.mb in MapReduce.
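A quick way to confirm from the command line (assuming the standard YARN CLI is available on the cluster; the memory values in the comments are hypothetical, just to illustrate the container math):

yarn node -list
# Lists the NodeManagers registered with the ResourceManager; with only
# 2 NMs, containers can be placed only on those 2 machines.

# Hypothetical container math: with
#   yarn.nodemanager.resource.memory-mb = 8192
#   mapreduce.map.memory.mb             = 4096
# each NM can hold 8192 / 4096 = 2 map containers at a time, which would
# match the 2 mappers per machine observed above.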
Q2) If there is one mapper per 128 MB file block, why is the mapper task on Machine1 showing 21627027 bytes (21120 MB) of data?
A: 21627027 bytes is 21,627,027, or about 21 MB, not 21120 MB, and so less than 128 MB. Note also that not all blocks are 128 MB; some are smaller (if a file is 150 MB, one block will be a "full" 128 MB and the other only about 22 MB). And since your counter keeps increasing, that mapper is simply still reading through its split; the number will grow until it reaches the end of the (at most 128 MB) split.
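If you want to see the actual block layout of your input file, fsck will print it (the path here is just a placeholder for your file):

hdfs fsck /path/to/input.json -files -blocks
# For the 150 MB example above this would report two blocks: one of
# 134217728 bytes (the "full" 128 MB) and one of about 23068672 bytes (~22 MB).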
Created 05-10-2016 03:36 AM
Thanks, I understand it now. The problem was indeed that NodeManager was not running on the other two nodes. Also, I made a mistake in my MB calculation, which led me to misunderstand the process.