Okay. So now I have another issue while executing MapReduce jobs. The same MapReduce job was running fine yesterday, but now I am getting an error that mentions the ResourceManager, and the client keeps retrying forever. I did a lot of googling before posting this. I have also restarted the ResourceManager service several times, but with no effect. The ResourceManager is shown as green on the dashboard. Here is the error:
16/02/08 20:33:19 INFO client.RMProxy: Connecting to ResourceManager at master.mydomain.com/172.26.180.6:8050
16/02/08 20:33:20 INFO ipc.Client: Retrying connect to server: master.mydomain.com/172.26.180.6:8050. Already tried 0 time(s);
The console has been showing this message for the last 4 minutes, with no sign of the job either failing or completing. One thing I did notice, though, is that the "ResourceManager Service" is now red :(. Also, I do not see any log files under the /hadoop/yarn/log folder; it is empty. Where should I look for logs to find out what is going wrong?
@Neeraj Sabharwal I am running out of options. Can you suggest something? I now see that the ResourceManager service is running fine (it is green), but when I run the jobs I get the same message about retrying the connection to the ResourceManager. I have been waiting for the last 10 minutes, and I am still seeing the same message: "INFO ipc.Client: Retrying connect to server: mycomputer.mydomain.com/172.26.180.6:8050."
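A quick way to tell, from the client machine, whether anything is accepting connections on the ResourceManager address in the log above is to open a plain TCP connection to it. The sketch below is a hypothetical helper (`probe_port` is not a Hadoop tool); it assumes bash, whose `/dev/tcp` redirection it relies on, and GNU coreutils `timeout`. The default host and port are copied from the log line.

```shell
# Hypothetical helper: try to open a TCP connection to host:port within 5 s.
# Relies on bash's /dev/tcp redirection and GNU coreutils `timeout`.
# Defaults come from the "Retrying connect" log line; pass your own values.
probe_port() {
  local host="${1:-master.mydomain.com}"
  local port="${2:-8050}"
  if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open: ${host}:${port}"
  else
    echo "closed or unreachable: ${host}:${port}"
    return 1
  fi
}
```

For example, `probe_port master.mydomain.com 8050`. If it reports the port as closed while the dashboard shows the ResourceManager as green, the RM's own log is the next place to look.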
@Pradeep kumar What are the CPU and memory sizes in your cluster? As mentioned earlier, you have to wait and see.
In the meantime, you should check the NameNode and YARN logs:
ls -lrt /var/log/hadoop/hdfs/
ls -l /var/log/hadoop-yarn/yarn/
Look for errors in the most recently updated file.
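The steps above (list the log directory, pick the most recently updated file, search it for errors) can be scripted. This is a minimal sketch with a hypothetical function name; it assumes the log lines carry log4j-style level markers such as ERROR or FATAL, and it defaults to the YARN log directory from the commands above.

```shell
# Hypothetical helper: print the newest regular file in a log directory,
# followed by its last 20 lines containing ERROR or FATAL.
latest_log_errors() {
  local dir="${1:-/var/log/hadoop-yarn/yarn}"
  local latest=""
  local f
  # Pick the most recently modified regular file (like the last line of ls -lrt).
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    if [ -z "$latest" ] || [ "$f" -nt "$latest" ]; then
      latest="$f"
    fi
  done
  if [ -z "$latest" ]; then
    echo "no log files found in $dir" >&2
    return 1
  fi
  echo "== $latest"
  # Assumes log4j level strings appear verbatim in each log line.
  grep -E 'ERROR|FATAL' "$latest" | tail -n 20
}
```

Usage: `latest_log_errors /var/log/hadoop-yarn/yarn` and `latest_log_errors /var/log/hadoop/hdfs`.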
@Neeraj Sabharwal Those log files were like best friends to me! I found that the ResourceManager log was showing ZooKeeper-related issues. When I went to the dashboard, I saw that the ZooKeeper service was not up. I started the ZooKeeper service, and then my jobs started running fine! I had deliberately not started the ZooKeeper service, as I thought it was not required for running a simple MapReduce job. Maybe I was wrong. Neeraj, do you have a technical explanation for this? Thanks again!
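One plausible explanation, which is an assumption to verify in yarn-site.xml rather than a confirmed diagnosis: if ResourceManager recovery is enabled (`yarn.resourcemanager.recovery.enabled=true`) with the ZooKeeper-based state store (`yarn.resourcemanager.store.class` set to `org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore`), the RM needs ZooKeeper to load and save its state, so it cannot serve clients while ZooKeeper is down. A quick health check is ZooKeeper's `ruok` four-letter-word command, to which a healthy server replies `imok`. The sketch below (hypothetical `zk_ruok` helper; bash and GNU `timeout` assumed) sends it over a plain TCP connection. It assumes the default client port 2181 and that four-letter words are permitted, since newer ZooKeeper releases only allow the commands listed in `4lw.commands.whitelist`.

```shell
# Hypothetical helper: send ZooKeeper's "ruok" command and expect "imok".
# Assumes bash (/dev/tcp), GNU `timeout`, default port 2181, and that
# four-letter words are allowed on this ZooKeeper server.
zk_ruok() {
  local host="${1:-localhost}"
  local port="${2:-2181}"
  local reply
  # ZooKeeper closes the connection after answering, so `cat` terminates.
  reply=$(timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port} && printf ruok >&3 && cat <&3" 2>/dev/null)
  if [ "$reply" = "imok" ]; then
    echo "ZooKeeper at ${host}:${port} is up"
  else
    echo "no 'imok' from ${host}:${port}"
    return 1
  fi
}
```

Usage: `zk_ruok master.mydomain.com 2181`, run against each ZooKeeper host listed in the RM's ZooKeeper address setting.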
I suggest you enable RM HA: if you run into a similar issue, you can force a failover to the other RM node, and if the active RM fails, it will fail over automatically. Run netstat -tunlp | grep 8088 and ps aux | grep -i resourcemanager to confirm that the RM process is running and listening (8088 is the RM web UI port; the client in the error above connects to port 8050). Check whether memory utilization is heavy, and look at tuning the RM.
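Because a bare `grep 8088` can also match unrelated columns (PIDs, foreign addresses), a slightly stricter filter can pick out only a TCP socket in LISTEN state on a given port. This is a sketch with a hypothetical helper name; it assumes the Linux net-tools `netstat -tunlp` column layout (local address in column 4, state in column 6). Checking 8050 as well as 8088 seems worthwhile here, because 8050 is the port the client in the error was trying to reach.

```shell
# Hypothetical helper: read `netstat -tunlp` output on stdin and print the
# TCP lines in LISTEN state whose local address ends in the given port.
# Exit status is 0 only if at least one such line was found.
listening_on() {
  awk -v port="$1" '
    $1 ~ /^tcp/ && $6 == "LISTEN" {
      n = split($4, parts, ":")   # handles 0.0.0.0:8050 and :::8088
      if (parts[n] == port) { print; found = 1 }
    }
    END { exit !found }'
}
```

Usage: `netstat -tunlp | listening_on 8050` and `netstat -tunlp | listening_on 8088`. Run it as root so that the PID/Program name column is filled in for processes owned by other users, such as yarn.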
I will try RM HA, but as an alternative, is it a good idea to move the ResourceManager to another node? Most of the services are installed and running on the master node, so maybe the master node is overloaded in terms of memory? According to the graph on the dashboard, though, memory utilization on the master node is only around 50%.
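Before moving the ResourceManager, it may be worth cross-checking the dashboard's ~50% figure on the master node itself. This sketch (hypothetical helper) computes used memory from /proc/meminfo, treating MemAvailable as free; it assumes a kernel that exposes MemAvailable (mainline 3.14 and later, plus some vendor backports) and fails otherwise.

```shell
# Hypothetical helper: print host-wide used memory in percent, computed as
# (MemTotal - MemAvailable) / MemTotal from /proc/meminfo-style input.
# Exits non-zero if MemTotal or MemAvailable is missing.
mem_used_pct() {
  local file="${1:-/proc/meminfo}"
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total == 0 || avail == "") exit 1
      printf "%.1f\n", (total - avail) * 100 / total
    }' "$file"
}
```

Note that this figure is host-wide; the ResourceManager's own memory is bounded by its JVM heap setting, so an RM can be short of heap even on a host at 50%.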