Slow RM failover when the active RM crashes or loses its network connection

Hi, I'm currently using HDP 2.4.3.0-227 and testing its H-A capabilities, but I'm running into a slow failover problem.

My RMs are deployed on namenode1 and namenode2, with QJM H-A set up through Ambari.

I first run a terasort job:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-2.7.1.2.4.3.0-227.jar terasort random-data6 sorted-data36

Then I shut down the machine hosting the active RM. The job gets stuck a few seconds later and stays blocked for around 15 minutes.

After those 15 minutes, it resumes and finishes successfully. H-A is working, but it is obviously too slow for production.
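
A side note on the 15-minute figure: it is close to how long a Linux host keeps retransmitting on an established TCP connection whose peer has silently disappeared. This is only a possible explanation, not something verified on this cluster. The sketch below assumes the stock kernel settings (net.ipv4.tcp_retries2 = 15, retransmission timeout doubling from 0.2 s up to a 120 s cap); the function and constant names are made up for illustration.

```python
# Rough estimate of how long Linux keeps retransmitting to a dead TCP peer.
# Assumed kernel defaults (not checked on this cluster):
#   net.ipv4.tcp_retries2 = 15, minimum RTO = 0.2 s, maximum RTO = 120 s.
TCP_RETRIES2 = 15
RTO_MIN_S = 0.2
RTO_MAX_S = 120.0


def tcp_give_up_seconds(retries=TCP_RETRIES2, rto_min=RTO_MIN_S, rto_max=RTO_MAX_S):
    """Sum the exponentially backed-off timeouts before the connection is dropped."""
    total = 0.0
    rto = rto_min
    # One initial timeout plus one timeout per retransmission.
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


seconds = tcp_give_up_seconds()
print(f"{seconds:.1f} s, about {seconds / 60:.1f} min")  # 924.6 s, about 15.4 min
```

If that is what is happening here, a client that has no RPC-level timeout would only notice the dead RM once the kernel gives up on the connection, which would match the delay I'm seeing.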

I found a similar issue, described better than I can, on the Apache JIRA: https://issues.apache.org/jira/browse/YARN-2578

But I'm not sure how to apply the patch, and at first glance it doesn't seem like a good idea because of its target version.

Has anybody run into this problem before and solved it with this patch or something else?