
Slow failover from RM when crashing it or stopping its network


Hi, I'm currently using HDP 2.4.3.0-227 and testing its HA capabilities, but I'm running into a slow failover problem.

My ResourceManagers are deployed on namenode1 and namenode2, with HA (QJM) configured through Ambari.

I first run a terasort job:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-2.7.1.2.4.3.0-227.jar terasort random-data6 sorted-data36

Then I shut down the machine running the active RM. A few seconds later the job gets stuck and stays blocked for around 15 minutes.

After these 15 minutes, it resumes and finishes successfully. HA is working, but it is obviously too slow for production.
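
To put a number on the RM-side part of the delay, something like the following could be run from another node right after shutting down the active RM. It is only a sketch: the RM ids `rm1` and `rm2` are assumptions and should match `yarn.resourcemanager.ha.rm-ids` on the cluster, and `yarn rmadmin -getServiceState` may itself take a while to return when the target host is unreachable.

```shell
#!/usr/bin/env bash
# Sketch: measure how long it takes until some ResourceManager reports "active".
# ASSUMPTION: rm1/rm2 are hypothetical ids; use the values of
# yarn.resourcemanager.ha.rm-ids from yarn-site.xml.
RM_IDS="${RM_IDS:-rm1 rm2}"

# Print the id of the first RM whose HA state is "active"; return 1 if none is.
active_rm() {
  local id state
  for id in $RM_IDS; do
    state=$(yarn rmadmin -getServiceState "$id" 2>/dev/null)
    if [ "$state" = "active" ]; then
      echo "$id"
      return 0
    fi
  done
  return 1
}

# Poll once per second until an RM is active or the limit (in seconds) expires.
wait_for_active() {
  local limit="${1:-1200}" start now id
  start=$(date +%s)
  now=$start
  while [ $((now - start)) -lt "$limit" ]; do
    if id=$(active_rm); then
      echo "active=$id after $((now - start))s"
      return 0
    fi
    sleep 1
    now=$(date +%s)
  done
  echo "no active RM after ${limit}s" >&2
  return 1
}
```

Calling `wait_for_active 1200` prints which RM became active and after how many seconds. If the standby RM becomes active quickly while the job still hangs for 15 minutes, that would suggest the delay is on the client or NodeManager side rather than in the RM election itself.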

I found a similar issue, described better than I can, on the Apache JIRA: https://issues.apache.org/jira/browse/YARN-2578

But I'm not sure how to apply the patch, and I doubt it is a good idea in the first place because of its target version.

Has anybody encountered this problem before and solved it with this patch or something else?