Created 01-12-2016 11:26 PM
Hi HWX,
I have HDP 2.3.4 with Namenode HA. Spark jobs were submitting fine until one Namenode went down; since then, no Spark jobs start at all.
It looks like the HDFS client is not failing over to the second Namenode properly:
$ hdfs dfs -ls /tmp
... working fine ...
$ spark-shell --master yarn-master
...snip...
16/01/12 22:57:53 INFO ui.SparkUI: Started SparkUI at http://10.10.10.3:7884
16/01/12 22:57:53 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/12 22:57:54 INFO impl.TimelineClientImpl: Timeline service address: http://daplab-wn-12.fri.lan:8188/ws/v1/timeline/
16/01/12 22:53:16 INFO ipc.Client: Retrying connect to server: daplab-rt-11.fri.lan/10.10.10.111:8020. Already tried 0 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
16/01/12 22:53:20 INFO ipc.Client: Retrying connect to server: daplab-rt-11.fri.lan/10.10.10.111:8020. Already tried 1 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
... and so on until I lost my patience.
If I change the IP address in /etc/hosts so that 10.10.10.111 points to the active Namenode, then it moves forward.
As I said, it's a fresh HDP 2.3.4 install, without anything fancy.
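For what it's worth, here is a rough sketch of how I have been checking the client-side HA configuration with hdfs getconf; the <nameservice> placeholder stands in for whatever dfs.nameservices returns on this cluster, it is not a value copied from the actual configs:

$ # fs.defaultFS should point at the logical nameservice, not a single Namenode host
$ hdfs getconf -confKey fs.defaultFS
$ # the logical nameservice name defined for HA
$ hdfs getconf -confKey dfs.nameservices
$ # should list both Namenode hosts
$ hdfs getconf -namenodes
$ # the failover proxy provider must be defined for that nameservice
$ hdfs getconf -confKey dfs.client.failover.proxy.provider.<nameservice>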
Thanks
Benoit
Created 01-14-2016 05:04 PM
java.net.NoRouteToHostException is considered a failure that can be recovered from in any deployment with floating IP addresses; that was essentially the sole form of failover in Hadoop before NN-HA (HADOOP-6667 added the check). I think we ought to revisit that decision.
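For context, the MultipleLinearRandomRetry[500x2000ms] in the log above comes from the client retry settings, so a quick way to see (and temporarily shorten) them while debugging is sketched below; the "2000,3" spec is only an illustrative value, assuming this HDP build honours the standard dfs.client.retry.policy.* properties:

$ # check whether the extended retry policy is enabled and what spec it uses
$ hdfs getconf -confKey dfs.client.retry.policy.enabled
$ hdfs getconf -confKey dfs.client.retry.policy.spec
$ # pass a shorter spec for a single command while debugging (3 retries of 2000 ms each)
$ hdfs dfs -D dfs.client.retry.policy.spec="2000,3" -ls /tmp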
Created 01-31-2018 04:03 PM
The temporary workaround is to fake the hostname of the failed Namenode in /etc/hosts (or your equivalent name-resolution override) and point it at the IP of the healthy Namenode, until the failed Namenode is back up.
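A rough sketch of that workaround, assuming daplab-rt-11.fri.lan is the failed Namenode and using 10.10.10.112 as a made-up placeholder for the healthy Namenode's IP:

$ # map the failed Namenode's hostname onto the healthy Namenode's IP (placeholder address)
$ echo "10.10.10.112  daplab-rt-11.fri.lan" | sudo tee -a /etc/hosts
$ # once the failed Namenode is back, remove the fake entry again (this drops every /etc/hosts line for that host)
$ sudo sed -i '/daplab-rt-11.fri.lan/d' /etc/hosts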