Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

Spark in YARN with Namenode HA

New Member

Hi HWX,

I have HDP 2.3.4 with Namenode HA. Spark jobs submitted fine until one Namenode went down; since then, no Spark jobs will start.

It looks like the HDFS client is not failing over properly to the second Namenode:

$ hdfs dfs -ls /tmp
... working fine...
$ spark-shell --master yarn-client
...snip...
16/01/12 22:57:53 INFO ui.SparkUI: Started SparkUI at http://10.10.10.3:7884
16/01/12 22:57:53 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/12 22:57:54 INFO impl.TimelineClientImpl: Timeline service address: http://daplab-wn-12.fri.lan:8188/ws/v1/timeline/
16/01/12 22:53:16 INFO ipc.Client: Retrying connect to server: daplab-rt-11.fri.lan/10.10.10.111:8020. Already tried 0 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
16/01/12 22:53:20 INFO ipc.Client: Retrying connect to server: daplab-rt-11.fri.lan/10.10.10.111:8020. Already tried 1 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]

.. and so on until I lost my patience...

If I change the entry in /etc/hosts so that 10.10.10.111 points to the active Namenode, it moves forward.

As I said, it's a fresh HDP 2.3.4 install, without anything fancy.

Thanks

Benoit

1 ACCEPTED SOLUTION


java.net.NoRouteToHostException is considered a recoverable failure in any deployment with floating IP addresses. That was essentially the sole form of failover in Hadoop before Namenode HA (HADOOP-6667 added the check). I think we ought to revisit that decision.
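More generally, the HDFS client only fails over between Namenodes when it addresses the logical nameservice rather than a single Namenode host; the retry loop against daplab-rt-11.fri.lan:8020 in the log suggests the Spark job is resolving one concrete host. A sketch of the client-side hdfs-site.xml settings involved, where the nameservice name "mycluster" and the second hostname "daplab-rt-12.fri.lan" are illustrative assumptions, not values from this cluster:

```xml
<!-- Illustrative sketch: nameservice name and second hostname are assumptions -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>daplab-rt-11.fri.lan:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>daplab-rt-12.fri.lan:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

With this in place, clients refer to hdfs://mycluster/... and the ConfiguredFailoverProxyProvider tries each configured Namenode in turn instead of retrying a single dead host.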


10 REPLIES

New Member

A temporary workaround is to remap the hostname of the failed Namenode in /etc/hosts (or equivalent) to the IP of the healthy Namenode, until the failed Namenode is back up.
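The workaround above can be sketched as follows. The hostname is taken from the thread's log output; the healthy Namenode's IP (10.10.10.112) is an assumption you would replace with your own:

```shell
# Hypothetical sketch of the /etc/hosts workaround.
# 10.10.10.112 is an assumed IP for the healthy Namenode - replace with yours.
ACTIVE_NN_IP="10.10.10.112"
DOWN_NN_HOST="daplab-rt-11.fri.lan"   # failed Namenode's hostname (from the log)

# Build the hosts-file line mapping the dead host's name to the live IP.
HOSTS_ENTRY="${ACTIVE_NN_IP} ${DOWN_NN_HOST}"
echo "${HOSTS_ENTRY}"

# To apply (needs root), append it to /etc/hosts, e.g.:
#   echo "${HOSTS_ENTRY}" | sudo tee -a /etc/hosts
```

Remember to remove the entry once the failed Namenode is back, or clients will keep talking to the wrong machine.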