Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

Spark in YARN with Namenode HA

New Member

Hi HWX,

I have HDP 2.3.4 with Namenode HA. Spark jobs submitted fine until one Namenode went down; since then, no Spark jobs will start.

It looks like the HDFS client is not failing over properly to the second Namenode:

$ hdfs dfs -ls /tmp
... working fine...
$ spark-shell --master yarn-client
...snip...
16/01/12 22:57:53 INFO ui.SparkUI: Started SparkUI at http://10.10.10.3:7884
16/01/12 22:57:53 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/12 22:57:54 INFO impl.TimelineClientImpl: Timeline service address: http://daplab-wn-12.fri.lan:8188/ws/v1/timeline/
16/01/12 22:53:16 INFO ipc.Client: Retrying connect to server: daplab-rt-11.fri.lan/10.10.10.111:8020. Already tried 0 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
16/01/12 22:53:20 INFO ipc.Client: Retrying connect to server: daplab-rt-11.fri.lan/10.10.10.111:8020. Already tried 1 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]

.. and so on until I lost my patience...

If I change the entry in /etc/hosts so that 10.10.10.111 points to the active Namenode, it moves forward.

As I said, it's a fresh HDP 2.3.4 install, without anything fancy.

Thanks

Benoit

1 ACCEPTED SOLUTION


java.net.NoRouteToHostException is considered a recoverable failure in any deployment with floating IP addresses. That was essentially the sole form of failover in Hadoop before Namenode HA (HADOOP-6667 added the check). I think we ought to revisit that decision.
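More generally, the HDFS client only fails over between Namenodes when it addresses the logical nameservice rather than a single Namenode host; the retry loop against daplab-rt-11.fri.lan:8020 in the log suggests the Spark job is resolving one concrete host. A sketch of the client-side hdfs-site.xml settings involved, where the nameservice name "mycluster" and the second hostname "daplab-rt-12.fri.lan" are illustrative assumptions, not values from this cluster:

```xml
<!-- Illustrative sketch: nameservice name and second hostname are assumptions -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>daplab-rt-11.fri.lan:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>daplab-rt-12.fri.lan:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

With this in place, clients refer to hdfs://mycluster/... and the ConfiguredFailoverProxyProvider tries each configured Namenode in turn instead of retrying a single dead host.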


10 REPLIES

New Member

A temporary workaround is to remap the hostname of the failed Namenode in /etc/hosts (or equivalent) to the IP of the healthy Namenode, until the failed Namenode is back up.
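The workaround above can be sketched as follows. The hostname is taken from the thread's log output; the healthy Namenode's IP (10.10.10.112) is an assumption you would replace with your own:

```shell
# Hypothetical sketch of the /etc/hosts workaround.
# 10.10.10.112 is an assumed IP for the healthy Namenode - replace with yours.
ACTIVE_NN_IP="10.10.10.112"
DOWN_NN_HOST="daplab-rt-11.fri.lan"   # failed Namenode's hostname (from the log)

# Build the hosts-file line mapping the dead host's name to the live IP.
HOSTS_ENTRY="${ACTIVE_NN_IP} ${DOWN_NN_HOST}"
echo "${HOSTS_ENTRY}"

# To apply (needs root), append it to /etc/hosts, e.g.:
#   echo "${HOSTS_ENTRY}" | sudo tee -a /etc/hosts
```

Remember to remove the entry once the failed Namenode is back, or clients will keep talking to the wrong machine.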