Hi HWX,
I have HDP 2.3.4 with NameNode HA. Spark jobs were being submitted just fine until one NameNode went down; since then, no Spark job starts anymore.
It looks like the HDFS client used by Spark is not failing over properly to the second NameNode:
$ hdfs dfs -ls /tmp
... working fine...
$ spark-shell --master yarn-client
...snip...
16/01/12 22:57:53 INFO ui.SparkUI: Started SparkUI at http://10.10.10.3:7884
16/01/12 22:57:53 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/12 22:57:54 INFO impl.TimelineClientImpl: Timeline service address: http://daplab-wn-12.fri.lan:8188/ws/v1/timeline/
16/01/12 22:53:16 INFO ipc.Client: Retrying connect to server: daplab-rt-11.fri.lan/10.10.10.111:8020. Already tried 0 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
16/01/12 22:53:20 INFO ipc.Client: Retrying connect to server: daplab-rt-11.fri.lan/10.10.10.111:8020. Already tried 1 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
... and so on until I lost my patience ...
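For what it's worth, here are the commands I would run to double-check the HA state from the client side; the nameservice name (mycluster) and the NameNode IDs (nn1, nn2) below are placeholders, the real values come from hdfs-site.xml:

$ hdfs getconf -confKey dfs.nameservices
$ hdfs getconf -confKey dfs.ha.namenodes.mycluster
$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -getServiceState nn2

-getServiceState should report active or standby (or fail to connect for the NameNode that is down).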
If I change /etc/hosts so that daplab-rt-11.fri.lan resolves to the active NameNode's IP instead of 10.10.10.111, then it moves forward.
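That workaround makes me think the Spark/YARN side is resolving the physical hostname daplab-rt-11.fri.lan instead of going through the HA nameservice. If I understand the HA client configuration correctly, fs.defaultFS should point to the logical nameservice and the client failover proxy provider should be set, roughly like this (again, mycluster is a placeholder for the real nameservice name):

$ hdfs getconf -confKey fs.defaultFS
hdfs://mycluster
$ hdfs getconf -confKey dfs.client.failover.proxy.provider.mycluster
org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider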
As I said, it's a fresh HDP 2.3.4 install, without anything fancy.
Thanks
Benoit