I'm running into a weird issue: a BDR job sometimes succeeds, but when I re-run the same BDR job it fails with a socket timeout error, as per the screenshot:
It is a multihomed cluster with one public network (172.x.x.x) and one private network (10.x.x.x). I applied the NameNode RPC bind-host configurations as per https://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html and set the destination proxyuser on the source cluster.
Any advice is deeply appreciated.
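For anyone checking the same setup, here is a minimal sketch that reads an hdfs-site.xml and reports which of the bind-host properties from the multihoming guide are not set to the wildcard address. The property names come from that guide; the helper names (`read_bind_hosts`, `missing_wildcard`) and the `SAMPLE` content are made up for illustration, and on a Cloudera Manager cluster the effective file location is an assumption you would need to confirm yourself.

```python
import xml.etree.ElementTree as ET

# Bind-host properties listed in the HDFS multihoming guide.
BIND_HOST_KEYS = (
    "dfs.namenode.rpc-bind-host",
    "dfs.namenode.servicerpc-bind-host",
    "dfs.namenode.http-bind-host",
    "dfs.namenode.https-bind-host",
)


def read_bind_hosts(hdfs_site_xml: str) -> dict:
    """Map each bind-host key to its configured value, or None if absent."""
    root = ET.fromstring(hdfs_site_xml)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        found[name] = value
    return {key: found.get(key) for key in BIND_HOST_KEYS}


def missing_wildcard(hdfs_site_xml: str) -> list:
    """Return the bind-host keys that are not set to 0.0.0.0."""
    values = read_bind_hosts(hdfs_site_xml)
    return [key for key, value in values.items() if value != "0.0.0.0"]


# Hypothetical sample: only two of the four properties are configured.
SAMPLE = """<configuration>
  <property><name>dfs.namenode.rpc-bind-host</name><value>0.0.0.0</value></property>
  <property><name>dfs.namenode.servicerpc-bind-host</name><value>0.0.0.0</value></property>
</configuration>"""
```

Calling `missing_wildcard(SAMPLE)` lists the properties that still need attention. Running the same check against the configuration of each NameNode may show whether one of them differs from the other.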
Hi @EricL ,
I realized the failure occurs whenever one particular node (a NameNode, either active or passive) is running the job. Whenever it hits step 3, "Trigger a HDFS Replication", it fails with the socket timeout error. I have a personal lab with the same design architecture, and it fails there as well on the same particular NameNode.
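Since the failure follows one node, a quick way to narrow it down might be to run a small probe from that node and from a healthy one, then compare which address the peer NameNode hostname resolves to and whether a TCP connect completes. This is only a rough sketch: the function names are hypothetical, and the hostname and port you pass in are assumptions to replace with your own values (8020 is a common NameNode RPC port, but check your cluster).

```python
import socket
import time


def resolve_all(hostname: str) -> list:
    """Return the distinct IPv4 addresses the hostname resolves to on this node."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe(host: str, port: int, timeout: float = 5.0) -> tuple:
    """Attempt a TCP connect and return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # Covers timeouts, refused connections, and name resolution errors.
        return False, time.monotonic() - start, str(exc)
```

For example, `resolve_all("nn1.example.internal")` followed by `probe(address, 8020)` for each returned address (both the hostname and the port here are placeholders). If the failing node resolves the peer to the 172.x.x.x address while the working nodes use 10.x.x.x, or only one of the two addresses accepts connections, that would point towards a multihoming or name resolution problem rather than BDR itself.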
Hi @RobinRo ,
I wonder if that particular node is a multihomed host? I found this JIRA and wonder if it can help: