I'm having some weird problems with Spark running on top of YARN and accessing HDFS (Spark 2.2
on Cloudera CDH 5.12). I'm guessing that Spark is not the cause, so I'm posting this in the HDFS sub.
There are a lot of "java.net.SocketException: Network is unreachable" errors in
the executor stderr linked via the Spark UI; part of a log file is here:
http://support.l3s.de/~zab/spark-errors.txt. The jobs also fail at
rather random times, with only several MBs' worth of the above errors logged.
Going directly to the container logs also gave the following (see the exceptions at the end): http://www.l3s.de/~zab/stderr.txt
In the driver output I get the following:
These errors usually go hand in hand with some dropped packets, but I
would assume that TCP can actually handle that?
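Since the errors seem to correlate with drops, the kernel's per-interface counters are worth watching on each node. A small shell sketch (the interface name ib0, the peer host, and the exact `ip -s link` output layout are assumptions about a typical IPoIB setup):

```shell
# On each node, inspect the counters for the IPoIB interface (ib0 here):
#   ip -s link show ib0
# and check the end-to-end path MTU with the Don't Fragment bit set
# (1472 bytes of payload + 28 bytes of IP/ICMP headers = a 1500-byte frame):
#   ping -M do -s 1472 <peer-host>

# Summing the "dropped" column of `ip -s link` output makes the drops easy
# to track over time; a captured sample is used here so the pipeline is
# reproducible:
sample='RX:  bytes packets errors dropped  missed   mcast
    1000000   9000      0      12       0       0
TX:  bytes packets errors dropped carrier collsns
    2000000   8000      0       3       0       0'
drops=$(printf '%s\n' "$sample" | awk '/RX:|TX:/ { getline; sum += $4 } END { print sum }')
echo "$drops"   # total dropped packets in the sample (12 RX + 3 TX)
```

A steadily growing counter here, rather than a one-off burst, would point at the interface or fabric rather than at Spark.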
The network backend is based on Intel OmniPath hardware running in
connected mode with an MTU of 1500 (just as a safe default for the moment).
The nodes can also ping each other without a problem, and their DNS
configuration is identical: the same hosts file is deployed to all hosts
via Ansible, and the same data is configured in the unbound DNS forwarder.
I have several code snippets that manifest this problem; the current one:
val data = session.read
val filtered = data
  .filter(_.elem === "A@/href")
val transformed = filtered
  .map(e => e.copy(date = e.date.slice(0, 10) + "T00:00:00.000-00:00"))
  .dropDuplicates(Array("src", "date", "dst"))
The input data is roughly 2.1 TB (~500 billion lines, I think) and resides on HDFS.
I'm honestly running out of ideas on how to debug this problem. I'm half
thinking that the above errors are just masking the real problem.
I would greatly appreciate any help!
Update: this has indeed been caused by the network backend dropping some packets. I'm not sure why this wasn't "caught" by TCP.
We ended up setting send_queue_size=256 recv_queue_size=512 for ib_ipoib and krcvqs=4 for hfi1. We also updated our OmniPath switch firmware to the current version. We still see _some_ dropped packets, but so far jobs haven't died because of it.