I'm having some weird problems with Spark running on top of YARN accessing HDFS (Spark 2.2 on Cloudera CDH 5.12). I'm guessing that Spark is not the cause, so I'm posting this in the HDFS sub.
There are a lot of "java.net.SocketException: Network is unreachable" errors in the executor stderr linked via the Spark UI; here is part of a log file: http://support.l3s.de/~zab/spark-errors.txt. The jobs also fail at rather random times, with only several MBs' worth of the above errors in the logs.
This has indeed turned out to be caused by the network backend dropping some packets. I'm not sure why this wasn't "caught" by TCP.
We ended up setting send_queue_size=256 and recv_queue_size=512 for ib_ipoib, and krcvqs=4 for hfi1. We also updated our OmniPath switch firmware to the current version. We still have _some_ dropped packets, but so far jobs haven't died because of it.
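For anyone hitting the same thing: one common way to make module parameters like these persistent is a modprobe config file, roughly like the following (the filename is arbitrary; adjust to your distro's conventions, and note the modules have to be reloaded or the node rebooted for the options to take effect):

```
# /etc/modprobe.d/opa-tuning.conf (filename is just an example)
# Larger IPoIB send/receive queues
options ib_ipoib send_queue_size=256 recv_queue_size=512
# More kernel receive contexts for the hfi1 (Omni-Path) HFI driver
options hfi1 krcvqs=4
```

You can check what a node is currently running with e.g. `cat /sys/module/hfi1/parameters/krcvqs`.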