Created 07-05-2018 06:28 PM
Short Answer:
Turn off scatter-gather on the network interface.
Long Version:
The data transfer between the container and the shuffle service happens through RPC calls (ChunkFetchRequest, ChunkFetchSuccess and ChunkFetchFailure).
On further debugging with trace-level logs, we found that RPC calls were indeed happening between the container and the shuffle service, and that after some time they stopped abruptly (no more RPC calls were logged) on both the shuffle service and the container.
On looking into the kernel and system activity logs, we found the following:
xen_netfront: xennet: skb rides the rocket: 19 slots
This message comes from the Xen network front-end driver (xen_netfront) and means that our EC2 machines were losing network packets.
More information on this log message can be found in the following blog post:
http://www.brendangregg.com/blog/2014-09-11/perf-kernel-line-tracing.html
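To check whether a machine is hitting the same problem, the kernel log can be searched for that message. The following is only a minimal sketch: the helper name count_rocket_drops is made up for illustration, and it assumes the message text matches the one shown above.

```shell
# Hypothetical helper: count kernel-log lines that contain the
# xen_netfront message. It reads log text on stdin, so it works with
# live dmesg output as well as a saved log file.
count_rocket_drops() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when there are none, so "|| true" keeps the helper's status at 0.
    grep -c 'skb rides the rocket' || true
}

# Usage (reading the kernel ring buffer may require root):
#   dmesg | count_rocket_drops
```

A non-zero count suggests the same symptom is present on that host; a count of zero does not rule out other causes of packet loss.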
So we tried turning off scatter-gather with the following command:
sudo ethtool -K eth0 sg off
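It can help to confirm the setting before and after the change. The sketch below assumes the interface is eth0, as in the command above, and that "ethtool -k" prints a top-level line of the form "scatter-gather: on" or "scatter-gather: off"; the helper name sg_state is made up for illustration.

```shell
# Hypothetical helper: print the scatter-gather state ("on" or "off")
# from "ethtool -k" output supplied on stdin.
sg_state() {
    # Match only the top-level "scatter-gather:" line, not the
    # "tx-scatter-gather" sub-features listed beneath it.
    awk '$1 == "scatter-gather:" { print $2; exit }'
}

# Usage (eth0 is an assumption; substitute your interface name):
#   ethtool -k eth0 | sg_state      # expected "on" before the change
#   sudo ethtool -K eth0 sg off
#   ethtool -k eth0 | sg_state      # expected "off" after the change
```

Settings applied with ethtool -K generally do not survive a reboot, so the command may need to be reapplied at boot if the fix should be permanent.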
The error was gone after that.