Support Questions

Short Answer:

Turn off scatter-gather on the network interface.

Long Version:

Data transfer between the container and the shuffle service happens through RPC calls (ChunkFetchRequest, ChunkFetchSuccess, and ChunkFetchFailure).

On further debugging with trace-level logs, we found that RPC calls were indeed being exchanged between the container and the shuffle service, but after some time they stopped abruptly: no further RPC calls were logged by either the shuffle service or the container.

Looking into the kernel and system activity logs, we found the following:

xen_netfront: xennet: skb rides the rocket: 19 slots

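That means our EC2 machines were losing network packets: the Xen network frontend driver (xen_netfront) logs this message when an outgoing packet (skb) needs more slots than it can handle.

More information on this log message can be found in the following blog post: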
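To see how often a host hits this condition, the kernel log can be searched for the message. The following is only a minimal sketch: the function name count_rocket_drops is made up for illustration, and it assumes the message text matches the line quoted above.

```shell
#!/bin/sh
# Count kernel log lines carrying the xen_netfront message quoted above.
# Reads log text on stdin, so it works with dmesg output or a saved log file.
# (count_rocket_drops is a hypothetical helper name, not an existing tool.)
count_rocket_drops() {
    # grep -c exits non-zero when nothing matches; "|| true" keeps the
    # function usable under "set -e" while still printing 0.
    grep -c 'skb rides the rocket' || true
}

# Typical use on an affected host (reading the kernel log may need root):
#   dmesg | count_rocket_drops
```

A count that keeps growing while shuffle fetches stall would be consistent with the behaviour described here.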

http://www.brendangregg.com/blog/2014-09-11/perf-kernel-line-tracing.html

So we tried turning off scatter-gather on the network interface (eth0 in our case) with the following command:

sudo ethtool -K eth0 sg off

The error was gone after that.
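To confirm that the setting took effect, the current offload features can be listed with ethtool -k (lowercase k) and the scatter-gather line checked. A minimal sketch, assuming the interface is eth0 as above; sg_state is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Print the top-level scatter-gather state from `ethtool -k <iface>` output
# supplied on stdin. The value is usually "on" or "off", possibly followed
# by a tag such as "[fixed]". Indented sub-features (tx-scatter-gather, ...)
# are skipped because their first field starts with a tab.
# (sg_state is a hypothetical helper name, not part of ethtool.)
sg_state() {
    awk -F': ' '$1 == "scatter-gather" { print $2; exit }'
}

# Typical use:
#   ethtool -k eth0 | sg_state
```

Note that a change made with ethtool -K normally does not survive a reboot, so it may need to be reapplied at boot through whatever network configuration mechanism the host uses.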
