Last night we had a weird situation.
One of our Spark processes ended 3 minutes after the backup job started.
The backup just runs a simple mysqldump to save all the metadata, followed by a fetchImage from HDFS.
My question is: is it possible that a specific Spark job, which had been running correctly for a few hours, was ended because the backup process started?
This Spark job only accesses HDFS (according to the development team), so could the fetchImage be killing something, or signaling something to stop reading from HDFS?
I'm kind of confused at this moment... this is why I'm asking the question here.
Our cluster is super stable at this point in time, and this has never happened before. The only weird thing is that the time and day of the backup match those of the crashing Spark job. Like... 1+1 = 2...
Could it be something else?
2019-07-21 21:46:26 INFO ContainerManagementProtocolProxy:260 - Opening proxy : "NODE1 :)":8041
2019-07-22 00:33:35 INFO YarnAllocator:54 - Completed container container_e16_1562587047011_1317_01_000013 on host: "NODE 5 =)" (state: COMPLETE, exit status: 1)
2019-07-22 00:33:35 WARN YarnAllocator:66 - Container marked as failed: container_e16_1562587047011_1317_01_000013 on host: "NODE 5 =)". Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e16_1562587047011_1317_01_000013
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
    at org.apache.hadoop.util.Shell.run(Shell.java:507)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:399)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Shell output: main : command provided 1
main : run as user is XXXXX
main : requested yarn user is XXXXX
Writing to tmp file /u11/hadoop/yarn/nm/nmPrivate/application_1562587047011_1317/container_e16_1562587047011_1317_01_000013/container_e16_1562587047011_1317_01_000013.pid.tmp
Writing to cgroup task files...
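
To check whether the timing really lines up, one rough idea is to pull the failed containers out of the driver log and compare their timestamps with the backup start time. The Python below is only a sketch under assumptions: the regex is written against the "Completed container" line format shown in the log above, and the function names, the `backup_start` argument, and the 10-minute window are hypothetical choices of mine, not anything Spark or YARN provides.

```python
import re
from datetime import datetime, timedelta

# Pattern written against the YarnAllocator "Completed container" line
# format seen in the log above (an assumption; other versions may differ).
COMPLETED = re.compile(
    r'(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) INFO YarnAllocator:\d+ - '
    r'Completed container (?P<container>container_\w+) on host: (?P<host>.+?) '
    r'\(state: (?P<state>\w+), exit status: (?P<exit>-?\d+)\)'
)


def completed_containers(log_text):
    """Return one dict per 'Completed container' entry found in log_text."""
    results = []
    for m in COMPLETED.finditer(log_text):
        results.append({
            "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"),
            "container": m.group("container"),
            "host": m.group("host"),
            "state": m.group("state"),
            "exit_status": int(m.group("exit")),
        })
    return results


def failed_near(log_text, backup_start, window_minutes=10):
    """Return containers with a non-zero exit status that completed within
    window_minutes after backup_start (a hypothetical helper)."""
    window = timedelta(minutes=window_minutes)
    return [
        c for c in completed_containers(log_text)
        if c["exit_status"] != 0
        and timedelta(0) <= c["time"] - backup_start <= window
    ]
```

Calling `failed_near(log_text, backup_start)` with the driver log contents and the time the backup job started would list the containers that died shortly after the backup began. A match would only show a correlation in time, not that the backup caused the failure; the container's own stderr would still be needed for the real reason behind exit code 1.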