Oozie - MR container fails to start on RHEL6 node when app master is on RHEL7

New Member

Hi all - we have a cluster of RHEL6 and RHEL7 nodes.

When Oozie launches a workflow and the application master runs on a RHEL7 node, the tasks dispatched to RHEL6 nodes fail while executing the launch_container.sh script. Judging from the log below, PATH is probably not set correctly, because the script cannot find the "ln" command.

Stack trace: ExitCodeException exitCode=127: /data/d9/yarn/nm/usercache/hdfs/appcache/application_1506341577822_0486/container_e102_1506341577822... line 30: ln: command not found

	at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
	at org.apache.hadoop.util.Shell.run(Shell.java:487)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 127

The job log looks like this:

2017-09-25 19:05:12,937 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1506341577822_0486_m_000000 Task Transitioned from SCHEDULED to RUNNING
2017-09-25 19:05:13,701 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1506341577822_0486: ask=1 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:76800, vCores:0> knownNMs=2
2017-09-25 19:05:14,707 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_e102_1506341577822_0486_01_000002
2017-09-25 19:05:14,708 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:0 RackLocal:0
2017-09-25 19:05:14,710 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1506341577822_0486_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
2017-09-25 19:05:14,710 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1506341577822_0486_m_000000_0: Exception from container-launch.

When MR jobs are not triggered by Oozie, all is fine on those RHEL7 nodes.

I'm able to reproduce the issue with HDP 2.3.4 and 2.3.6.

I could not find any known Oozie issue that got fixed in later versions.

Any help or pointers welcome.

Best,

-Regis

1 ACCEPTED SOLUTION

Master Guru

@regis piccand / @kirk chou

I was able to resolve a similar issue (java.io.IOException: Cannot run program "bash": error=2, No such file or directory) for a customer.

This happens because "/bin" and "/sbin" are missing from $PATH in the container launch environment.

The container's $PATH is derived from the NodeManager's environment, and the NodeManager gets its environment from the ambari-agent's /var/lib/ambari-agent/ambari-env.sh.

To fix this, add "/bin" and "/sbin" to PATH in /var/lib/ambari-agent/ambari-env.sh, then restart the ambari-agent, followed by a NodeManager restart.
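As a rough sketch of the PATH change described above: the helper below appends a directory to a colon-separated PATH string only when it is not already present, so it is safe to apply more than once. The name path_append and the sample PATH value are assumptions made for illustration; neither comes from Ambari or Hadoop.

```shell
# path_append is a hypothetical helper, not part of Ambari or Hadoop.
# $1 = current PATH value, $2 = directory to add.
# Prints the resulting PATH; the input is left unchanged if $2 is already listed.
path_append() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;          # already present: keep as-is
        *)        printf '%s\n' "${1:+$1:}$2" ;; # append, without a leading colon if empty
    esac
}

# Assumed example of a PATH that lacks /bin and /sbin.
sample_path="/usr/sbin:/usr/bin"
sample_path=$(path_append "$sample_path" /bin)
sample_path=$(path_append "$sample_path" /sbin)
echo "$sample_path"
```

In ambari-env.sh itself this would presumably come down to a single line along the lines of `export PATH=$PATH:/bin:/sbin`. The layout of that file can differ between Ambari versions, so treat that line as an assumption and check the existing file first.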

Happy Hadooping!

REPLIES

New Member

Hi, I got the same error. Can you tell me how to solve it? Thanks.

New Member

Hey @kirk chou - sorry for not posting this earlier.

This was due to differences in the system $PATH between the RHEL6 and the RHEL7 hosts ('/usr/bin/ln' vs '/bin/ln' in this case).

Oozie forcibly overrides the execution context of ShellAction child tasks, $PATH in particular, with the values defined on the application master node. If $PATH on the application master node differs from $PATH on the worker node, the task can fail on the worker node, as it did here.
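One way to check this on a given host is to resolve the command against the PATH value taken from the application master, rather than against the host's own PATH. The bash sketch below does that by walking a colon-separated PATH string. The function name resolve_in_path and the example PATH value are assumptions for illustration only.

```shell
# resolve_in_path is a hypothetical helper.
# $1 = command name, $2 = colon-separated PATH string to search.
# Prints the first matching executable and returns 0, or returns 1 if none is found.
resolve_in_path() {
    local cmd=$1 dir
    local IFS=:                 # split $2 on colons inside this function only
    for dir in $2; do
        if [ -x "$dir/$cmd" ]; then
            printf '%s\n' "$dir/$cmd"
            return 0
        fi
    done
    return 1                    # the case a shell reports as "command not found"
}

# Assumed example: a PATH copied from the application master node.
am_path="/usr/local/bin:/usr/bin"
resolve_in_path ln "$am_path" || echo "ln not found in: $am_path"
```

Running this on a worker node with the application master's PATH shows whether "ln" would be found there. The result depends on the host, so no expected output is given.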

Hope this helps,

-Regis
