
Oozie Workflow Shell Action getting killed - Memory problem?

Rising Star

Hi all, I'm trying to set up an Oozie workflow (Shell action) in Hue, but I can't find any obvious errors in the logs. Nevertheless, the action gets killed by the NodeManager's ContainerManager.

 

I get an "exit code 143", which is probably the problem (but I don't really get what it means).

 

The only cause I can think of is that at some point the virtual memory used by the container exceeds its limit: the container gets 1 GB of physical memory, which with a vmem-pmem ratio of 2.1 gives a 2.1 GB virtual memory ceiling, and the monitoring log below reports up to 3.4 GB of virtual memory in use.

 

This is the relevant snippet of what I find in the events log:

 

 

Jul 15, 10:55:37.062 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource	
Resource hdfs://XXXX.XXXX.XXX:8020/user/hue/oozie/workspaces/hue-oozie-1468572916.13/lib/move_scadenza_smartnumber.sh(->/hdp01/yarn/nm/filecache/521/move_scadenza_smartnumber.sh) transitioned from DOWNLOADING to LOCALIZED
Jul 15, 10:55:37.063 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource	
Resource hdfs://XXXX.XXXX.XXX:8020/user/msgnet/.staging/job_1468572612013_0001/job.xml(->/hdp01/yarn/nm/usercache/msgnet/appcache/application_1468572612013_0001/filecache/10/job.xml) transitioned from DOWNLOADING to LOCALIZED
Jul 15, 10:55:37.064 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container	
Container container_1468572612013_0001_01_000002 transitioned from LOCALIZING to LOCALIZED
Jul 15, 10:55:37.191 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container	
Container container_1468572612013_0001_01_000002 transitioned from LOCALIZED to RUNNING
Jul 15, 10:55:37.207 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor	
launchContainer: [bash, /hdp01/yarn/nm/usercache/msgnet/appcache/application_1468572612013_0001/container_1468572612013_0001_01_000002/default_container_executor.sh]
Jul 15, 10:55:39.415 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl	
Starting resource-monitoring for container_1468572612013_0001_01_000002
Jul 15, 10:55:39.473 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl	
Memory usage of ProcessTree 1672 for container-id container_1468572612013_0001_01_000002: 85.9 MB of 1 GB physical memory used; 1.4 GB of 2.1 GB virtual memory used
Jul 15, 10:55:42.587 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl	
Memory usage of ProcessTree 1672 for container-id container_1468572612013_0001_01_000002: 168.9 MB of 1 GB physical memory used; 1.4 GB of 2.1 GB virtual memory used
Jul 15, 10:55:45.633 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl	
Memory usage of ProcessTree 1672 for container-id container_1468572612013_0001_01_000002: 281.3 MB of 1 GB physical memory used; 3.3 GB of 2.1 GB virtual memory used
Jul 15, 10:55:48.668 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl	
Memory usage of ProcessTree 1672 for container-id container_1468572612013_0001_01_000002: 231.4 MB of 1 GB physical memory used; 3.4 GB of 2.1 GB virtual memory used
Jul 15, 10:55:51.699 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl	
Memory usage of ProcessTree 1672 for container-id container_1468572612013_0001_01_000002: 299.4 MB of 1 GB physical memory used; 3.4 GB of 2.1 GB virtual memory used
Jul 15, 10:55:54.735 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl	
Memory usage of ProcessTree 1672 for container-id container_1468572612013_0001_01_000002: 264.5 MB of 1 GB physical memory used; 3.2 GB of 2.1 GB virtual memory used
Jul 15, 10:55:57.153 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl	
Stopping container with container Id: container_1468572612013_0001_01_000002
Jul 15, 10:55:57.156 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container	
Container container_1468572612013_0001_01_000002 transitioned from RUNNING to KILLING
Jul 15, 10:55:57.156 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch	
Cleaning up container container_1468572612013_0001_01_000002
Jul 15, 10:55:57.158 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger	
USER=msgnet IP=10.252.0.14 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1468572612013_0001 CONTAINERID=container_1468572612013_0001_01_000002
Jul 15, 10:55:57.182 AM	WARN	org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor	
Exit code from container container_1468572612013_0001_01_000002 is : 143
Jul 15, 10:55:57.244 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container	
Container container_1468572612013_0001_01_000002 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
Jul 15, 10:55:57.247 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor	
Deleting absolute path : /hdp01/yarn/nm/usercache/msgnet/appcache/application_1468572612013_0001/container_1468572612013_0001_01_000002
Jul 15, 10:55:57.253 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger	
USER=msgnet OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1468572612013_0001 CONTAINERID=container_1468572612013_0001_01_000002
Jul 15, 10:55:57.255 AM	INFO	org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container	
Container container_1468572612013_0001_01_000002 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE

 

This is the workflow XML (copied/pasted from Hue):

 

<workflow-app name="Caricamento_Log_Testo" xmlns="uri:oozie:workflow:0.5">
    <start to="shell-b728"/>
    <action name="shell-b728">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>lib/move_scadenza_smartnumber.sh</exec>
              <capture-output/>
        </shell>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="End"/>
</workflow-app>

Among the things I've tried to get past the problem (with no success) is putting the following in the YARN configuration (the yarn-site.xml safety valve; maybe these properties need to go somewhere else?):

 

<property>
   <name>yarn.nodemanager.vmem-check-enabled</name>
   <value>false</value>
   <description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
   <name>yarn.nodemanager.vmem-pmem-ratio</name>
   <value>4</value>
   <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>

Any ideas? I've run out of them and I'm getting a bit frustrated by this issue. Thanks a lot for your help!

 

 

3 REPLIES

Re: Oozie Workflow Shell Action getting killed - Memory problem?

Rising Star

I forgot to paste the Shell Script that I'm trying to execute:

 

#!/bin/bash
# Script move output Parquet Files into Table
################################################################

export DATA_DIR=/user/path_from
export DEST_DIR=/user/hive/dest_db/dest_table

# Loops over "parts" files
# Please note that just the list of clean file names is processed here
# (substring after last occurrence of "/")
################################################################
hdfs dfs -ls $DATA_DIR/part-m-?????.snappy.parquet | awk '{print $8}' | sed 's/.*\///'

OUT=$?

if [ $OUT -eq 0 ];then
   hdfs dfs -ls $DATA_DIR/part-m-?????.snappy.parquet | awk '{print $8}' | sed 's/.*\///' | while read FILETOMOVE

   do
      hdfs dfs -mv "$DATA_DIR/$FILETOMOVE" "$DEST_DIR/smartnumber_log_$(date '+%Y%m%d%H%M%S%3N').snappy.parquet"
   done

else

   # exit with the status of the "hdfs dfs -ls" above ("$?" here would be the test's status)
   exit $OUT

fi
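
Since the container log shows no error from the script itself, a traced variant of the same script might help pinpoint the failing command (just a sketch; set -x and PS4 are plain bash, the paths are the same as above):

#!/bin/bash
# Debugging sketch: trace every command with its line number so the
# failing hdfs call and its exit status show up in the container's
# stderr log.
set -x
PS4='+ line ${LINENO}: '

export DATA_DIR=/user/path_from
export DEST_DIR=/user/hive/dest_db/dest_table

hdfs dfs -ls $DATA_DIR/part-m-?????.snappy.parquet | awk '{print $8}' | sed 's/.*\///' | while read -r FILETOMOVE
do
   hdfs dfs -mv "$DATA_DIR/$FILETOMOVE" "$DEST_DIR/smartnumber_log_$(date '+%Y%m%d%H%M%S%3N').snappy.parquet"
   RC=$?
   if [ $RC -ne 0 ]; then
      echo "hdfs dfs -mv failed for $FILETOMOVE with exit code $RC" >&2
      exit $RC
   fi
done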

Re: Oozie Workflow Shell Action getting killed - Memory problem?

Rising Star

Update: if I insert a "sleep 60" as the first instruction in the shell script, the Oozie workflow crash happens one minute later (two minutes later with "sleep 120", and so on).

 

Therefore, the shell script definitely gets loaded and executed in the container. The crash happens inside the shell script.

 

As a matter of fact, after the "sleep" finding, inside the script I substituted the

 

hdfs dfs -mv ...

command with the following:

 

hdfs dfs -ls ...

and the Workflow SUCCEEDS (no 143 exit code, green light on the Hue interface).

 

Of course the script is useless with a "-ls" instead of a "-mv", but at this point I believe it's a permissions issue, even though I have no direct evidence of that in any of the logs... More confused than ever.

 

I still need help: can I execute the "hdfs dfs -mv" command as an arbitrary user, e.g. "hdfs"?
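
To be concrete, this is the kind of thing I mean (just a sketch: as far as I know HADOOP_USER_NAME is only honored with simple authentication, not on a Kerberized cluster, and the file name below is a placeholder):

# Sketch: run a single move as the "hdfs" superuser.
# Works only with simple authentication (no Kerberos); the file name is a placeholder.
export HADOOP_USER_NAME=hdfs
hdfs dfs -mv /user/path_from/part-m-00000.snappy.parquet \
             /user/hive/dest_db/dest_table/smartnumber_log_test.snappy.parquet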

 

Re: Oozie Workflow Shell Action getting killed - Memory problem?

Rising Star

I ended up changing the permissions to 777 on both the "source" and "destination" directories on HDFS.

 

Of course, I have a limited understanding of all the security implications behind the scenes, but this seems to me a bug rather than a feature. If I log in to Hue as a particular user, and that user has read/write permissions on the relevant directories, I don't see why I should be forced to change permissions to 777 for the container to do its job.
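
If anyone can confirm it, a narrower alternative to 777 would be granting access only to the user the container actually runs as, e.g. with HDFS ACLs (a sketch; it assumes dfs.namenode.acls.enabled=true and that "msgnet" is the user that needs the access):

# Sketch: grant rwx to a single user instead of opening the directories to everyone.
# Assumes dfs.namenode.acls.enabled=true and that "msgnet" is the user in question.
hdfs dfs -setfacl -m user:msgnet:rwx /user/path_from
hdfs dfs -setfacl -m user:msgnet:rwx /user/hive/dest_db/dest_table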

 

Therefore I'm not marking this post as "resolved", even though I managed to work around the issue (in a rather ugly way, actually).