Created on 08-18-2015 01:51 PM - edited 09-16-2022 02:38 AM
I went through the Slider Memcached Tutorial and was able to package, deploy, and start the memcached container successfully. However, when I package up a custom application (basically a Java jar plus dependencies), the container never launches successfully.
The application page shows the app in a FINISHED/FAILED state with this diagnostic:
http://quickstart.cloudera:8088/cluster/app/application_1439926335194_0001
Diagnostics: Unstable Application Instance : - failed with component MYAPP failed 'recently' 6 times (4 in startup); threshold is 5 - last failure: Failure container_1439926335194_0001_01_000008 on host quickstart.cloudera (0): http://quickstart.cloudera:19888/jobhistory/logs//quickstart.cloudera:8041/container_1439926335194_0...
Part of the challenge in diagnosing the issue with the container is that the logs disappear after the application completes.
http://quickstart.cloudera:8042/node/containerlogs/container_1439926335194_0001_01_000001/MYUSER
There is a troubleshooting page for Slider, which indicates that you can persist the logs beyond application completion:
http://slider.incubator.apache.org/docs/troubleshooting.html
Configuring YARN for better debugging
One configuration to aid debugging is to tell the nodemanagers to keep data for a short period after containers finish
<!-- 10 minutes after a failure to see what is left in the directory -->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>
I found this setting under YARN - Configuration - NodeManager Base Group - Advanced - Localized Dir Deletion Delay and changed it from the default of 0 to 1200. However, even after I deploy the client configuration, restart the NodeManager and YARN, and even restart the VM, the logs are still deleted on container completion.
I'm working on the CDH 5.3.0 VirtualBox VM image, and the cluster and services appear to be working normally when I start up the package.
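One way to narrow this down is to check whether the changed value actually reached the yarn-site.xml that the running NodeManager was started with. The following is a minimal sketch, not a definitive check: it only parses Hadoop-style configuration XML and reports the value of yarn.nodemanager.delete.debug-delay-sec. The function names and the sample XML are illustrative assumptions, and the location of the effective yarn-site.xml is not given in this thread, so it has to be supplied by the reader.

```python
import xml.etree.ElementTree as ET

PROPERTY = "yarn.nodemanager.delete.debug-delay-sec"


def read_property(site_xml_text, name):
    """Return the value of the first <property> with this name, or None if absent."""
    root = ET.fromstring(site_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


def debug_delay_seconds(site_xml_text):
    """Return the configured delay in seconds; 0 if unset (the default mentioned above)."""
    raw = read_property(site_xml_text, PROPERTY)
    return int(raw) if raw else 0


# Hypothetical sample standing in for the real yarn-site.xml contents.
SAMPLE = """<configuration>
  <property>
    <name>yarn.nodemanager.delete.debug-delay-sec</name>
    <value>1200</value>
  </property>
</configuration>"""

print(debug_delay_seconds(SAMPLE))
```

To use it against a real host, read the file's contents and pass them in place of SAMPLE. If the result is still 0 after the restart, the setting probably did not propagate to that file.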
Created 09-01-2015 01:22 PM
I found the container logs via the containers web UI (on the Cloudera VM it is http://quickstart.cloudera:8042/node/allContainers).
INFO 2015-08-19 14:07:28,422 AgentToggleLogger.py:40 - Queue result: {'componentStatus': [], 'reports': [{'actionId': u'4-1', 'clusterName': u'myapp1', 'exitcode': 1, 'reportResult': True, 'role': u'MYAPP', 'roleCommand': u'START', 'serviceName': u'myapp1', 'status': 'FAILED', 'stderr': '2015-08-19 14:07:28,268 - Error while executing command ...<removed for brevity>, 'stdout': '2015-08-19 14:07:23,261 - Execute[\'/usr/java/latest/bin/java -Xmx256m -classpath ...<removed for brevity>, 'structuredOut': '{}', 'taskId': 4}]}
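The "Queue result" payload in the agent log looks like a printed Python dictionary, so the failed command reports can be pulled out programmatically. The sketch below assumes that the text after "Queue result: " is a complete Python literal. The line quoted above was shortened for brevity and would not parse as-is, so the sample line here is a hypothetical, complete stand-in with made-up stderr and stdout text. The function name is also illustrative.

```python
import ast

MARKER = "Queue result: "


def failed_reports(log_line):
    """Return the 'reports' entries with status FAILED from one agent log line."""
    _, sep, payload = log_line.partition(MARKER)
    if not sep:
        return []  # not a "Queue result" line
    # literal_eval only accepts Python literals; it raises on truncated lines.
    result = ast.literal_eval(payload.strip())
    return [r for r in result.get("reports", []) if r.get("status") == "FAILED"]


# Hypothetical complete line; the stderr/stdout values are placeholders.
SAMPLE_LINE = (
    "INFO 2015-08-19 14:07:28,422 AgentToggleLogger.py:40 - Queue result: "
    "{'componentStatus': [], 'reports': [{'actionId': u'4-1', 'exitcode': 1, "
    "'role': u'MYAPP', 'roleCommand': u'START', 'status': 'FAILED', "
    "'stderr': 'hypothetical stderr text', 'stdout': 'hypothetical stdout text', "
    "'taskId': 4}]}"
)

for report in failed_reports(SAMPLE_LINE):
    print(report["role"], report["roleCommand"], report["exitcode"])
    print(report["stderr"])
```

The stderr field of a failed START report is usually the most useful part, since it carries the error from the command the agent tried to run.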
Created on 08-19-2015 11:13 AM - edited 08-19-2015 12:26 PM
The only error I see in the log is this:
Role instance RoleInstance failed
2015-08-19 10:59:21,819 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster - Role instance RoleInstance{role='MYAPP', id='container_1439926335194_0002_01_000003', container=ContainerID=container_1439926335194_0002_01_000003 nodeID=quickstart.cloudera:8041 http=quickstart.cloudera:8042 priority=1073741825 resource=<memory:1024, vCores:1>, createTime=1440007115649, startTime=1440007115674, released=false, roleId=1, host=quickstart.cloudera, hostURL=http://quickstart.cloudera:8042, state=5, placement=null, exitCode=0, command='python ./infra/agent/slider-agent/agent/main.py --label container_1439926335194_0002_01_000003___MYAPP --zk-quorum localhost:2181 --zk-reg-path /registry/users/myuser/services/org-apache-slider/myapp1> /slider-agent.out 2>&1 ; ', diagnostics='', output=null, environment=[LANGUAGE="en_US.UTF-8", AGENT_WORK_ROOT="$PWD", HADOOP_USER_NAME="C4", AGENT_LOG_ROOT="", PYTHONPATH="./infra/agent/slider-agent/", LC_ALL="en_US.UTF-8", SLIDER_PASSPHRASE="<redacted>", LANG="en_US.UTF-8"]} failed
Created 09-02-2015 05:17 AM
Thank you for sharing the steps you took, dr3x. Hopefully it will help others who face a similar issue in the future.