Hadoop Distributed Shell Example - how to retrieve results from appCache?

Explorer

I built the distributed shell example and am using it to run a jar that does some processing and writes several .json files to a results folder. I'm passing the path to my script (run.sh) in the shell_script parameter.

 

The script starts out with:

 

/usr/java/latest/bin/java -Xmx2G -classpath /<path>/MyMain.jar:<more jars>

 

The script had been using relative references to the main jar and its dependencies, but they weren't being found, so I made the paths absolute. If I run the script from the command line in the folder where it resides, it creates the results folder there. If I run it in a container, the logs (from MyMain.jar) show the results folder being created in the appCache, but the folder is gone when processing completes; something like this:

/yarn/nm/usercache/cloudera/appcache/<applicationId>/<containerId>/results

 

I want to run multiple containers in parallel, all sharing the same main jar and dependencies, but each processing a different set of input files and writing to a separate results folder.

 

  1. Is there a way to run the container in the context of the folder where the script resides, so I can keep relative paths instead of absolute ones?
  2. Should I hook the container-complete event and pull the results folder out of the cache there, or is there a better way already built into the infrastructure?


Distributed Shell Example source on GitHub 

 

(Crosspost to StackOverflow)

1 REPLY

Re: Hadoop Distributed Shell Example - how to retrieve results from appCache?

Master Guru

(For 1) - The script's PWD is typically the temporary container directory, such as /yarn/nm/usercache/harsh/appcache/application_1436935612068_0002/container_1436935612068_0002_01_000002, and within this directory all distributed-cache elements are typically symlinked, so they can be referenced with relative paths. Here's a typical listing (ls -la . or ls -la $PWD) from a script run via a container:

total 24
drwxr-s---. 3 harsh yarn 4096 Jul 15 10:19 .
drwxr-s---. 5 harsh yarn 4096 Jul 15 10:19 ..
-rw-------. 1 harsh yarn  290 Jul 15 10:19 container_tokens
lrwxrwxrwx. 1 harsh yarn   91 Jul 15 10:19 ExecScript.sh -> /yarn/nm/usercache/harsh/appcache/application_1436935612068_0002/filecache/11/ExecScript.sh
-rwx------. 1 harsh yarn 1839 Jul 15 10:19 launch_container.sh
drwxr-s---. 2 harsh yarn 4096 Jul 15 10:19 tmp
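
With everything localized into that directory, the script can stick to relative paths. A minimal sketch of what run.sh could look like, assuming MyMain.jar and a lib/ directory of dependency jars have been localized alongside the script (the jar layout and main class name here are hypothetical):

#!/bin/sh
# Sketch of run.sh, assuming MyMain.jar and a lib/ directory of dependency
# jars were localized into the container's working directory (they would
# appear as symlinks, like ExecScript.sh in the listing above).
# Since PWD is the container directory, these relative paths resolve.
/usr/java/latest/bin/java -Xmx2G \
  -classpath "./MyMain.jar:./lib/*" \
  com.example.MyMain "$@"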

 

 

Are you tweaking the example code to pass your required jars alongside your script? The default shell example does not distribute the app/AM jars to the shell containers, nor does it appear to offer a way to send along arbitrary files outside of the shell script/command.
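
The proper route would be to modify the example's client/AM code to register your jars as additional LocalResources for the shell containers. A lighter-weight workaround, sketched below with hypothetical HDFS paths, is to have the script stage its own jars at startup, since the script is the one thing the example does distribute:

# Workaround sketch at the top of run.sh: pull the jars down from HDFS
# into the container's working directory before launching the JVM.
# (/apps/myapp is a hypothetical location you'd upload the jars to.)
hdfs dfs -get /apps/myapp/MyMain.jar .
hdfs dfs -get /apps/myapp/lib lib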

 

(For 2) - Write results to HDFS instead; that's your distributed filesystem, where you can persist computed results. YARN offers no "file collection" facilities outside of the logging framework (stdout, stderr, and logger outputs).
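
For example, the end of run.sh could push the results folder out before the container exits. The NodeManager sets a CONTAINER_ID variable in each container's environment, which gives every parallel container a unique target directory (the /user/cloudera/results base path is hypothetical):

# Sketch: persist results to HDFS before the container finishes and its
# appcache directory is cleaned up. $CONTAINER_ID is set by the NodeManager,
# so parallel containers each write to their own directory.
hdfs dfs -mkdir -p /user/cloudera/results/$CONTAINER_ID
hdfs dfs -put results/* /user/cloudera/results/$CONTAINER_ID/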
