Posts: 16
Registered: 06-19-2014

Hadoop Distributed Shell Example - how to retrieve results from appCache?


So I built the distributed shell example and am using it to run a jar file that does some processing and writes several .json files out to a results folder. I'm passing the path to the script in the shell_script parameter.


The script starts out with:


/usr/java/latest/bin/java -Xmx2G -classpath /<path>/MyMain.jar:<more jars>


The script had been using relative references to the main jar and its dependencies, but they weren't being found, so I made the paths absolute. If I run the script from the command line in its own folder, the results folder is created there. If I run it through a container, the logs (from MyMain.jar) show the results folder being created somewhere under appCache, but the folder is gone when processing completes.



I want to be able to run multiple containers in parallel, all sharing the same main jar and dependencies, but each processing a different set of input files and writing to a separate results folder.


  1. Is there a way to run the container in the context of the folder where the script is so I can keep relative paths instead of using absolute paths?
  2. Do I hook the container-complete event and pull the results folder out of cache there, or is there a better way already built into the infrastructure?

Distributed Shell Example source on GitHub 


(Crosspost to StackOverflow)

Posts: 1,896
Kudos: 433
Solutions: 303
Registered: 07-31-2013

Re: Hadoop Distributed Shell Example - how to retrieve results from appCache?

(For 1) - The script's PWD is typically the temporary container directory, such as

/yarn/nm/usercache/harsh/appcache/application_1436935612068_0002/container_1436935612068_0002_01_000002, and all distributed-cache elements are symlinked inside it, so they can be referenced relatively. Here's a typical listing output (ls -la . or ls -la $PWD) from a script run via a container:



total 24
drwxr-s---. 3 harsh yarn 4096 Jul 15 10:19 .
drwxr-s---. 5 harsh yarn 4096 Jul 15 10:19 ..
-rw-------. 1 harsh yarn  290 Jul 15 10:19 container_tokens
lrwxrwxrwx. 1 harsh yarn   91 Jul 15 10:19 -> /yarn/nm/usercache/harsh/appcache/application_1436935612068_0002/filecache/11/
-rwx------. 1 harsh yarn 1839 Jul 15 10:19
drwxr-s---. 2 harsh yarn 4096 Jul 15 10:19 tmp
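
To make this concrete, here's a minimal runnable sketch of why relative references work. Temporary directories stand in for the NodeManager's filecache and container directories, and MyMain.jar is just a hypothetical name: the NodeManager symlinks each cached file into the container directory and launches the script with that directory as its PWD, so a bare filename resolves.

```shell
#!/bin/sh
set -e
# Stand-ins for the real paths (e.g. .../filecache/11 and the container dir):
filecache=$(mktemp -d)
containerdir=$(mktemp -d)
touch "$filecache/MyMain.jar"                             # hypothetical cached jar
ln -s "$filecache/MyMain.jar" "$containerdir/MyMain.jar"  # what the NM does
cd "$containerdir"
# A relative reference now resolves against PWD:
ls MyMain.jar    # prints: MyMain.jar
```

In other words, a classpath such as -classpath MyMain.jar:dep.jar would work unchanged inside the container, provided those files were shipped through the distributed cache.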



Are you tweaking the example code to pass your required jars alongside your script? The default shell example does not distribute the app/AM jars to the shell containers, nor does it appear to offer a way to send along arbitrary files outside of the shell script/command.


(For 2) - Write results to HDFS instead; that's your distributed filesystem, where you can persist computed results. YARN offers no "file collection" facility outside of the logging framework (stdout, stderr, and logger outputs).
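
As a sketch of what the tail of the container script could look like: local temp directories stand in here for the container dir and the HDFS destination, so the snippet runs anywhere; on a real cluster the copy would be an hdfs dfs -put, and YARN's CONTAINER_ID environment variable can keep each container's target unique. All paths and file names are hypothetical.

```shell
#!/bin/sh
set -e
# On a real cluster the copy step would be something like:
#   hdfs dfs -mkdir -p /user/<you>/results/$CONTAINER_ID
#   hdfs dfs -put results/*.json /user/<you>/results/$CONTAINER_ID/
workdir=$(mktemp -d)   # stands in for the transient container dir
target=$(mktemp -d)    # stands in for the per-container HDFS destination
mkdir "$workdir/results"
echo '{"count": 1}' > "$workdir/results/out-0.json"   # hypothetical output
cp "$workdir"/results/*.json "$target"/               # real script: hdfs dfs -put
ls "$target"           # prints: out-0.json
```

Because the appcache directory is deleted when the container exits, this copy has to happen inside the script itself, before it returns.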