So I built the distributed shell example and am using it to run a jar file that does some processing and writes several .json files out to a results folder. I'm passing the path to the script (run.sh) in the shell_script parameter.
The script starts out with:
/usr/java/latest/bin/java -Xmx2G -classpath /<path>/MyMain.jar:<more jars>
The script had been using relative references to the main jar and all the dependencies but it wasn't finding them so I made the paths absolute. If I run the script from the command line in the folder where it resides it will create the results folder there. If I use a container to execute, then I see through logs (from MyMain.jar) that the results folder is being created in appCache but the folder is gone when processing completes; something like this:
I want to be able to run multiple containers in parallel, with all sharing the same main jar and dependencies, but each processing through a different set of input files and writing to a separate results folder.
(For 1) - The script's PWD is typically the temporary container directory, such as /yarn/nm/usercache/harsh/appcache/application_1436935612068_0002/container_1436935612068_0002_01_000002. All distributed-cache elements are typically symlinked into this directory, so they can be referenced by relative path. Here's a typical listing (ls -la . or ls -la $PWD) from a script run via a container:
total 24
drwxr-s---. 3 harsh yarn 4096 Jul 15 10:19 .
drwxr-s---. 5 harsh yarn 4096 Jul 15 10:19 ..
-rw-------. 1 harsh yarn  290 Jul 15 10:19 container_tokens
lrwxrwxrwx. 1 harsh yarn   91 Jul 15 10:19 ExecScript.sh -> /yarn/nm/usercache/harsh/appcache/application_1436935612068_0002/filecache/11/ExecScript.sh
-rwx------. 1 harsh yarn 1839 Jul 15 10:19 launch_container.sh
drwxr-s---. 2 harsh yarn 4096 Jul 15 10:19 tmp
Are you tweaking the example code to pass your required jars alongside your script? The default shell example does not distribute the app/AM jars to the shell containers, nor does it appear to offer a way to send along arbitrary files outside of the shell script/command.
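If you do modify the example so that the jars are localized next to the script (the stock example does not do this, as noted above), the script could build its classpath relative to the container working directory instead of hard-coding absolute paths. A minimal sketch, assuming the jars show up as files or symlinks in $PWD; the function name build_classpath is made up for illustration:

```shell
#!/bin/bash
# Sketch only: assumes the jars have been localized into the container's
# working directory, which the default distributed shell example does NOT do.

# Print a colon-separated classpath of every *.jar in the given directory.
build_classpath() {
  local dir="$1" cp="" jar
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue   # glob matched nothing, or a dangling symlink
    cp="${cp:+$cp:}$jar"
  done
  printf '%s\n' "$cp"
}

CLASSPATH_ARG="$(build_classpath "$PWD")"
echo "classpath: $CLASSPATH_ARG"

# The JVM launch would then mirror the original command, for example:
# /usr/java/latest/bin/java -Xmx2G -classpath "$CLASSPATH_ARG" <main class and args>
```

Because the classpath is derived from $PWD at run time, the same script works unchanged in every container, whatever its container ID happens to be.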
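(For 2) - Write results to HDFS instead; that's your distributed filesystem, where you can persist computed results. The container's working directory under appcache is cleaned up by the NodeManager once the application finishes, which is why your results folder disappears. YARN offers no "file collection" facilities outside of the logging framework (stdout, stderr, and logger outputs).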
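One way to do that from the script itself is to upload the local results folder before the script exits, using a separate HDFS directory per container so parallel containers don't overwrite each other. A rough sketch, assuming the hdfs CLI is on the container's PATH and that CONTAINER_ID is set in the container environment; HDFS_BASE, results_dest and publish_results are hypothetical names for illustration:

```shell
# Sketch only: HDFS_BASE and the local "results" folder name are assumptions.
HDFS_BASE="${HDFS_BASE:-/user/$(whoami)/results}"

# Derive a per-container destination under base dir $1.
# Falls back to the shell PID when CONTAINER_ID is not set (e.g. a local run).
results_dest() {
  printf '%s/%s\n' "$1" "${CONTAINER_ID:-local_$$}"
}

# Upload the contents of local directory $1 into HDFS directory $2.
publish_results() {
  hdfs dfs -mkdir -p "$2" && hdfs dfs -put -f "$1"/* "$2"/
}

DEST="$(results_dest "$HDFS_BASE")"
echo "results would be uploaded to: $DEST"

# After the java command finishes and before the script exits:
# publish_results ./results "$DEST"
```

Alternatively, the Java code in the jar could write straight to HDFS, which avoids the extra copy step; the script-level upload is just the smaller change to the existing setup.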