I am working on a Spark Java wrapper that uses third-party libraries, which read files from a hard-coded directory named "resdata", relative to wherever the job executes. I know this is twisted, but I will try to explain.
When I execute the job, it tries to find the required files in a path something like the one below.
I assume it is looking for a directory named "resdata" under the current working directory. At this point I don't know how to set the current directory to a path on HDFS or on the local filesystem.
So I am looking for a way to create a directory structure matching what the third-party libraries expect and to copy the required files there. This needs to happen on each node. I am working on Spark 2.2.0.
Please help me achieve this.
That path looks like the Spark container's working directory. Am I correct?
It is derived from the YARN configuration property yarn.nodemanager.local-dirs.
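To see why the library ends up looking there, note that a relative name like "resdata" is resolved against the JVM's working directory (`user.dir`), which on YARN is the per-attempt container directory. A minimal sketch of that resolution (the file name `config.txt` and its contents are made up for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ResdataDemo {
    public static void main(String[] args) throws IOException {
        // A relative path is resolved against the current working directory,
        // which on a YARN executor is the container directory.
        System.out.println("working dir = " + System.getProperty("user.dir"));

        // Simulate what the third-party library effectively does: open files
        // under "resdata" relative to wherever the JVM was started.
        Path resdata = Paths.get("resdata");
        Files.createDirectories(resdata);
        Files.write(resdata.resolve("config.txt"), "hello".getBytes());

        // Reading it back via the same relative path succeeds only because
        // the directory exists under the working directory.
        String content = new String(
                Files.readAllBytes(Paths.get("resdata", "config.txt")));
        System.out.println(content);
    }
}
```

This is why shipping the files so that they land inside each container's working directory makes the hard-coded relative path work without changing the library.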
Out of the box, Spark provides ways to copy data into this directory via the --files, --jars and --archives arguments of the spark-submit command.
You can read more about those here:
That said, if you want a directory named resdata to appear there, zip the files you want inside it and pass the archive with a # alias; on YARN the archive is extracted under that name in each container's working directory:
spark-submit ... --archives resdata.zip#resdata ...
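Putting it together, a sketch of the packaging and submit steps (the main class and jar names are hypothetical placeholders):

```shell
# Package the files so they sit at the top level of the zip,
# since the archive is extracted directly into the alias directory.
cd resdata && zip -r ../resdata.zip . && cd ..

# Ship the archive to every executor; the '#resdata' fragment is the
# directory name it is unpacked under in the container working directory
# (YARN mode only in Spark 2.2).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives resdata.zip#resdata \
  --class com.example.MyWrapper \
  my-wrapper.jar
```

The third-party library can then open ./resdata relative to its working directory on each node, with no code changes.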