Created 06-08-2016 01:26 PM
Hi,
One of our Spark applications depends on a local file for some of its business logic.
We can read the file by referring to it as file:///, but for this to work a copy of the file needs to be on every worker, or every worker needs access to a common shared drive such as an NFS mount.
Is there any other way of achieving this?
Created 06-08-2016 02:58 PM
spark-submit provides the --files flag to upload files to the executors' working directories. This works well if you have small files that do not change.
Alternatively, as the others have suggested, put it in HDFS.
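A minimal sketch of both options; the file name lookup.csv, the app name app.py, and the paths are hypothetical placeholders:

```shell
# Ship a small static file to every executor's working directory.
# On the workers it is then readable by its bare name ("lookup.csv").
spark-submit --files /local/path/lookup.csv app.py

# Alternatively, put the file in HDFS so every node can read it
# through the hdfs:// scheme instead of a local path:
hdfs dfs -put /local/path/lookup.csv /data/lookup.csv
```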
Created 06-08-2016 01:44 PM
One single small file.
Created 06-08-2016 01:46 PM
With spark-submit you can try uploading the file to the driver using -Dapplication.properties.file=<file path on location>
Created 06-08-2016 02:08 PM
@Rajkumar Singh, doesn't the application.properties.file need to be in a key-value format?
Created 06-08-2016 01:33 PM
If you are using yarn-client mode and the file resides where the driver JVM is running, then it should work using "file://". Otherwise, as Jitendra suggests, copy the file to HDFS.
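A sketch of the driver-side pattern this implies, assuming yarn-client mode so the driver runs on the machine that holds the file; the path /etc/myapp/lookup.csv and the key/value layout of the file are hypothetical. The driver reads the local file itself and ships the contents to the executors as a broadcast variable, so no worker needs the file on its own disk:

```python
from pyspark import SparkContext

sc = SparkContext(appName="local-file-demo")

# The driver process reads the local file directly (works because in
# yarn-client mode the driver runs where this path exists)...
with open("/etc/myapp/lookup.csv") as f:
    lookup = dict(line.strip().split(",", 1) for line in f)

# ...and broadcasts it, so executors receive the data over the network
# rather than reading the file themselves.
lookup_bc = sc.broadcast(lookup)

result = sc.parallelize(["a", "b"]).map(lambda k: lookup_bc.value.get(k))
```

Note that passing a file:// path straight into sc.textFile() would not help here, since it is the executors, not the driver, that read RDD input paths.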
Created 06-08-2016 01:53 PM
Thanks @clukasik. Is there any performance difference in choosing client deploy mode over cluster mode? If I use the default client deploy mode, I get control over where my driver program runs. However, I wanted to be sure that it does not cause any performance issue.
Created 06-08-2016 02:19 PM
I don't think there would be a performance difference. Of course, if you are using collect() or some such method that aggregates data in the driver JVM, you will have to be mindful of driver-related properties and settings (e.g. --driver-memory). @Jitendra Yadav - do you see any performance concerns with client vs cluster?
Created 06-08-2016 02:34 PM
@clukasik I don't see any performance issue running in yarn-client mode. However, as per the initial info, they need a distributed-cache-like mechanism in Spark, which they can achieve through SparkContext.addFile()
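A minimal sketch of the SparkContext.addFile() approach; the path and file name lookup.csv are hypothetical, and this requires a running Spark installation:

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="addfile-demo")

# Distribute the file to every node in the cluster
# (Spark's analogue of Hadoop's distributed cache).
sc.addFile("/local/path/lookup.csv")

def lookup_value(key):
    # On each worker, SparkFiles.get() resolves the local copy
    # of the distributed file by its bare name.
    path = SparkFiles.get("lookup.csv")
    with open(path) as f:
        table = dict(line.strip().split(",", 1) for line in f)
    return table.get(key)

result = sc.parallelize(["a", "b"]).map(lookup_value)
```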
Created 06-08-2016 06:46 PM
Thank You @clukasik and @Jitendra Yadav. Appreciate your help.
Created 06-08-2016 03:05 PM
@Benjamin Leonhardi, thanks for pointing this out. I overlooked this flag.