
Loading Local File to Apache Spark

Super Collaborator

Hi,

One of our Spark applications depends on a local file for some of its business logic.

We can read the file by referring to it with a file:/// URI, but for this to work a copy of the file needs to be on every worker, or every worker needs access to a common shared drive such as an NFS mount.
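For context, this is roughly what we do today (the path and the textFile call are only an illustration), assuming spark-shell where sc is already defined:

    // Works only if this exact path exists on every worker (or on a shared NFS mount)
    val lines = sc.textFile("file:///mnt/shared/lookup.csv")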

Is there any other way of achieving this?

1 ACCEPTED SOLUTION

Master Guru

spark-submit provides the --files flag to ship files to the executors' working directories. It is a good fit if you have small files that do not change.

Alternatively, as the others have suggested, put it in HDFS.
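For example, a minimal sketch (the file name, class and jar below are placeholders):

    spark-submit --master yarn --deploy-mode cluster \
      --files /local/path/lookup.csv \
      --class com.example.MyApp my-app.jar

    // Inside the application, the shipped copy can be resolved on each node:
    import org.apache.spark.SparkFiles
    val localPath = SparkFiles.get("lookup.csv")

On YARN the file also lands in the container's working directory, so referring to it simply by its file name usually works as well.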


22 REPLIES

Super Collaborator

One single small file.

Super Guru

With spark-submit you can try passing the file to the driver using -Dapplication.properties.file=<file path on location>
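For example (the property is application-specific rather than a Spark setting, and the path is a placeholder):

    spark-submit \
      --driver-java-options "-Dapplication.properties.file=/path/to/app.properties" \
      --class com.example.MyApp my-app.jar

In client deploy mode the driver runs where spark-submit is launched, so a local path like this is visible to it.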

Super Collaborator

@Rajkumar Singh, doesn't the application.properties.file need to be in a key-value format?

Super Collaborator

If you are using yarn-client mode and the file resides where the driver JVM is running, then it should work using "file://". Otherwise, as Jitendra suggests, copy the file to HDFS.

Super Collaborator

Thanks @clukasik. Is there any performance difference in choosing client deploy mode over cluster mode? If I use the default client deploy mode, I get control over where my driver program runs. However, I wanted to be sure that it does not cause any performance issues.

Super Collaborator

I don't think there would be a performance difference. Of course, if you are using "collect()" or some such method that aggregates data in the driver JVM, you will have to be mindful of driver-related properties and settings (e.g. --driver-memory). @Jitendra Yadav - do you see any performance concerns with client vs cluster?
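As an aside, here is the kind of thing I mean, if the job does pull data back to the driver (the memory value, class, jar and RDD name are arbitrary):

    spark-submit --deploy-mode client --driver-memory 4g --class com.example.MyApp my-app.jar

    // Inside the job: collect() materializes the whole RDD in the driver JVM,
    // so the driver heap has to be sized for it
    val everything = someRdd.collect()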

Super Guru

@clukasik I don't see any performance issue running it in yarn-client mode. However, as per the initial info, they need a distributed-cache-like mechanism in Spark, which they can achieve through SparkContext.addFile().
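A minimal sketch of that approach, assuming spark-shell where sc is already defined (the path, file name and someRdd are placeholders):

    import org.apache.spark.SparkFiles

    // Driver side: register the local file; Spark ships it to every node that runs tasks
    sc.addFile("/local/path/lookup.csv")

    // Executor side (inside a task): resolve the node-local copy and use plain file I/O
    val result = someRdd.map { record =>
      val localPath = SparkFiles.get("lookup.csv")
      // ... read localPath and apply the business logic to the record ...
      record
    }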

Super Collaborator

Thank You @clukasik and @Jitendra Yadav. Appreciate your help.

Master Guru

spark-submit provides the --files flag to ship files to the executors' working directories. It is a good fit if you have small files that do not change.

Alternatively, as the others have suggested, put it in HDFS.

Super Collaborator

@Benjamin Leonhardi Thanks for pointing this out. I had overlooked this flag.