
Loading Local File to Apache Spark

Super Collaborator

Hi,

One of our Spark applications depends on a local file for some of its business logic.

We can read the file by referring to it with a file:/// URI. But for this to work, a copy of the file needs to be on every worker, or every worker needs access to a common shared drive such as an NFS mount.
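
For reference, this is roughly how we read it today (the path is only a placeholder), which is why every worker needs its own copy:

val lookup = sc.textFile("file:///opt/app/lookup.csv")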

Is there any other way of achieving this?

1 ACCEPTED SOLUTION

Master Guru

spark-submit provides the --files flag to upload files to the executors' working directories. This works well if you have small files that do not change.

Alternatively, as the others have suggested, put it in HDFS.
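
A rough sketch of the --files route, assuming a small lookup file named lookup.csv (all names and paths here are placeholders). Submit with something like spark-submit --files /local/path/lookup.csv ... my-app.jar, and inside the job each executor can resolve its shipped copy:

import org.apache.spark.SparkFiles

// files passed with --files land in each executor's working directory;
// SparkFiles.get returns the absolute path to that local copy
val lookup = scala.io.Source.fromFile(SparkFiles.get("lookup.csv")).getLines().toSet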


22 REPLIES

Super Guru
@akeezhadath

You can place the file on HDFS and access the file through "hdfs:///path/file".
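
For example, after copying the file up with hdfs dfs -put (the path is a placeholder):

val lookup = sc.textFile("hdfs:///data/lookup.csv")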

Super Collaborator

Thanks for the suggestion @Jitendra Yadav. But since the file is small (~500 KB), I was wondering whether we really need to load it into HDFS. I was looking for some "hack".

Super Collaborator

@akeezhadath - depending on how you are using the file, you could consider broadcast variables (http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables). However, if the data fits well into the RDD construct, then you might be better off loading it as normal (sc.textFile("file:///some-path")).
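
A minimal sketch of the broadcast route, assuming the file can be read on the driver first (the path and someRdd are placeholders):

// read the small file once on the driver
val lookup = scala.io.Source.fromFile("/local/path/lookup.csv").getLines().toSet
// ship it to every executor as a read-only broadcast variable
val lookupBc = sc.broadcast(lookup)
// use it inside transformations via .value
val matched = someRdd.filter(rec => lookupBc.value.contains(rec))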

Super Collaborator

@clukasik, thank you. I have had a look at broadcast variables, but I guess with the current requirement I just need the RDD.

Super Guru

@akeezhadath

Kindly use the API below to cache the file on all the nodes.

SparkContext.addFile()

Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.

A directory can be given if the recursive option is set to true. Currently directories are only supported for Hadoop-supported filesystems.
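
A small sketch of how the two calls fit together (the file name and someRdd are placeholders):

import org.apache.spark.SparkFiles

// driver side: ship the local file to every node in the cluster
sc.addFile("/local/path/lookup.csv")

// executor side: resolve the downloaded copy and use it
val matched = someRdd.mapPartitions { iter =>
  val lookup = scala.io.Source.fromFile(SparkFiles.get("lookup.csv")).getLines().toSet
  iter.filter(lookup.contains)
}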

Super Collaborator

Thanks @Jitendra Yadav. I will take a look at the addFile API. I would also like to try handling the file on the driver, as @clukasik pointed out.

Super Guru

@akeezhadath Spark assumes your file is on HDFS by default if you have not specified a URI scheme (file:///, hdfs://, s3://). So if your file is on HDFS, you can reference it with an absolute path like

sc.textFile("/user/xyz/data.txt")
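
For comparison (assuming fs.defaultFS points at HDFS, as it normally does on a cluster), the scheme just makes the intent explicit:

sc.textFile("/user/xyz/data.txt")          // default filesystem, i.e. HDFS
sc.textFile("hdfs:///user/xyz/data.txt")   // the same file, scheme spelled out
sc.textFile("file:///home/xyz/data.txt")   // a local path on each worker node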

Super Collaborator

@Rajkumar Singh: Yes, but here the file resides on the machine where we trigger spark-submit. So I was looking for a way to read it on the driver without actually having to move it to all the workers or even to HDFS.
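
For what it's worth, this is the kind of thing I had in mind, reading the file only on the driver and then handing it to the cluster (this assumes client deploy mode, so the driver runs on the machine that has the file; the path is a placeholder):

// runs on the driver only; the file just needs to exist on the submitting machine
val lines = scala.io.Source.fromFile("/local/path/lookup.csv").getLines().toSeq

// hand the contents to the cluster as an RDD ...
val lookupRdd = sc.parallelize(lines)
// ... or as a broadcast variable, whichever fits the business logic
val lookupBc = sc.broadcast(lines.toSet)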

Super Guru

Is it a single file or multiple small files?