
Loading Local File to Apache Spark

Solved

Super Collaborator

Hi,

One of our Spark applications depends on a local file for some of its business logic.

We can read the file by referring to it with a file:/// URI. But for this to work, a copy of the file needs to be on every worker, or every worker needs access to a common shared drive, such as an NFS mount.

Is there any other way of achieving this?

1 ACCEPTED SOLUTION


Re: Loading Local File to Apache Spark

spark-submit provides the --files flag to upload files to the executors' working directories. This works well if you have small files that do not change.

Alternatively, as the others have suggested, put it in HDFS.


22 REPLIES

Re: Loading Local File to Apache Spark

@akeezhadath

You can place the file on HDFS and access the file through "hdfs:///path/file".

Re: Loading Local File to Apache Spark

Super Collaborator

Thanks for the suggestion @Jitendra Yadav. But since the file is small (under ~500 KB), I was wondering whether we really need to load it into HDFS. I was looking for some "hack".

Re: Loading Local File to Apache Spark

Expert Contributor

@akeezhadath - depending on how you are using the file, you could consider broadcast variables (http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables). However, if the data fits well into the RDD construct, then you might be better off loading it as normal (sc.textFile("file://some-path")).

Re: Loading Local File to Apache Spark

Super Collaborator

@clukasik, thank you. I have had a look at broadcast variables, but I guess with the current requirement I just need the RDD.

Re: Loading Local File to Apache Spark

@akeezhadath

Kindly use the API below to cache the file on all the nodes.

SparkContext.addFile()

Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.

A directory can be given if the recursive option is set to true. Currently directories are only supported for Hadoop-supported filesystems.

Re: Loading Local File to Apache Spark

Super Collaborator

Thanks @Jitendra Yadav. I will take a look at the addFile API. I would also like to try handling it on the driver, as @clukasik pointed out.

Re: Loading Local File to Apache Spark

@akeezhadath Spark assumes your file is on HDFS by default if you have not specified any URI scheme (file:///, hdfs://, s3://). So if your file is on HDFS, you can reference it using an absolute path like:

sc.textFile("/user/xyz/data.txt")

Re: Loading Local File to Apache Spark

Super Collaborator

@Rajkumar Singh: Yes, but here the file resides on the machine where we trigger the spark-submit. So I was looking for a way to read it in the driver without actually having to copy it to all the workers, or even to HDFS.

Re: Loading Local File to Apache Spark

Is it a single file or multiple small files?