One of our Spark applications depends on a local file for some of its business logic.
We can read the file by referring to it with a file:/// URI, but for this to work, a copy of the file needs to be on every worker, or every worker needs access to a common shared drive, such as an NFS mount.
Is there any other way of achieving this?
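For reference, this is roughly what we do today (a minimal sketch; the path and file name are hypothetical):

    // Requires lookup.csv at this exact path on the driver and on every worker
    val lookup = sc.textFile("file:///data/lookup.csv")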
spark-submit provides the --files flag to upload files to the executors' working directories. This works well if you have small files that do not change.
Alternatively, as others have suggested, put the file in HDFS. Both options are sketched below.
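A minimal sketch of both options, assuming sc is your SparkContext; the file names, paths, and main class are hypothetical:

    // Option 1: ship the file with the job at submit time (shell command):
    //   spark-submit --files /local/path/lookup.csv --class my.Main my-app.jar
    // Inside the job, resolve the localized copy by its name:
    import org.apache.spark.SparkFiles
    val localizedPath = SparkFiles.get("lookup.csv")

    // Option 2: upload the file to HDFS first (shell command):
    //   hdfs dfs -put /local/path/lookup.csv /data/lookup.csv
    val fromHdfs = sc.textFile("hdfs:///data/lookup.csv")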
@akeezhadath - depending on how you are using the file, you could consider broadcast variables (http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables). However, if the data fits well into the RDD construct, you might be better off loading it as normal (sc.textFile("file://some-path")).
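A minimal sketch of the broadcast approach, assuming the file is small enough to load on the driver; the path and someRdd (a hypothetical RDD[String] of keys) are illustrative:

    import scala.io.Source

    // Read the file on the driver with plain file I/O
    val lines = Source.fromFile("/local/path/lookup.csv").getLines().toVector

    // Build a small lookup map and broadcast it to every executor once
    val lookupBc = sc.broadcast(lines.map(_.split(",")).map(a => a(0) -> a(1)).toMap)

    // Use the broadcast value inside transformations
    val enriched = someRdd.map(key => (key, lookupBc.value.getOrElse(key, "unknown")))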
Kindly use the SparkContext.addFile API to cache the file on all the nodes. From the Spark documentation:
Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.
A directory can be given if the recursive option is set to true. Currently directories are only supported for Hadoop-supported filesystems.
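A minimal sketch of that approach; the path, file name, and someRdd (a hypothetical RDD[String]) are illustrative:

    import org.apache.spark.SparkFiles
    import scala.io.Source

    // Distribute the driver-local file to every node
    sc.addFile("/local/path/lookup.csv")

    // Resolve the node-local copy inside each task with SparkFiles.get
    val filtered = someRdd.mapPartitions { iter =>
      val keys = Source.fromFile(SparkFiles.get("lookup.csv")).getLines().toSet
      iter.filter(keys.contains)
    }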
@akeezhadath Spark assumes your file is on HDFS by default if you have not specified a URI scheme (file:///, hdfs://, s3://), so if your file is on HDFS you can reference it using an absolute path, like the sketch below.
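For instance (the HDFS path is hypothetical):

    // No URI scheme, so Spark resolves the path against the default filesystem (HDFS)
    val rdd = sc.textFile("/user/hadoop/lookup.csv")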
@Rajkumar Singh: Yes, but here the file resides on the machine where we trigger the spark-submit. So I was looking for a way to read it in the driver without actually having to move it to all the workers or even to HDFS.
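If only the driver needs the file, plain file I/O is enough in client deploy mode, where the driver runs on the submitting machine. A minimal sketch under that assumption; the path is hypothetical:

    import scala.io.Source

    // Plain JVM file I/O; runs on the driver only, so the file never
    // has to be copied to the workers or to HDFS
    val rules = Source.fromFile("/local/path/business-rules.conf").getLines().toList

    // If workers later need the data, parallelize or broadcast it
    val rulesRdd = sc.parallelize(rules)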