Loading Local File to Apache Spark

arunak — Wed, 08 Jun 2016 20:26:33 GMT

Hi,

One of our Spark applications depends on a local file for some of its business logic.

We can read the file by referring to it as file:///. But for this to work, a copy of the file needs to be on every worker, or every worker needs to have access to a common shared drive such as an NFS mount.

Is there any other way of achieving this?

Re: Loading Local File to Apache Spark

jyadav — Wed, 08 Jun 2016 20:30:22 GMT

@akeezhadath

You can place the file on HDFS and access the file through "hdfs:///path/file".

Re: Loading Local File to Apache Spark

arunak — Wed, 08 Jun 2016 20:33:09 GMT

Thanks for the suggestion, @Jitendra Yadav. But since the file is small (under ~500 KB), I was wondering whether we really need to load it into HDFS. I was looking for some kind of "hack".

Re: Loading Local File to Apache Spark

rajkumar_singh — Wed, 08 Jun 2016 20:33:41 GMT

@akeezhadath Spark resolves a path against the default filesystem (HDFS on a typical Hadoop cluster) if you have not specified a URI scheme (file:///, hdfs://, s3://). So if your file is on HDFS, you can reference it using an absolute path like

sc.textFile("/user/xyz/data.txt")

Re: Loading Local File to Apache Spark

clukasik — Wed, 08 Jun 2016 20:33:52 GMT

If you are using yarn-client mode and that file resides where the driver JVM is running, then it should work using "file://". Otherwise, as Jitendra suggests, copy the file to HDFS.

Re: Loading Local File to Apache Spark

arunak — Wed, 08 Jun 2016 20:36:24 GMT

@Rajkumar Singh: Yes, but here the file resides on the machine where we trigger spark-submit. So I was wondering whether there is a way to read it in the driver without actually having to move it to all the workers or even to HDFS.

Re: Loading Local File to Apache Spark

rajkumar_singh — Wed, 08 Jun 2016 20:40:38 GMT

Is it a single file or multiple small files?

Re: Loading Local File to Apache Spark

clukasik — Wed, 08 Jun 2016 20:41:02 GMT

@akeezhadath - depending on how you are using the file, you could consider broadcast variables (http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables). However, if the data fits well into the RDD construct, then you might be better off loading it as normal (sc.textFile("file:///some-path")).
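
To illustrate the broadcast option, here is a minimal PySpark sketch. It assumes the file holds simple key=value lines, which is my assumption and not something stated in this thread. The names load_lookup and tag_records are hypothetical helpers. The file is read with ordinary Python I/O on the driver, so it only has to exist on the machine running the driver.

```python
from pathlib import Path


def load_lookup(path):
    """Parse a small 'key=value' file on the driver into a plain dict."""
    lookup = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        lookup[key.strip()] = value.strip()
    return lookup


def tag_records(sc, records, lookup_path):
    """Broadcast the parsed file once and use it inside a transformation."""
    # load_lookup runs on the driver, so lookup_path only has to exist there.
    rules = sc.broadcast(load_lookup(lookup_path))
    return (sc.parallelize(records)
              .map(lambda key: (key, rules.value.get(key, "unknown")))
              .collect())
```

With an existing SparkContext sc, something like tag_records(sc, ["A", "B"], "/tmp/lookup.txt") would pair each record with its value from the file. The path here is only a placeholder.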

Re: Loading Local File to Apache Spark

arunak — Wed, 08 Jun 2016 20:44:39 GMT

One single small file.

Re: Loading Local File to Apache Spark

rajkumar_singh — Wed, 08 Jun 2016 20:46:37 GMT

With spark-submit, you can try uploading the file to the driver using -Dapplication.properties.file=<file path on location>

Re: Loading Local File to Apache Spark

jyadav — Wed, 08 Jun 2016 20:47:57 GMT

@akeezhadath

Please use the API below to cache the file on all the nodes.

SparkContext.addFile()

Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.

A directory can be given if the recursive option is set to true. Currently directories are only supported for Hadoop-supported filesystems.
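
As a rough sketch of how that might look in PySpark (the helper names and the partition count are made up for illustration, and sc is assumed to be an existing SparkContext):

```python
import os


def shipped_name(path):
    """SparkFiles.get() is keyed by the file's base name, not its full path."""
    return os.path.basename(path)


def count_lines_on_executors(sc, local_path, partitions=2):
    """Ship local_path with addFile(), then open the local copy in each task."""
    from pyspark import SparkFiles  # imported lazily; needs a Spark install

    sc.addFile(local_path)  # the path is resolved on the driver
    name = shipped_name(local_path)

    def count_lines(_):
        # Runs inside a task: SparkFiles.get returns the downloaded copy's path.
        with open(SparkFiles.get(name)) as fh:
            return sum(1 for _ in fh)

    return sc.parallelize(range(partitions), partitions).map(count_lines).collect()
```

Each task opens its own downloaded copy, so the original only needs to exist on the machine where the driver runs.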

Re: Loading Local File to Apache Spark

arunak — Wed, 08 Jun 2016 20:53:37 GMT

Thanks @clukasik. Is there any performance difference in choosing client deploy mode over cluster mode? If I use the default client deploy mode, I get control over where my driver program runs. However, I wanted to be sure that it does not cause any performance issues.

Re: Loading Local File to Apache Spark

arunak — Wed, 08 Jun 2016 20:55:46 GMT

@clukasik, thank you. I have had a look at broadcast variables, but I think for the current requirement I just need the RDD.

Re: Loading Local File to Apache Spark

arunak — Wed, 08 Jun 2016 20:58:25 GMT

Thanks @Jitendra Yadav. I will take a look at the addFile API. I would also like to try controlling where the driver runs, as clukasik pointed out.

Re: Loading Local File to Apache Spark

arunak — Wed, 08 Jun 2016 21:08:15 GMT

@Rajkumar Singh, doesn't the file passed via application.properties.file need to be in key-value format?

Re: Loading Local File to Apache Spark

clukasik — Wed, 08 Jun 2016 21:19:10 GMT

I don't think that there would be a performance difference. Of course, if you are using "collect()" or some such method that aggregates data in the driver JVM, you will have to be mindful of driver-related properties and settings (e.g. --driver-memory). @Jitendra Yadav - do you see any performance concerns with client vs cluster?

Re: Loading Local File to Apache Spark

jyadav — Wed, 08 Jun 2016 21:34:53 GMT

@clukasik I don't see any performance issue with running it in yarn-client mode. However, as per the initial info, they need something like a distributed cache in Spark, which they can achieve through SparkContext.addFile().

Re: Loading Local File to Apache Spark

bleonhardi — Wed, 08 Jun 2016 21:58:47 GMT

spark-submit provides the --files flag to upload files to the execution directories. This works well if you have small files that do not change.

Alternatively, as the others have suggested, put it in HDFS.
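
For the --files route, a small sketch of how the pieces could fit together. The script name job.py, the path /tmp/lookup.txt, and both helper functions are placeholders of mine. Where shipped files end up can depend on the cluster manager and Spark version, so treat this as a starting point and not a definitive recipe.

```python
def build_submit_command(app, files=(), master="yarn", deploy_mode="client"):
    """Assemble a spark-submit argument list that ships small local files."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if files:
        # --files takes a single comma-separated list of paths.
        cmd += ["--files", ",".join(files)]
    cmd.append(app)
    return cmd


def read_shipped_on_executor(name):
    """Meant to be called inside a task: open a shipped file by its base name."""
    from pyspark import SparkFiles  # imported lazily; needs a Spark install

    with open(SparkFiles.get(name)) as fh:
        return fh.read()


print(" ".join(build_submit_command("job.py", ["/tmp/lookup.txt"])))
# prints: spark-submit --master yarn --deploy-mode client --files /tmp/lookup.txt job.py
```

In client mode the driver can simply open the original local path, since it runs on the machine where spark-submit was invoked.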

Re: Loading Local File to Apache Spark

arunak — Wed, 08 Jun 2016 22:05:20 GMT

@Benjamin Leonhardi, thanks for pointing this out. I overlooked this flag.

Re: Loading Local File to Apache Spark

jyadav — Wed, 08 Jun 2016 22:14:35 GMT

@Benjamin Leonhardi, how does --files differ from SparkContext.addFile(), apart from the way we use them?