Created 06-08-2016 01:26 PM
Hi,
One of our Spark applications depends on a local file for some of its business logic.
We can read the file by referring to it with a file:/// URI. But for this to work, a copy of the file needs to be on every worker, or every worker needs access to a common shared drive, such as an NFS mount.
Is there any other way of achieving this?
Created 06-08-2016 02:58 PM
spark-submit provides the --files flag to upload files to the executors' execution directories. This works well if you have small files that do not change.
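For example (a rough sketch; the file name, main class, and jar are just placeholders), the file travels with the job and can be resolved by name on the executors:

// Hypothetical submit command:
//   spark-submit --files /local/path/lookup.csv --class com.example.App app.jar
import org.apache.spark.SparkFiles

// Files shipped with --files can be located via SparkFiles:
val lookupPath = SparkFiles.get("lookup.csv")
val lookupLines = scala.io.Source.fromFile(lookupPath).getLines().toList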
Alternatively, as the others have suggested, put it in HDFS.
Created 06-08-2016 01:30 PM
You can place the file on HDFS and access the file through "hdfs:///path/file".
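For example (assuming the file was first copied up with hdfs dfs -put):

// hdfs dfs -put data.txt /path/file
val rdd = sc.textFile("hdfs:///path/file")
rdd.take(5).foreach(println)  // quick sanity check on the driver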
Created 06-08-2016 01:33 PM
Thanks for the suggestion @Jitendra Yadav. But since the file is small (~500 KB), I was wondering whether we really need to load it into HDFS. I was looking for some "hack".
Created 06-08-2016 01:41 PM
@akeezhadath - depending on how you are using the file, you could consider broadcast variables (http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables). However, if the data fits well into the RDD construct, then you might be better off loading it as normal (sc.textFile("file://some-path")).
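A minimal sketch of the broadcast approach, assuming a small key/value lookup file (the path, the comma-separated format, and someRdd are placeholders, not from your setup):

// Read the local file on the driver, then broadcast its contents to all executors.
val lookup: Map[String, String] =
  scala.io.Source.fromFile("/local/path/lookup.csv").getLines()
    .map { line => val Array(k, v) = line.split(","); k -> v }
    .toMap
val lookupBc = sc.broadcast(lookup)

// Executors read the broadcast value instead of the file itself.
val enriched = someRdd.map(key => (key, lookupBc.value.getOrElse(key, "unknown")))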
Created 06-08-2016 01:55 PM
@clukasik, thank you, I have had a look at broadcast variables. But I guess with the current requirement, I just need the RDD.
Created 06-08-2016 01:47 PM
Kindly use the API below to cache the file on all the nodes.
SparkContext.addFile()
Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.
A directory can be given if the recursive option is set to true. Currently directories are only supported for Hadoop-supported filesystems.
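A short sketch of the pattern (the file path is a placeholder):

import org.apache.spark.SparkFiles

// Driver side: ship the local file to every node once.
sc.addFile("/local/path/data.txt")

// Executor side: resolve the per-node download location and read it.
val counts = sc.parallelize(1 to 4).mapPartitions { iter =>
  val path = SparkFiles.get("data.txt")
  val numLines = scala.io.Source.fromFile(path).getLines().size
  iter.map(i => (i, numLines))
}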
Created 06-08-2016 01:58 PM
Thanks @Jitendra Yadav. I will take a look at the addFile API. I would also like to try handling the file on the driver, as @clukasik pointed out.
Created 06-08-2016 01:33 PM
@akeezhadath Spark assumes your file is on HDFS by default if you have not specified any URI scheme (file:///, hdfs://, s3://), so if your file is on HDFS, you can reference it using an absolute path like
sc.textFile("/user/xyz/data.txt")
Created 06-08-2016 01:36 PM
@Rajkumar Singh: Yes, but here the file resides on the machine where we trigger spark-submit. So I was wondering if there is any way to read it on the driver without actually having to move it to all the workers or even to HDFS.
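In other words, something like this (a rough sketch; the path is just a placeholder):

// Read the ~500 KB file on the driver with plain Scala I/O...
val lines = scala.io.Source.fromFile("/local/path/data.txt").getLines().toList
// ...and let parallelize ship the contents to the executors, so the file
// never has to exist on the workers or in HDFS.
val rdd = sc.parallelize(lines)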
Created 06-08-2016 01:40 PM
Is it a single file or multiple small files?