One of our Spark applications depends on a local file for some of its business logic.
We can read the file by referring to it with a file:/// URI, but for this to work, a copy of the file needs to be on every worker, or every worker needs access to a common shared drive, such as an NFS mount.
Is there any other way of achieving this?
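For reference, this is roughly what we do today (a minimal sketch; the path and file name are hypothetical):

    // Requires lookup.csv at this exact path on the driver and on every worker
    val lookup = sc.textFile("file:///data/lookup.csv")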
spark-submit provides the --files flag to upload files to the executors' working directories. This works well if you have small files that do not change.
Alternatively, as others have suggested, put the file in HDFS. Both options are sketched below.
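A minimal sketch of both options, assuming sc is your SparkContext; the file names, paths, and main class are hypothetical:

    // Option 1: ship the file with the job at submit time (shell command):
    //   spark-submit --files /local/path/lookup.csv --class my.Main my-app.jar
    // Inside the job, resolve the localized copy by its name:
    import org.apache.spark.SparkFiles
    val localizedPath = SparkFiles.get("lookup.csv")

    // Option 2: upload the file to HDFS first (shell command):
    //   hdfs dfs -put /local/path/lookup.csv /data/lookup.csv
    val fromHdfs = sc.textFile("hdfs:///data/lookup.csv")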
@akeezhadath - depending on how you are using the file, you could consider broadcast variables (http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables). However, if the data fits well into the RDD construct, you might be better off loading it as normal (sc.textFile("file://some-path")).
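A minimal sketch of the broadcast approach, assuming the file is small enough to load on the driver; the path and someRdd (a hypothetical RDD[String] of keys) are illustrative:

    import scala.io.Source

    // Read the file on the driver with plain file I/O
    val lines = Source.fromFile("/local/path/lookup.csv").getLines().toVector

    // Build a small lookup map and broadcast it to every executor once
    val lookupBc = sc.broadcast(lines.map(_.split(",")).map(a => a(0) -> a(1)).toMap)

    // Use the broadcast value inside transformations
    val enriched = someRdd.map(key => (key, lookupBc.value.getOrElse(key, "unknown")))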
Kindly use the SparkContext.addFile API to cache the file on all the nodes. From the Spark documentation:
Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.
A directory can be given if the recursive option is set to true. Currently directories are only supported for Hadoop-supported filesystems.
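A minimal sketch of that approach; the path, file name, and someRdd (a hypothetical RDD[String]) are illustrative:

    import org.apache.spark.SparkFiles
    import scala.io.Source

    // Distribute the driver-local file to every node
    sc.addFile("/local/path/lookup.csv")

    // Resolve the node-local copy inside each task with SparkFiles.get
    val filtered = someRdd.mapPartitions { iter =>
      val keys = Source.fromFile(SparkFiles.get("lookup.csv")).getLines().toSet
      iter.filter(keys.contains)
    }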
@akeezhadath Spark assumes your file is on HDFS by default if you have not specified a URI scheme (file:///, hdfs://, s3://), so if your file is on HDFS you can reference it using an absolute path, like the sketch below.
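For instance (the HDFS path is hypothetical):

    // No URI scheme, so Spark resolves the path against the default filesystem (HDFS)
    val rdd = sc.textFile("/user/hadoop/lookup.csv")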
@Rajkumar Singh: Yes, but here the file resides on the machine where we trigger the spark-submit. So I was looking for a way to read it in the driver without actually having to move it to all the workers or even to HDFS.
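If only the driver needs the file, plain file I/O is enough in client deploy mode, where the driver runs on the submitting machine. A minimal sketch under that assumption; the path is hypothetical:

    import scala.io.Source

    // Plain JVM file I/O; runs on the driver only, so the file never
    // has to be copied to the workers or to HDFS
    val rules = Source.fromFile("/local/path/business-rules.conf").getLines().toList

    // If workers later need the data, parallelize or broadcast it
    val rulesRdd = sc.parallelize(rules)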