Created 06-08-2016 01:26 PM
Hi,
One of our Spark applications depends on a local file for some of its business logic.
We can read the file by referring to it with a file:/// URI, but for that to work a copy of the file needs to exist on every worker, or every worker needs access to a common shared drive such as an NFS mount.
Is there any other way of achieving this?
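For illustration, a rough sketch of what we do today (the path is made up); it only works because the same file is present on every worker:
from pyspark import SparkContext

sc = SparkContext(appName="local-file-example")

# file:/// is resolved on whichever node a task happens to run, so this
# path must exist on every worker (or on a shared NFS mount they all see).
lines = sc.textFile("file:///opt/app/config/lookup.csv")
print(lines.count())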
Created 06-08-2016 02:58 PM
spark-submit provides the --files flag to upload files to the executors' working directories; it is a good fit if you have small files that do not change.
Alternatively, as others have suggested, put it in HDFS.
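A hedged sketch of the --files route (job and file names are placeholders): ship the file at submit time, then resolve the shipped copy with SparkFiles.get() inside the job.
spark-submit --master yarn --deploy-mode cluster \
    --files /local/path/lookup.csv \
    my_job.py

# Inside my_job.py: files passed via --files are distributed with the job,
# and SparkFiles.get() returns the local path of the shipped copy.
from pyspark import SparkFiles

with open(SparkFiles.get("lookup.csv")) as f:
    lookup = set(f.read().splitlines())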
Created 06-08-2016 03:14 PM
@Benjamin Leonhardi how does --files differ from SparkContext.addFile(), apart from the way we use them?
Created 06-08-2016 06:56 PM
The difference is noticeable only when we run in cluster mode without knowing in advance where the driver will be launched. Otherwise, if we know where the driver runs, both methods behave similarly.
--files is a submit-time parameter; main() can run anywhere and only needs to know the file name. In code, I can then refer to the file with a file:// reference.
In the case of addFile(), since it is a code-level setting, main() needs to know the file location in order to perform the add. As per the API doc, the path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
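A minimal sketch of the addFile() route (file name and URI are illustrative): the driver registers the file, and each executor resolves its local copy with SparkFiles.get().
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="addfile-example")

# The driver must know the source location; a local path, an HDFS path,
# or an HTTP/HTTPS/FTP URI are all accepted by addFile().
sc.addFile("hdfs:///data/lookup.csv")

def filter_known(rows):
    # On each executor, SparkFiles.get() points at the downloaded copy.
    with open(SparkFiles.get("lookup.csv")) as f:
        known = set(f.read().splitlines())
    return (r for r in rows if r in known)

result = sc.parallelize(["a", "b", "c"]).mapPartitions(filter_known).collect()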
Created 06-16-2017 06:48 PM
Try this to access a local file in YARN mode:
import subprocess

# Read the whole file on the driver with a shell call, then distribute
# its lines as an RDD.
rdata = subprocess.check_output("cat /home/xmo3l1n2/pyspark/data/yahoo_stocks.csv", shell=True)
pdata = sc.parallelize(rdata.split('\n'))
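Note that this reads the entire file into driver memory before parallelize() distributes it, and it requires the file to be present on the node where the driver runs, so it is only practical for small files.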