
Loading Local File to Apache Spark

Super Collaborator

Hi,

One of our Spark applications depends on a local file for some of its business logic.

We can read the file by referring to it with a file:/// URI. But for this to work, a copy of the file needs to be on every worker, or every worker needs access to a common shared drive such as an NFS mount.
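For context, a minimal sketch of what we do today (the path is just an example):

# Works only if /data/lookup.csv exists at this exact path on every worker node
rdd = sc.textFile("file:///data/lookup.csv")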

Is there any other way of achieving this?

1 ACCEPTED SOLUTION

Master Guru

spark-submit provides the --files flag to upload files to the executors' working directories. This works well if you have small files that do not change.

Alternatively, as others have suggested, put it in HDFS.
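For illustration, a minimal sketch of the --files approach (the file name lookup.csv and the application name my_app.py are assumptions):

# Ship the file with the job at submit time:
#   spark-submit --files /local/path/lookup.csv my_app.py
from pyspark import SparkFiles

# Resolve the local copy that Spark places in the working directory
# of the driver and of each executor.
lookup_path = SparkFiles.get("lookup.csv")
with open(lookup_path) as f:
    lookup_data = f.read()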



Super Guru

@Benjamin Leonhardi how does --files differ from SparkContext.addFile(), apart from the way we use them?

Super Collaborator

The difference is noticeable only when we run in cluster mode without knowing in advance where the driver will be. Otherwise, if we know where the driver is set to launch, both methods behave similarly.

--files is a submit-time parameter; main() can run anywhere and just needs to know the file name. In code, I can refer to the file with a file:// call.

In the case of addFile(), since this is a code-level setting, main() needs to know the file location in order to perform the add. As per the API doc, the path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. A sketch is shown below.
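A minimal sketch of the addFile() variant (the HDFS path is an assumption):

from pyspark import SparkFiles

# Driver side: register the file; per the API doc the path can be local,
# HDFS (or another Hadoop-supported filesystem), or an HTTP/HTTPS/FTP URI.
sc.addFile("hdfs:///data/lookup.csv")

def first_line(_):
    # Executor side: resolve the locally downloaded copy by its file name.
    with open(SparkFiles.get("lookup.csv")) as f:
        return f.readline().strip()

# Every task reads its own local copy of the shipped file.
print(sc.parallelize(range(2), 2).map(first_line).collect())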

New Contributor

Try this to access a local file in YARN mode:

import subprocess

# Read the driver-local file with a shell command (runs on the driver only)
rdata = subprocess.check_output("cat /home/xmo3l1n2/pyspark/data/yahoo_stocks.csv", shell=True)

# Split into lines and distribute them as an RDD
# (on Python 3, decode the bytes first: rdata.decode('utf-8').split('\n'))
pdata = sc.parallelize(rdata.split('\n'))