
Loading Local File to Apache Spark

Super Collaborator

Hi,

One of our Spark applications depends on a local file for some of its business logic.

We can read the file by referring to it with a file:/// URI. But for this to work, a copy of the file needs to be on every worker, or every worker needs access to a common shared drive such as an NFS mount.
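For context, a minimal sketch of what we do today (the path is just an example):

# Works only if /data/lookup.csv exists at this exact path on every worker node
rdd = sc.textFile("file:///data/lookup.csv")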

Is there any other way of achieving this?

1 ACCEPTED SOLUTION

Master Guru

spark-submit provides the --files flag to upload files to the executors' working directories. This works well if you have small files that do not change.

Alternatively, as others have suggested, put it in HDFS.
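For illustration, a minimal sketch of the --files approach (the file name lookup.csv and the application name my_app.py are assumptions):

# Ship the file with the job at submit time:
#   spark-submit --files /local/path/lookup.csv my_app.py
from pyspark import SparkFiles

# Resolve the local copy that Spark places in the working directory
# of the driver and of each executor.
lookup_path = SparkFiles.get("lookup.csv")
with open(lookup_path) as f:
    lookup_data = f.read()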



Super Guru

@Benjamin Leonhardi how does --files differ from SparkContext.addFile(), apart from the way we use them?

Super Collaborator

The difference is noticeable only when we run in cluster mode without knowing in advance where the driver will be. Otherwise, if we know where the driver is set to launch, both methods behave similarly.

--files is a submit-time parameter; main() can run anywhere and just needs to know the file name. In code, I can refer to the file with a file:// call.

In the case of addFile(), since this is a code-level setting, main() needs to know the file location in order to perform the add. As per the API doc, the path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. A sketch is shown below.
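A minimal sketch of the addFile() variant (the HDFS path is an assumption):

from pyspark import SparkFiles

# Driver side: register the file; per the API doc the path can be local,
# HDFS (or another Hadoop-supported filesystem), or an HTTP/HTTPS/FTP URI.
sc.addFile("hdfs:///data/lookup.csv")

def first_line(_):
    # Executor side: resolve the locally downloaded copy by its file name.
    with open(SparkFiles.get("lookup.csv")) as f:
        return f.readline().strip()

# Every task reads its own local copy of the shipped file.
print(sc.parallelize(range(2), 2).map(first_line).collect())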

New Contributor

Try this to access a local file in YARN mode:

import subprocess

# Read the driver-local file with a shell command (runs on the driver only)
rdata = subprocess.check_output("cat /home/xmo3l1n2/pyspark/data/yahoo_stocks.csv", shell=True)

# Split into lines and distribute them as an RDD
# (on Python 3, decode the bytes first: rdata.decode('utf-8').split('\n'))
pdata = sc.parallelize(rdata.split('\n'))