
Loading Local File to Apache Spark

Super Collaborator

Hi,

One of our Spark applications depends on a local file for some of its business logic.

We can read the file by referring to it with a file:/// URI, but for this to work a copy of the file needs to be on every worker, or every worker needs access to a common shared drive such as an NFS mount.
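For context, this is roughly what we do today (the path and the textFile call are only an illustration), assuming spark-shell where sc is already defined:

    // Works only if this exact path exists on every worker (or on a shared NFS mount)
    val lines = sc.textFile("file:///mnt/shared/lookup.csv")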

Is there any other way of achieving this?

1 ACCEPTED SOLUTION

Master Guru

spark-submit provides the --files flag to ship files to the executors' working directories. It is a good fit if you have small files that do not change.

Alternatively, as the others have suggested, put it in HDFS.
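For example, a minimal sketch (the file name, class and jar below are placeholders):

    spark-submit --master yarn --deploy-mode cluster \
      --files /local/path/lookup.csv \
      --class com.example.MyApp my-app.jar

    // Inside the application, the shipped copy can be resolved on each node:
    import org.apache.spark.SparkFiles
    val localPath = SparkFiles.get("lookup.csv")

On YARN the file also lands in the container's working directory, so referring to it simply by its file name usually works as well.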


22 REPLIES

Super Collaborator

One single small file.

Super Guru

With spark-submit you can try passing the file to the driver using -Dapplication.properties.file=<file path on location>
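For example (the property is application-specific rather than a Spark setting, and the path is a placeholder):

    spark-submit \
      --driver-java-options "-Dapplication.properties.file=/path/to/app.properties" \
      --class com.example.MyApp my-app.jar

In client deploy mode the driver runs where spark-submit is launched, so a local path like this is visible to it.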

Super Collaborator

@Rajkumar Singh, doesn't the application.properties.file need to be in a key-value format?

Super Collaborator

If you are using yarn-client mode and the file resides where the driver JVM is running, then it should work using "file://". Otherwise, as Jitendra suggests, copy the file to HDFS.

Super Collaborator

Thanks @clukasik. Is there any performance difference in choosing client deploy mode over cluster mode? If I use the default client deploy mode, I get control over where my driver program runs. However, I wanted to be sure that it does not cause any performance issues.

Super Collaborator

I don't think there would be a performance difference. Of course, if you are using "collect()" or some such method that aggregates data in the driver JVM, you will have to be mindful of driver-related properties and settings (e.g. --driver-memory). @Jitendra Yadav - do you see any performance concerns with client vs cluster?
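As an aside, here is the kind of thing I mean, if the job does pull data back to the driver (the memory value, class, jar and RDD name are arbitrary):

    spark-submit --deploy-mode client --driver-memory 4g --class com.example.MyApp my-app.jar

    // Inside the job: collect() materializes the whole RDD in the driver JVM,
    // so the driver heap has to be sized for it
    val everything = someRdd.collect()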

Super Guru

@clukasik I don't see any performance issue running it in yarn-client mode. However, as per the initial info, they need a distributed-cache-like mechanism in Spark, which they can achieve through SparkContext.addFile().
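A minimal sketch of that approach, assuming spark-shell where sc is already defined (the path, file name and someRdd are placeholders):

    import org.apache.spark.SparkFiles

    // Driver side: register the local file; Spark ships it to every node that runs tasks
    sc.addFile("/local/path/lookup.csv")

    // Executor side (inside a task): resolve the node-local copy and use plain file I/O
    val result = someRdd.map { record =>
      val localPath = SparkFiles.get("lookup.csv")
      // ... read localPath and apply the business logic to the record ...
      record
    }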

Super Collaborator

Thank You @clukasik and @Jitendra Yadav. Appreciate your help.

Master Guru

spark-submit provides the --files flag to ship files to the executors' working directories. It is a good fit if you have small files that do not change.

Alternatively, as the others have suggested, put it in HDFS.

Super Collaborator

@Benjamin Leonhardi Thanks for pointing this out. I had overlooked this flag.