
read/write files on multiple file systems in Spark cluster mode from driver -> executors -> driver


The Spark SQL Guide → Data Sources → Generic Load/Save Functions describes a very simple load of an example file from the local file system, using Spark in local mode.


I am looking for a Scala example that demonstrates a workflow exercising reads and writes to different file systems in Spark cluster mode (HDFS).  For example:

  1. The driver loads an input file from the local file system.
  2. A simple column is added using lit(), and the resulting DataFrame is stored to HDFS in cluster mode.
  3. A small subset of that DataFrame (a few records) is written back to the driver's local file system, deliberately avoiding the large-file-on-driver scenario (an anti-pattern this question is explicitly steering clear of).
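A minimal sketch of those three steps, assuming hypothetical paths (`file:///home/me/...`, `hdfs://namenode:8020/...` are placeholders for your environment) and a `spark-submit` where the input file is reachable from the driver:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import java.nio.file.{Files, Paths}

// Hypothetical locations -- the scheme prefix decides the file system,
// regardless of whether Spark was launched in local or cluster mode.
val localInput = "file:///home/me/input.csv"           // driver's local file system
val hdfsOutput = "hdfs://namenode:8020/user/me/table"  // cluster storage

val spark = SparkSession.builder().appName("multi-fs-example").getOrCreate()

// 1. Load the input file via an explicit file:// URI.
val df = spark.read.option("header", "true").csv(localInput)

// 2. Add a literal column and persist the full DataFrame to HDFS.
val enriched = df.withColumn("batch_id", lit("2024-01"))
enriched.write.mode("overwrite").parquet(hdfsOutput)

// 3. Bring a small subset back to the driver and write it locally.
//    Note: in cluster mode, df.write with a file:// path lands on *executor*
//    disks, not the driver's -- so for a genuinely small result, collect()
//    on the driver and write with plain JVM I/O instead.
val rows = enriched.limit(100).collect().map(_.mkString(","))
Files.write(Paths.get("/home/me/summary.csv"),
            (enriched.columns.mkString(",") +: rows).mkString("\n").getBytes)

spark.stop()
```

The `collect()` in step 3 is safe only because the subset is explicitly small; it is the inverse of the large-file anti-pattern above.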


The examples I found on the internet only use simple paths without explicit URI prefixes.

Without an explicit URI prefix, a "filepath" inherits its file system from how Spark was launched: the same path is read from/written to the local file system in local standalone mode, but to HDFS in cluster mode.
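That resolution rule can be illustrated with a small pure-Scala helper (the helper name and `fs.defaultFS` values here are illustrative assumptions, not Hadoop API): a path without a scheme is resolved against the configured default file system, while an explicit scheme always wins.

```scala
import java.net.URI

// Hypothetical helper mirroring how a scheme-less path falls back to the
// default file system (as Hadoop's fs.defaultFS does).
def resolve(defaultFs: String, path: String): String = {
  val uri = new URI(path)
  if (uri.getScheme != null) path // explicit scheme: use the path as-is
  else defaultFs.stripSuffix("/") + "/" + path.stripPrefix("/")
}

resolve("hdfs://namenode:8020", "/data/input.csv")   // → "hdfs://namenode:8020/data/input.csv"
resolve("file:///", "/data/input.csv")               // → "file:///data/input.csv"
resolve("hdfs://namenode:8020", "file:///tmp/x.csv") // → "file:///tmp/x.csv" (explicit scheme wins)
```

This is why the same source code reads locally in local mode but from HDFS in cluster mode unless the URI prefix is spelled out.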


There are situations where a Spark program needs to deal with both the local file system and cluster storage (big data) in the same application, such as producing a summary table stored on the driver's local file system at the end.


If there is existing Spark documentation that provides examples of the different URIs, I am happy to accept that as well.


I am also open to a pro/con discussion of other patterns, knowing that writing a "large" file sequentially to the driver's local file system is a concern; that is exactly the scenario this question is explicitly avoiding.


For example: patterns such as using hdfs getmerge as an extra step outside the Spark application flow to download the small DataFrame.  The main challenge of this pattern is passing state from the Spark application to an auxiliary program such as a Bash script: that program needs to detect whether the Spark application terminated normally or threw errors/exceptions, and the HDFS path of the table may be runtime dependent, so that information also needs to be passed along.
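One alternative that avoids the state-passing problem is to do the download inside the Spark application itself, using Hadoop's FileSystem API from the driver after the table has been written. A sketch with hypothetical paths:

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("copy-to-driver").getOrCreate()

// Hypothetical locations -- adjust to your cluster.
val hdfsTable = new Path("hdfs://namenode:8020/user/me/summary_table")
val localDest = new Path("file:///home/me/summary_table")

// Runs on the driver, inside the same application -- so success/failure and
// the runtime-dependent HDFS path never need to be handed to an external
// bash step; a failure here simply fails the Spark application.
val fs = FileSystem.get(new URI("hdfs://namenode:8020"),
                        spark.sparkContext.hadoopConfiguration)
fs.copyToLocalFile(false, hdfsTable, localDest) // false = keep the HDFS copy

spark.stop()
```

This copies the table directory (including its part files) rather than concatenating them; `FileUtil.copyMerge` could merge them into one file on Hadoop 2.x, but it was removed in Hadoop 3, so a per-part copy or a driver-side concatenation is the portable route.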