I am using Spark to train a model and save it to the file system using org.apache.spark.ml.PipelineModel.save().
This works fine when I run Spark in local mode. But when I run Spark in standalone mode with 2 workers, the model is trained correctly, yet only a partial result is written when saving. The file system I am using is the local one, i.e. file://.
Could someone please point out what the issue is here?
And is it a problem to store the trained PipelineModel on the local file system?
My guess is that each worker writes its part of the model to file://model-path/model-part on its own machine. So there may be a part of the model on each of the two worker machines.
With HDFS, model-path refers to the same location for every worker, so the model is saved completely. For a distributed system to store and load data, all workers need to be able to access the same data under the same path. That's why a distributed file system is the usual recommendation. Ignoring performance, replication and so on, you should also be able to mount a network file system (SAN, NAS, NFS, ...) on every node and use the same path there; however, this is not recommended.
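As a rough sketch of saving to a shared location, something like the following should work. The HDFS URI (host, port and directory) and the SharedModelStore object are made up for illustration; replace the path with whatever location all of your nodes resolve to the same storage. It assumes a SparkSession is already running and that you already have a fitted PipelineModel.

```scala
import org.apache.spark.ml.PipelineModel

object SharedModelStore {
  // Hypothetical path: any URI that the driver and every worker
  // resolve to the same storage (HDFS here) would do.
  val modelPath = "hdfs://namenode:8020/models/my-pipeline"

  // Write the whole model under one shared path, replacing any
  // earlier copy that exists there.
  def save(model: PipelineModel, path: String = modelPath): Unit =
    model.write.overwrite().save(path)

  // Read the model back; needs an active SparkSession.
  def load(path: String = modelPath): PipelineModel =
    PipelineModel.load(path)
}
```

Calling SharedModelStore.save(model) on the driver and SharedModelStore.load() later should then see the complete model, because every part ends up under the same path no matter which worker wrote it.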