I am using Spark to train a model and save it to the file system using org.apache.spark.ml.PipelineModel.save().
This works fine when I run Spark in local mode. When I run it in standalone mode with 2 workers, the model trains correctly, but saving writes only a partial result. The file system I am using is the local one, i.e. file://.
Could someone please point out what the issue is here?
Also, is it a problem to store a trained PipelineModel on the local file system?
The same code works when I use HDFS.
The Spark version I am using is 2.0.0.
Thanks in advance.
My guess is that each worker writes its part of the model to file://model-path/model-part on its own machine. So maybe each of the two worker machines holds only a part of the model?
With HDFS the model-path is the same for every worker, and hence the model is saved completely. For a distributed system to store and load data, all workers need to be able to access the same data under the same path. That's why a distributed file system is the usual recommendation. Ignoring performance, replication and so on, you should also be able to mount and use the same path on a network file system (SAN, NAS, NFS, ...); however, this is not recommended.
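To make the guess above concrete, here is a small simulation in plain Python. It does not use Spark at all: the worker names, the partition contents, and the helpers save_parts and list_parts are all made up for illustration, and it only mimics the idea that each worker writes part files under whatever root it can see. It is a sketch of the reasoning, not of what Spark does internally.

```python
import tempfile
from pathlib import Path


def save_parts(parts, root_for_worker):
    """Each (simulated) worker writes the partitions it holds under the root it sees."""
    for worker, partitions in parts.items():
        out_dir = Path(root_for_worker(worker)) / "model" / "data"
        out_dir.mkdir(parents=True, exist_ok=True)
        for index, payload in partitions.items():
            (out_dir / f"part-{index:05d}").write_text(payload)


def list_parts(root):
    """Return the part-file names visible under one root."""
    return sorted(p.name for p in (Path(root) / "model" / "data").glob("part-*"))


# Hypothetical layout: two workers, each holding two partitions of the model data.
parts = {
    "worker-1": {0: "a", 1: "b"},
    "worker-2": {2: "c", 3: "d"},
}

with tempfile.TemporaryDirectory() as base:
    # Case 1: a file:// style path. The same path string resolves to a
    # different physical disk on every machine (simulated by separate dirs).
    local_roots = {worker: Path(base) / worker for worker in parts}
    save_parts(parts, lambda worker: local_roots[worker])
    local_view = {worker: list_parts(root) for worker, root in local_roots.items()}

    # Case 2: a shared file system. Every worker resolves the path to the
    # same location (simulated by a single directory).
    shared_root = Path(base) / "shared"
    save_parts(parts, lambda worker: shared_root)
    shared_view = list_parts(shared_root)

print("local disks:", local_view)
print("shared path:", shared_view)
```

In the first case no single machine ends up with all four part files, while in the second case the shared location holds the complete set, which matches the behaviour described for HDFS.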
Thank you for the response.
I agree with you completely, but my point is that even the partial result is not getting stored correctly.