
Spark standalone mode: PipelineModel.save not storing the trained model properly

Contributor

Dear All,

I am using Spark to train a model and save it to the file system using org.apache.spark.ml.PipelineModel.save().

This works fine when I run Spark in local mode. But when I run Spark in standalone mode with 2 workers, it trains the model correctly, yet when saving to file it stores only a partial result. The file system I am using is the local one, i.e. file://.

Could someone please point out what the issue is here?

And is it a problem to store the trained PipelineModel on the local file system?

FYI:

The same code works when I use the HDFS file system.

The Spark version I am using is 2.0.0.

Thanks in advance,

Param.

2 REPLIES

My guess here is that each worker writes its share of the model to file://model-path/model-part on its own local disk. So maybe there is a part of the model on each of the two worker machines?

With HDFS the model-path is the same for every node, and hence the model will be saved completely. So for a distributed system to store and load data, all workers need to be able to access the same data under the same path. That's why a distributed file system is the usual recommendation. Ignoring performance, replication and so on, you should also be able to mount and use the same path on a network file system (SAN, NAS, NFS, ...); however, this is not recommended.
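To make the "same path for all workers" point concrete, here is a minimal sketch in Python (PySpark), not a definitive fix. The helper names `is_cluster_visible` and `save_model` and the `SHARED_SCHEMES` set are made up for illustration; only `write().overwrite().save(path)` is the PySpark counterpart of the `PipelineModel.save()` call from the question. It simply refuses to save when the target path looks node-local.

```python
from urllib.parse import urlparse

# Assumption for this sketch: these URI schemes point at storage that every
# node in the cluster can reach. Adjust the set for your own deployment.
SHARED_SCHEMES = {"hdfs", "s3a"}


def is_cluster_visible(path):
    """Return True if the path's URI scheme is in SHARED_SCHEMES."""
    # "file:///tmp/model" has scheme "file"; a bare "/tmp/model" has scheme "".
    # Both are treated as node-local here, which is deliberately conservative:
    # a bare path is actually resolved against the configured default file system.
    return urlparse(path).scheme in SHARED_SCHEMES


def save_model(model, path):
    """Save a fitted model, refusing paths that look node-local.

    `model` is expected to be a fitted pyspark.ml.PipelineModel (hypothetical
    wrapper; any object exposing write().overwrite().save(path) will do).
    """
    if not is_cluster_visible(path):
        raise ValueError("refusing to save to a node-local path: " + path)
    # Every executor and the driver write under the same shared location.
    model.write().overwrite().save(path)
```

Usage would look like `save_model(fitted_model, "hdfs://namenode:8020/models/my-pipeline")`, where the host, port and directory are placeholders for your own cluster. With a `file://` path the helper raises instead of leaving pieces of the model scattered across the worker machines.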

Contributor

Thank you for the response,

I agree with you completely, but my question is that even the partial result is not getting stored correctly.

Thanks,

Param.
