Support Questions


Spark standalone mode: PipelineModel.save not storing the trained model properly

Contributor

Dear All,

I am using Spark to train a model and save it to the file system using org.apache.spark.ml.PipelineModel.save().

This works fine when I run Spark in local mode. But when I run Spark in standalone mode with 2 workers, it trains the model correctly but stores only a partial result when saving to file. The file system I am using is the local one, i.e. file://.

Could someone please point out what the issue is here?

Also, is it a problem to store the trained PipelineModel on the local file system?

FYI:

The same code works when I use the HDFS file system.

The Spark version I am using is 2.0.0.

Thanks in advance,

Param.

2 REPLIES

My guess here is that each worker writes the model to file://model-path/model-part on each of the two worker machines. So maybe there is a part of the model on both machines?

With HDFS the model-path is the same for every worker, and hence the model will be saved completely. So for a distributed system to store and load data, all workers need to be able to access the same data under the same path. That's why a distributed file system is the usual recommendation. Ignoring performance, replication and so on, you should also be able to mount and use the same path on a network file system (SAN, NAS, NFS, ...); however, this is not recommended.
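To make the guess above concrete, here is a small simulation in plain Python. It does not use Spark at all: the temporary directories worker1, worker2 and shared stand in for each machine's local disk or a common mount, and the helper names (write_part, list_parts, simulate) and the path /models/pipeline are made up for illustration. It only shows how the same path can resolve to different directories on different machines; it is not a reproduction of what PipelineModel.save() does internally.

```python
import os
import tempfile


def write_part(root, model_path, part_name, payload):
    """Write one part file under root, which stands in for one machine's disk."""
    target_dir = os.path.join(root, model_path.lstrip("/"))
    os.makedirs(target_dir, exist_ok=True)
    with open(os.path.join(target_dir, part_name), "w") as f:
        f.write(payload)


def list_parts(root, model_path):
    """List the part files that are visible under model_path from this root."""
    target_dir = os.path.join(root, model_path.lstrip("/"))
    return sorted(os.listdir(target_dir)) if os.path.isdir(target_dir) else []


def simulate(shared):
    """Two hypothetical workers each save one part; return what each one sees."""
    with tempfile.TemporaryDirectory() as base:
        if shared:
            # Assumption: both workers resolve the path to the same storage.
            roots = [os.path.join(base, "shared")] * 2
        else:
            # Assumption: each worker resolves the path to its own local disk.
            roots = [os.path.join(base, "worker1"), os.path.join(base, "worker2")]
        for i, root in enumerate(roots):
            write_part(root, "/models/pipeline", f"part-{i:05d}", f"stage data {i}")
        return [list_parts(root, "/models/pipeline") for root in roots]


local_view = simulate(shared=False)
shared_view = simulate(shared=True)
print("separate local disks:", local_view)
print("shared path:", shared_view)
```

With separate roots each simulated worker ends up holding only its own part file, while with a shared root both see the full set. That matches the behaviour I would expect from file:// versus a path every worker can reach, but again this is only a sketch of the idea.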

Contributor

Thank you for the response.

I agree with you completely, but my question is that even the partial result is not getting stored correctly.

Thanks,

Param.
