Created on 01-21-2016 02:35 AM - edited 09-16-2022 02:58 AM
Hi, I'm experimenting with a small CDH virtual cluster and I'm facing issues with serializers.
More specifically, I'm trying things with the "pyspark.mllib.fpm.FPGrowth" class (Machine Learning). This is what I'm trying to do:
from pyspark.mllib.fpm import FPGrowth fp_model = FPGrowth.train(my_words_DF,0.3) freqset = sorted(fp_model.freqItemsets().collect())
At this point, according to what Serializer I have configured in Spark, I have 2 different outcomes.
If in "Cloudera Manager --> Spark --> Configuration --> Spark Data Serializer" I configure "org.apache.spark.serializer.JavaSerializer" then everyhting works FINE.
If in "Cloudera Manager --> Spark --> Configuration --> Spark Data Serializer" I configure "org.apache.spark.serializer.KryoSerializer" (which is the DEFAULT setting, by the way), when I collect the "freqItemsets" I get the following exception:
com.esotericsoftware.kryo.KryoException:
java.lang.IllegalArgumentException:
Can not set final scala.collection.mutable.ListBuffer field
org.apache.spark.mllib.fpm.FPTree$Summary.nodes
to scala.collection.mutable.ArrayBuffer
This exception is confirmed to be a consequence of an unresolved bug "using Kryo with FPGrowth" in the following thread:
https://issues.apache.org/jira/browse/SPARK-7483
Now my questions:
1) What is really the difference between Kryo and Java Serializers?
2) Is one Serializer definitely better in most use cases, and if yes which one?
3) Is it possible to dinamically switch between the 2 Serializers without exiting my Spark session and/or changing/redeploying Spark Configuration in Cloudera Manager? How?
Thank in advance for any insight
Created 03-04-2016 04:29 PM
Hi,
I faced EXACT same issue. to fix this use
conf.registerKryoClasses(Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]]))
in your code. However, you won't get an error but may be incorrect results. I say incorrect because, for some minsupport values it runs fine but for many others it isn't, especially if the support value is low.
I would recommend you to use Java serializer despite it being inefficient. If you could upgrade to spark 1.5.2 or 1.6 it is even better, as some bug fixes have been made in these newer versions.
Here is a link on serialization:
http://spark.apache.org/docs/latest/tuning.html#data-serialization
Created 03-04-2016 04:29 PM
Hi,
I faced EXACT same issue. to fix this use
conf.registerKryoClasses(Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]]))
in your code. However, you won't get an error but may be incorrect results. I say incorrect because, for some minsupport values it runs fine but for many others it isn't, especially if the support value is low.
I would recommend you to use Java serializer despite it being inefficient. If you could upgrade to spark 1.5.2 or 1.6 it is even better, as some bug fixes have been made in these newer versions.
Here is a link on serialization:
http://spark.apache.org/docs/latest/tuning.html#data-serialization
Created on 03-06-2016 11:13 AM - edited 03-06-2016 11:14 AM
Thanks for the suggestion about registering the class and for the additional info. I have to say that I had already read the link you sent me, but I didn't really get what was the meaning of "you have to register the classes first". In fact, I gave up with my tries at the time and sticked to Java Serializer for my testing purposes.
Maybe I'll need to get back to this in the future, and I'll do it with additional knowledge now. Thanks a lot.
Also, I have to say that now, after reading all this, I find it a bit strange that Cloudera sets Kryo as default serializer. Anyway.