
JavaSerializer vs. KryoSerializer

Expert Contributor

Hi, I'm experimenting with a small CDH virtual cluster and I'm facing issues with serializers.

 

More specifically, I'm trying things with the "pyspark.mllib.fpm.FPGrowth" class (Machine Learning). This is what I'm trying to do:

 

from pyspark.mllib.fpm import FPGrowth

# Train an FP-Growth model with minSupport=0.3, then collect the frequent itemsets
fp_model = FPGrowth.train(my_words_DF, 0.3)
freqset = sorted(fp_model.freqItemsets().collect())

 

At this point, depending on which serializer I have configured in Spark, I get two different outcomes.

 

If in "Cloudera Manager --> Spark --> Configuration --> Spark Data Serializer" I configure "org.apache.spark.serializer.JavaSerializer", then everything works FINE.

 

If in "Cloudera Manager --> Spark --> Configuration --> Spark Data Serializer" I configure "org.apache.spark.serializer.KryoSerializer" (which is the DEFAULT setting, by the way), then when I collect the "freqItemsets" I get the following exception:

 

com.esotericsoftware.kryo.KryoException:
java.lang.IllegalArgumentException:
Can not set final scala.collection.mutable.ListBuffer field
org.apache.spark.mllib.fpm.FPTree$Summary.nodes
to scala.collection.mutable.ArrayBuffer

This exception is confirmed to be a consequence of an unresolved bug in using Kryo with FPGrowth, tracked in the following JIRA:

 

https://issues.apache.org/jira/browse/SPARK-7483

 

Now my questions:

 

1) What is really the difference between Kryo and Java Serializers?

 

2) Is one serializer definitely better in most use cases, and if so, which one?

 

3) Is it possible to dynamically switch between the two serializers without exiting my Spark session and/or changing/redeploying the Spark configuration in Cloudera Manager? How?

 

Thanks in advance for any insight.

 

 

 

 

1 ACCEPTED SOLUTION

New Contributor

Hi,

 

I faced the EXACT same issue. To work around it, use

import scala.collection.mutable.{ArrayBuffer, ListBuffer}

conf.registerKryoClasses(Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]]))

in your code. However, while you won't get an error anymore, you may get incorrect results. I say incorrect because for some minSupport values it runs fine, but for many others it doesn't, especially if the support value is low.
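Since the original question uses PySpark, here is a sketch of the same registration expressed through Spark configuration properties instead of the Scala API. This is an assumption-level workaround, not a verified fix; the class names simply match the ones from the exception, and the spark.kryo.* properties must be set before the SparkContext is created:

```python
from pyspark import SparkConf, SparkContext

# Sketch: register the Scala collection classes named in the Kryo exception.
# Kryo settings only take effect at SparkContext creation time.
conf = (SparkConf()
        .setAppName("fpgrowth-kryo-workaround")  # hypothetical app name
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.classesToRegister",
             "scala.collection.mutable.ArrayBuffer,"
             "scala.collection.mutable.ListBuffer"))
sc = SparkContext(conf=conf)
```

The same caveat as above applies: registration may silence the exception without guaranteeing correct FPGrowth results.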

 

I would recommend using the Java serializer despite it being less efficient. If you could upgrade to Spark 1.5.2 or 1.6, that would be even better, as some related bug fixes have been made in those newer versions.
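Regarding your question 3: as far as I know, spark.serializer is read when the SparkContext starts, so it cannot be changed inside a running session. You can, however, override the Cloudera Manager default per application without redeploying anything. A minimal sketch (app name is hypothetical):

```python
from pyspark import SparkConf, SparkContext

# Sketch: override the cluster-wide default serializer for this application only.
# The setting is fixed once the SparkContext is created; to switch serializers,
# stop the context (sc.stop()) and create a new one with a different conf.
conf = (SparkConf()
        .setAppName("fpgrowth-java-serializer")
        .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer"))
sc = SparkContext(conf=conf)
```

Equivalently, you can pass --conf spark.serializer=org.apache.spark.serializer.JavaSerializer to spark-submit for a single job.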

 

Here is a link on serialization:

 

http://spark.apache.org/docs/latest/tuning.html#data-serialization

 

 


2 REPLIES


Expert Contributor

Thanks for the suggestion about registering the classes, and for the additional info. I have to say that I had already read the link you sent, but I didn't really understand what "you have to register the classes first" meant. In fact, I gave up on my attempts at the time and stuck with the Java serializer for my testing purposes.

 

I may need to come back to this in the future, and I'll be able to do so with additional knowledge now. Thanks a lot.

 

Also, after reading all this, I find it a bit strange that Cloudera sets Kryo as the default serializer. Anyway.