03-09-2017 06:49 PM
Thanks so much! That worked. The same solution works for Spark 1.6 operating within HDP 2.5, which is what I was using.
03-07-2017 04:21 PM
I have been trying to change the data serializer for Spark jobs running in my Hortonworks Sandbox (v2.5) from the default Java serializer to the Kryo serializer, as suggested in multiple places (e.g. Here, and more specifically Here). I tried editing /usr/hdp/current/spark-client/conf/spark-env.sh and /usr/hdp/current/spark-historyserver/conf/spark-env.sh to include the following, as recommended Here (near the bottom of the page):

SPARK_JAVA_OPTS+='
 -Dspark.serializer=org.apache.spark.serializer.KryoSerializer
 -Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator '
export SPARK_JAVA_OPTS

However, when I restart Spark using Ambari, these files get overwritten and revert to their original form (i.e., without the lines above). I have looked at other questions and posts on this topic, and all of them simply recommend using Kryo serialization without explaining how to enable it, especially within a Hortonworks Sandbox.

I have been using Zeppelin notebooks to play around with Spark and build some training pages. Performance is not yet noticeably diminished, but I would like to follow best practices, and this seems to be one I can't crack. I have also looked around the Spark Configs page, and it is not clear how to add this as a configuration.

How do I make Kryo the serializer of choice for my Spark instance in the HDP 2.5 Sandbox (residing inside a VirtualBox VM on my Windows 10 laptop, if it matters :))? I can see how to set it when spinning up a Spark shell (or PySpark shell) by passing the appropriate configurations to the SparkContext, but I don't want to have to do that every time I start Spark, or Zeppelin with the Spark interpreter.
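For reference, here is a sketch of what I expect the equivalent spark-defaults.conf entries to look like, assuming the same two properties carry over as-is and that Ambari-managed settings (e.g. a "Custom spark-defaults" section under Spark > Configs) would survive a restart where hand-edits to spark-env.sh do not:

# spark-defaults.conf (hypothetical entries; property names taken from the
# -D options above, just without the -D prefix)
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator  org.apache.spark.graphx.GraphKryoRegistrator

The same properties could presumably also be passed per-session with --conf flags to spark-shell or spark-submit, but a defaults-file setting is what I'm after so it applies everywhere, including the Zeppelin Spark interpreter.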