Created 03-07-2017 04:21 PM
I have been trying to change the data serializer for Spark jobs running in my Hortonworks Sandbox (v2.5) from the default Java serializer to the Kryo serializer, as suggested in multiple places (e.g. Here, and more specifically Here). I tried editing the /usr/hdp/current/spark-client/conf/spark-env.sh and /usr/hdp/current/spark-historyserver/conf/spark-env.sh files by adding the following:
SPARK_JAVA_OPTS+=' -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator '
export SPARK_JAVA_OPTS
as recommended Here (near the bottom of the page). However, when I restart Spark using Ambari, these files get overwritten and revert to their original form (i.e., without the above JAVA_OPTS lines). I looked at other questions and posts on this topic, and all of them simply recommend using Kryo serialization without saying how to enable it, especially within a Hortonworks Sandbox.
I have been using Zeppelin notebooks to play around with Spark and build some training pages. Performance is not noticeably diminished yet, but I would like to follow best practices, and this seems to be one I can't crack. I have also looked around the Spark Configs page, and it is not clear how to include this as a configuration.
How do I make Kryo the serializer of choice for my Spark instance in the HDP 2.5 Sandbox (residing inside a VirtualBox VM on my Windows 10 laptop, if it matters :))? I think I see how to set it when spinning up a Spark shell (or PySpark shell) by passing the appropriate configuration to the Spark context, but I don't want to have to do that every time I start using Spark, or Zeppelin with the Spark interpreter.
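For reference, the per-session workaround I mean looks roughly like this (a sketch; `--conf` is the standard way to pass configuration to `spark-shell` or `pyspark`):

```
# Per-session workaround (sketch): pass the serializer settings on the
# command line each time a shell is launched, instead of setting them globally.
spark-shell \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator
```

This works, but it has to be repeated for every shell session and does not help with Zeppelin's Spark interpreter, which is why I want a global setting.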
Created 03-09-2017 05:26 PM
The official Spark Documentation says this:
The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.
Link:
http://spark.apache.org/docs/latest/tuning.html#data-serialization
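If you only want to enable Kryo globally without registering custom classes, the tuning guide's approach corresponds to a single property. A minimal sketch of the `spark-defaults.conf` entry (the path shown is the usual HDP client location; on an Ambari-managed cluster it should be set through Ambari so it is not overwritten):

```
# spark-defaults.conf (e.g., /usr/hdp/current/spark-client/conf/spark-defaults.conf)
# Optionally also set spark.kryo.registrator if custom classes need registration.
spark.serializer org.apache.spark.serializer.KryoSerializer
```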
Created 03-09-2017 06:21 PM
From your description, it sounds like you are asking how to set this parameter through Ambari:
> However, when I restart Spark using Ambari, these files get overwritten and revert to their original form (i.e., without the above JAVA_OPTS lines).
Those files are overwritten because Ambari regenerates them from its own templates on every restart, so you should set the parameter via Ambari instead of editing the files directly.
1. Visit your Ambari (e.g., http://hdp26-1:8080/)
2. Click Spark2 in the left pane.
3. Click `Configs` on the Spark2 page.
4. In "Advanced spark2-env", find "content". There you can see the `spark-env.sh` content managed by Ambari; append your settings to it.
Could you try the above?
Created 03-09-2017 06:49 PM
Thanks so much! That worked. The same solution works for Spark 1.6 running within HDP 2.5, which is what I was using.
Created 03-09-2017 06:51 PM
Great! @Evan Willett
Created 10-11-2017 03:13 PM
Hi @Evan Willett, could you please share the steps for what you did?