Created 03-07-2017 04:21 PM
I have been trying to change the data serializer for Spark jobs running in my Hortonworks Sandbox (v2.5) from the default Java serializer to the Kryo serializer, as suggested in multiple places (e.g. Here, and more specifically Here). I tried editing the /usr/hdp/current/spark-client/conf/spark-env.sh and /usr/hdp/current/spark-historyserver/conf/spark-env.sh files by adding the following:
SPARK_JAVA_OPTS+=' -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator '
export SPARK_JAVA_OPTS
as recommended Here (near the bottom of the page). However, when I restart Spark using Ambari, these files get overwritten and revert to their original form (i.e., without the above JAVA_OPTS lines). I looked at other questions and posts on this topic, and all of them simply recommend using Kryo serialization without saying how to enable it, especially within a Hortonworks Sandbox.
I have been using Zeppelin notebooks to play around with Spark and build some training pages. Performance is not noticeably diminished yet, but I would like to follow best practices, and this seems to be one I can't crack. I have also looked around the Spark Configs page, and it is not clear how to include this as a configuration.
How do I make Kryo the serializer of choice for my Spark instance in the HDP 2.5 Sandbox (residing inside a VirtualBox VM on my Windows 10 laptop, if it matters :))? I think I see how to set it when spinning up a Spark shell (or PySpark shell) by passing the appropriate configuration to the Spark context, but I don't want to have to do that every time I start using Spark, or Zeppelin with the Spark interpreter.
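For reference, the per-session workaround I mean looks roughly like this (a sketch; `--conf` is the standard way to pass configuration to `spark-shell` or `pyspark`):

```
# Per-session workaround (sketch): pass the serializer settings on the
# command line each time a shell is launched, instead of setting them globally.
spark-shell \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator
```

This works, but it has to be repeated for every shell session and does not help with Zeppelin's Spark interpreter, which is why I want a global setting.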
Created 03-09-2017 05:26 PM
The official Spark Documentation says this:
The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.
Link:
http://spark.apache.org/docs/latest/tuning.html#data-serialization
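If you only want to enable Kryo globally without registering custom classes, the tuning guide's approach corresponds to a single property. A minimal sketch of the `spark-defaults.conf` entry (the path shown is the usual HDP client location; on an Ambari-managed cluster it should be set through Ambari so it is not overwritten):

```
# spark-defaults.conf (e.g., /usr/hdp/current/spark-client/conf/spark-defaults.conf)
# Optionally also set spark.kryo.registrator if custom classes need registration.
spark.serializer org.apache.spark.serializer.KryoSerializer
```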
Created 03-09-2017 06:21 PM
From your description, it sounds like you are asking how to set this parameter through Ambari:
> However, when I restart Spark using Ambari, these files get overwritten and revert to their original form (i.e., without the above JAVA_OPTS lines).
Those files are overwritten because Ambari regenerates them from its own templates on every restart, so you should set the parameter via Ambari instead of editing the files directly.
1. Visit your Ambari (e.g., http://hdp26-1:8080/)
2. Click Spark2 in the left pane.
3. Click `Configs` on the Spark2 page.
4. In "Advanced spark2-env", find "content". There you can see the `spark-env.sh` content managed by Ambari; append your settings to it.
Could you try the above?
Created 03-09-2017 06:49 PM
Thanks so much! That worked. The same solution works for Spark 1.6 running within HDP 2.5, which is what I was using.
Created 03-09-2017 06:51 PM
Great! @Evan Willett
Created 10-11-2017 03:13 PM
Hi @Evan Willett, could you please share the steps for what you did?