
Spark on Yarn: Do nodes need Spark installed?

Expert Contributor

My team needs Spark 2.3 for new features. We have HDP 2.6.3 installed, which includes Spark 2.0 (correction: 2.2.0) in its stack.

Is that enough to meet the version requirement if I use a Docker container with Spark 2.3 as the Spark driver and configure it to use the YARN of the current HDP installation?

Or do I need Spark 2.3 installed on all workers?

What I need to understand is whether the workers (or NodeManagers) need the new Spark libraries once a job is submitted to YARN.

The following note on the Spark Cluster Overview page led me to think it may not be mandatory: "The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime."

Thanks in advance...

1 ACCEPTED SOLUTION

Super Collaborator

You should be able to get Spark 2.3 and only "install" it on the edge node. Since spark-submit essentially starts a YARN job, it will distribute the resources needed at runtime. One thing to note is that the external shuffle service will still be using the HDP-installed libraries, but that should be fine.
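
For illustration, here is a rough sketch of that approach. The distribution file name, paths, HDFS location, and the application class/jar are placeholders, not something taken from your cluster: unpack a Spark 2.3 build on the edge node, point it at the cluster's existing Hadoop/YARN configuration, and optionally stage the Spark 2.3 jars in HDFS so YARN ships them to the containers at runtime.

```
# Sketch only: edge-node-only Spark 2.3 client submitting to the existing HDP YARN cluster.
# Paths, versions, and the example class/jar below are placeholders.

# 1) Unpack a Spark 2.3 build on the edge node (nothing installed on the workers).
tar -xzf spark-2.3.0-bin-hadoop2.7.tgz -C /opt
export SPARK_HOME=/opt/spark-2.3.0-bin-hadoop2.7

# 2) Point the new client at the cluster's existing Hadoop/YARN config.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf

# 3) Optionally stage the Spark 2.3 jars in HDFS so YARN distributes them to the
#    containers at runtime instead of uploading them on every submit.
hdfs dfs -mkdir -p /apps/spark-2.3.0/jars
hdfs dfs -put $SPARK_HOME/jars/*.jar /apps/spark-2.3.0/jars/

# 4) Submit with the Spark 2.3 client; executors pick up the 2.3 jars from HDFS.
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.jars="hdfs:///apps/spark-2.3.0/jars/*.jar" \
  --class com.example.MyApp \
  my-app.jar
```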

More info here: https://spark.apache.org/docs/latest/running-on-yarn.html ; keep in mind that if you get stuck, Hortonworks support will likely be limited in how much assistance it can provide, since only the HDP-installed version of Spark is within the support scope.


3 REPLIES


@Sedat Kestepe

The supported Spark versions with HDP 2.6.3 are Spark 2.2.0 and 1.6.3. Other versions may or may not work, and we definitely don't recommend using them, especially in production environments.

The Spark client does not need to be installed on all the cluster worker nodes, only on the edge nodes that submit the application to the cluster.

As for jar files and whether to include them in your application: I agree with the statement above; as good practice, avoid adding Hadoop/Spark library jars to your application to prevent version mismatch issues.
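
One common way to follow that practice, sketched here assuming an sbt build (the application name is a placeholder; the dependency coordinates are the standard Apache Spark ones), is to mark the Spark dependencies as "provided" so they are available at compile time but not packaged into your application jar:

```
// build.sbt sketch: Spark dependencies marked "provided" so they are not
// bundled into the application jar; the cluster supplies them at runtime.
name := "my-app"                 // placeholder application name
scalaVersion := "2.11.12"        // Spark 2.3 is built for Scala 2.11

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.3.0" % "provided"
)
```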

HTH


Expert Contributor

Thanks for your answer, and also for the warning about the version in the stack. The current Spark2 version is 2.2.0; I am going to correct it in the question.

Both answers are good news to me. Thanks again.