Member since: 01-13-2017
Posts: 4
Kudos Received: 1
Solutions: 0
02-16-2017
06:29 AM
Yup, people should already be very careful about it. On the other hand, there are people on older CDH versions with no Spark 2 support available, or people just trying to figure out whether a vanilla (newer) version of Spark has some bug(s) fixed, or with whatever other reason works for them. Regards.
02-16-2017
06:20 AM
1 Kudo
Hi guys, thanks to @Deenar Toraskar, CFA for the following guide; I'm sharing it with all of you. Extracted from: https://www.linkedin.com/pulse/running-spark-2xx-cloudera-hadoop-distro-cdh-deenar-toraskar-cfa

The Guide:

Spark 2.0 has just been released and has many features that make Spark easier, faster, and smarter. The latest Cloudera Hadoop distribution (CDH 5.8.0) currently ships with Spark 1.6, or you may be running an even earlier version of CDH. This post shows how to run Spark 2.0 on your CDH cluster.

Since Spark can be run as a YARN application, it is possible to run a Spark version other than the one bundled with the Cloudera distribution. This requires no administrator privileges and no changes to the cluster configuration, and it can be done by any user who has permission to run a YARN job on the cluster. A YARN application ships all its dependencies over to the cluster on each invocation, so you can run multiple Spark versions simultaneously on a YARN cluster. Each version of Spark is self-contained in the user workspace on the edge node, and running a new Spark version will not affect any other jobs running on your cluster.

1. Find the version of CDH and Hadoop running on your cluster:
$ hadoop version
Hadoop 2.6.0-cdh5.4.8

2. Download Spark and extract the sources. Pre-built Spark binaries should work out of the box with most CDH versions, unless there are custom fixes in your CDH build, in which case you can use spark-2.0.0-bin-without-hadoop.tgz.

3. (Optional) You can also build Spark yourself by opening the distribution directory in the shell and running the following command, using the CDH and Hadoop version from step 1:
$ ./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
Note: With Spark 2.0 the default build uses Scala 2.11. If you need to stick to Scala 2.10, use the -Dscala-2.10 property or run
$ ./dev/change-scala-version.sh 2.10
Note that -Phadoop-provided enables the profile to build the assembly without including Hadoop-ecosystem dependencies provided by Cloudera.

4. Extract the tgz file:
$ tar -xvzf /path/to/spark-2.0.0-bin-hadoop2.6.tgz

5. cd into the custom Spark distribution and configure it by copying the configuration from your current Spark version:
$ cp -R /etc/spark/conf/* conf/
$ cp /etc/hive/conf/hive-site.xml conf/

6. Change SPARK_HOME to point to the folder with the Spark 2.0 distribution:
$ sed -i "s#\(.*SPARK_HOME\)=.*#\1=$(pwd)#" conf/spark-env.sh

7. Change spark.master from yarn-client to yarn in spark-defaults.conf:
$ sed -i 's/spark.master=yarn-client/spark.master=yarn/' conf/spark-defaults.conf

8. Delete spark.yarn.jar from spark-defaults.conf:
$ sed -i '/spark.yarn.jar/d' conf/spark-defaults.conf

9. Finally, test your new Spark installation:
$ ./bin/run-example SparkPi 10 --master yarn
$ ./bin/spark-shell --master yarn
$ ./bin/pyspark

10. Update log4j.properties to suppress annoying warnings by appending the following to conf/log4j.properties:
$ echo "log4j.logger.org.spark_project.jetty=ERROR" >> conf/log4j.properties

[Optional but highly recommended: set either spark.yarn.archive or spark.yarn.jars]

Demise of assemblies: Spark 2.0 is moving away from the huge assembly file toward a directory full of jars to distribute its dependencies. See SPARK-11157 and https://issues.apache.org/jira/secure/attachment/12767129/no-assemblies.pdf for more information. As a result, spark.yarn.jar is now superseded by spark.yarn.jars or spark.yarn.archive, which tell YARN which Spark jars to use.
Make a Spark YARN archive or copy the Spark jars to HDFS.

Either build an archive and set spark.yarn.archive:
$ cd $SPARK_HOME
$ zip spark-archive.zip jars/*
$ hadoop fs -copyFromLocal spark-archive.zip
$ echo "spark.yarn.archive=hdfs:///nameservice1/user/<yourusername>/spark-archive.zip" >> conf/spark-defaults.conf

OR copy the jars and set spark.yarn.jars:
$ cd $SPARK_HOME
$ hadoop fs -mkdir spark-2.0.0-bin-hadoop
$ hadoop fs -copyFromLocal jars/* spark-2.0.0-bin-hadoop
$ echo "spark.yarn.jars=hdfs:///nameservice1/user/<yourusername>/spark-2.0.0-bin-hadoop/*" >> conf/spark-defaults.conf

If you do have access to the local directories of all the nodes in your cluster, you can instead copy the archive or the Spark jars to a local directory on each of the data nodes using rsync or scp; just update the URLs from hdfs: to local:, as sketched below.
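For that local: variant, here is a minimal sketch of what it could look like. The nodes.txt file (one worker hostname per line) and the /opt/spark-2.0.0-bin-hadoop path are hypothetical names, not from the original guide; adjust them to your environment and make sure you have write access to that path on every node:

$ cd $SPARK_HOME
$ for host in $(cat nodes.txt); do
    # create the target directory on each node, then copy the Spark 2.0 jars to the same local path everywhere
    ssh "$host" mkdir -p /opt/spark-2.0.0-bin-hadoop/jars
    rsync -a jars/ "$host":/opt/spark-2.0.0-bin-hadoop/jars/
  done
$ echo "spark.yarn.jars=local:/opt/spark-2.0.0-bin-hadoop/jars/*" >> conf/spark-defaults.conf

The local: scheme tells YARN to read the jars straight from each node's disk instead of distributing them from HDFS, which avoids shipping them on every job submission.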
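As a final sanity check (my suggestion, not part of the original guide), you can confirm the edge node will actually launch the new build and that your spark.yarn.* setting was appended, before submitting anything heavier to the cluster:

$ cd $SPARK_HOME
$ ./bin/spark-submit --version                 # should report 2.0.0, not the CDH-bundled 1.6
$ grep spark.yarn conf/spark-defaults.conf     # the archive/jars line you added above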