Created 06-05-2023 02:05 AM
My manager is forcing me to find a way to install and use Spark 3 on a CDH 6.x cluster. Is there any chance of that working?
When I did some research, I found that only CDP 7.x supports Spark 3, and CDH 6.x only supports Spark 2. But my manager said you don't need to install Spark through Cloudera Manager: you can install Spark 3 separately (by downloading a tarball from the internet or something like that) and then find a way to make that Spark installation talk to Cloudera services like Hive, HDFS, etc. (by copying hive-site.xml, hdfs-site.xml, etc. into the Spark conf folder, maybe?).
So does anyone have any experience with this? My manager is insane!!!!
Created 06-08-2023 11:41 PM
I've successfully set up Spark 3.3.0 on CDH 6.2 (we use YARN). Here are the important steps:
1. Back up the current Spark that ships with the Cloudera parcel (v2.4.0, I think) at /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark
2. Download a Spark build from the Spark homepage, for example "spark-3.3.0-bin-hadoop3.tgz". Extract it, delete the old spark folder, and put the new folder (renamed to "spark") in its place at /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark
3. Copy all the config files from the old Spark conf folder into the new Spark conf folder
4. Copy the YARN-related config file into the Spark conf folder too
4.1. Copy spark-3.3.0-yarn-shuffle.jar from spark/yarn to the spark/jars folder (steps 1-4.1 are sketched as shell commands after this list)
5. Make some modifications to spark-defaults.conf, mostly to disable the event log and point to the right jars folder (see the example after this list)
6. Modify some YARN config in yarn-site.xml (see the shuffle-service snippet after this list)
7. Restart the cluster and run the spark-shell command. Run some queries for testing. You can also edit yarn-site.xml in the Spark conf folder directly to make sure the changes are picked up.
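Steps 1-4.1 as shell commands, a minimal sketch assuming the parcel path above; the backup folder name and the /etc/hadoop/conf location of the YARN client config are my assumptions, and everything has to be repeated on every node:

# Parcel path from the steps above; adjust to your CDH version/hash.
PARCEL=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib

# 1. Back up the Spark 2.4 that ships with the parcel (backup name is arbitrary).
mv "$PARCEL/spark" "$PARCEL/spark-2.4-backup"

# 2. Download Spark 3.3.0 from the Apache archive and drop it in as "spark".
wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
tar -xzf spark-3.3.0-bin-hadoop3.tgz
mv spark-3.3.0-bin-hadoop3 "$PARCEL/spark"

# 3 and 4. Copy the old Spark configs plus the YARN client config.
cp "$PARCEL/spark-2.4-backup/conf/"* "$PARCEL/spark/conf/"
cp /etc/hadoop/conf/yarn-site.xml "$PARCEL/spark/conf/"

# 4.1. Put the YARN shuffle jar next to the rest of the Spark jars.
cp "$PARCEL/spark/yarn/spark-3.3.0-yarn-shuffle.jar" "$PARCEL/spark/jars/"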
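For step 5, the exact lines aren't shown in the post; the property names below are standard Spark settings, but the values are my guess at what "disable log and point to the right jar folder" means in spark-defaults.conf (note the "s", not spark-default.conf):

# Turn off event logging (the copied CDH config points at the old history server).
spark.eventLog.enabled    false
# Ship the Spark 3.3.0 jars from the new jars folder instead of the old parcel jars.
spark.yarn.jars           local:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/jars/*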
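For step 6, the original snippet is missing from the post; the standard yarn-site.xml properties for Spark's external shuffle service on YARN (my best guess at what was meant) go inside <configuration> on every NodeManager:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>

This pairs with the shuffle jar from step 4.1, which the NodeManagers need on their classpath.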
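For step 7, a quick smoke test after the restart (assuming hive-site.xml was among the configs copied in step 3):

spark-shell --master yarn
scala> spark.version                          // should print 3.3.0
scala> spark.sql("show databases").show()     // verifies the Hive metastore connection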