Member since
06-02-2020
331
Posts
64
Kudos Received
49
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
1092 | 07-11-2024 01:55 AM | |
3102 | 07-09-2024 11:18 PM | |
2666 | 07-09-2024 04:26 AM | |
2025 | 07-09-2024 03:38 AM | |
2324 | 06-05-2024 02:03 AM |
07-09-2024
11:26 PM
1 Kudo
Based on the event log files, you need to adjust the Spark History Server (SHS) settings. Could you please check whether SHS cleanup is enabled? When it is enabled (spark.history.fs.cleaner.enabled=true), the History Server automatically cleans up old event log files. To load larger event log files, you also need to increase DAEMON_MEMORY_SIZE (the History Server daemon memory, SPARK_DAEMON_MEMORY in the upstream Spark documentation). You can refer to the following article to adjust the SHS parameters: https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options
07-09-2024
06:35 PM
Thanks, @RangaReddy . It solved my problem. 👏
06-17-2024
10:10 PM
Hi @RangaReddy , thank you very much for your response and suggestions. I tried the steps you recommended, and while they were helpful, I found that the issue was ultimately resolved by increasing the executor memory and by setting the spark.file.transferTo=false. I appreciate your assistance.
06-17-2024
07:30 AM
Hi @EFasdfSDfaSDFG

The following formats are supported from Hive: Parquet (default), Avro, ORC.

Create table examples:

CREATE EXTERNAL TABLE test_ice_1 (i INT, t TIMESTAMP, j BIGINT) STORED BY ICEBERG;
CREATE EXTERNAL TABLE test_ice_2 (i INT, t TIMESTAMP) PARTITIONED BY (j BIGINT) STORED BY ICEBERG;
CREATE EXTERNAL TABLE test_ice_3 (i int) STORED AS ORC STORED BY ICEBERG LOCATION '';
CREATE EXTERNAL TABLE test_ice_4 (i int) STORED BY ICEBERG TBLPROPERTIES ('key'='value', 'key'='value');
CREATE EXTERNAL TABLE test_ice_1 (i int) STORED AS ORC STORED BY ICEBERG TBLPROPERTIES ('format-version' = '2');
05-16-2024
02:59 AM
2 Kudos
Yes, this is it. Thank you so much for the prompt response.
04-05-2024
12:12 AM
5 Kudos
Installing Spark3 and Livy3 on Cloudera Manager with CDS3 Parcel

1. Introduction

This guide outlines the steps for installing Apache Spark 3.x on your Cloudera cluster using Cloudera Manager (CM) and CDS3 parcels. It covers how to download, distribute, and activate the CDS3 parcel, and provides resources for troubleshooting issues encountered during the installation.

Note: This article mainly covers the case where the Cloudera Manager Server has Internet access.

2. CDP-compatible CDS3 Parcels Details

CDS 3 Powered by Apache Spark is an add-on service for CDP Private Cloud Base, distributed as a parcel. The Custom Service Descriptor (CSD) file is available in Cloudera Manager for CDP 7.1.X.

The CDS version label is constructed in v.v.v.w.w.xxxx.y-z...z format and carries the following information:

- v.v.v: Apache Spark upstream version, for example 3.3.2
- w.w: Cloudera internal version number, for example 3.3
- xxxx: CDP version number, for example 7190 (referring to CDP Private Cloud Base 7.1.9)
- y: maintenance version, for example 0
- z...z: build number, for example 91

The Spark3 base parcel location is https://archive.cloudera.com/p/spark3

CDP Version | CDS3 Version | Spark Version | Parcel Repository | CSD Installation Required? |
---|---|---|---|---|
7.1.9 | CDS 3.3 | 3.3.2.3.3.7190.0-91 | https://archive.cloudera.com/p/spark3/3.3.7190.0/parcels/ | No |
7.1.8 | CDS 3.3 | 3.3.0.3.3.7180.0-274 | https://archive.cloudera.com/p/spark3/3.3.7180.0/parcels/ | No |
7.1.7 SP2 | CDS 3.2.3 | 3.2.3.3.2.7172000.0-334 | https://archive.cloudera.com/p/spark3/3.2.7172000.0/parcels/ | Yes |
7.1.7 SP1 | CDS 3.2.3 | 3.2.1.3.2.7171000.0-3 | https://archive.cloudera.com/p/spark3/3.2.7171000.0/parcels/ | Yes |

Note: Parcel versions are updated frequently, so make sure you install the latest parcel version.

3. Prerequisites

- A CDP Private Cloud Base cluster with version 7.1.7 or above.
- Cloudera Manager server and cluster nodes with Internet access for downloading the necessary dependencies.
- The Java version and Python version required by your Spark version.
- If the hosts use firewall restrictions, the Spark shuffle port must be opened on the firewall. The shuffle port is configurable and defaults to 7447.

4. Installation Steps

The CDS 3 parcel consists of two components:

- Custom Service Descriptor (CSD) file: A CSD file defines the configuration for managing a new service. It is typically provided as a JAR file.
- Parcel file: A parcel is a binary distribution format containing the program files, along with additional metadata used by Cloudera Manager.

Installation depends on your CDP version:

- CDP versions before 7.1.8: You need to install the CSD file(s) and the parcel file separately.
- CDP versions 7.1.8 and later: The Spark3 and Livy for Spark3 CSD files are included directly within Cloudera Manager, so no separate external CSD files are needed for these components.

Step 1: Install CSD (Custom Service Descriptor) files (required for CDP version 7.1.7 only)

1. Log on to the Cloudera Manager Server host and go to the location configured for service descriptor files. By default, the CSD location is /opt/cloudera/csd.

    cd /opt/cloudera/csd

2. Download the CDS 3.2.3 service descriptor files. Before running the wget commands, replace the following values:
   - `username` and `password`
   - `csd_cdp_version`, for example `3.2.7172000.0`
   - `spark3_csd_version`, for example `3.2.3.3.2.7172000.0-334`
   - `livy3_csd_version`, for example `0.6.3000.3.2.7172000.0-334`

    wget https://<username>:<password>@archive.cloudera.com/p/spark3/<csd_cdp_version>/csd/SPARK3_ON_YARN-<spark3_csd_version>.jar
    wget https://<username>:<password>@archive.cloudera.com/p/spark3/<csd_cdp_version>/csd/LIVY_FOR_SPARK3-<livy3_csd_version>.jar

3. Set the file ownership of the service descriptors to cloudera-scm:cloudera-scm with permission 644.

    chown cloudera-scm:cloudera-scm *
    chmod 644 *

   After changing the ownership, you should see output similar to the following:

    -rw-r--r-- 1 cloudera-scm cloudera-scm 17216 Feb 10 2023 LIVY_FOR_SPARK3-0.6.3000.3.2.7172000.0-334.jar
    -rw-r--r-- 1 cloudera-scm cloudera-scm 20227 Feb 10 2023 SPARK3_ON_YARN-3.2.3.3.2.7172000.0-334.jar

4. Restart the Cloudera Manager Server with the following command:

    systemctl restart cloudera-scm-server

Step 2. Add the CDS Parcel Repository

1. Log in to the Cloudera Manager Admin Console and click Parcels in the left menu.
2. Click Parcel Repositories & Network Settings.
3. In the Remote Parcel Repository URLs section, click the + icon.
4. Enter the CDS3 parcel repository URL provided by Cloudera (see section 2, CDP-compatible CDS3 Parcels Details).
5. Click Save & Verify Configuration. A message with the status of the verification appears above the Remote Parcel Repository URLs section. If the URL is not valid, check the URL and enter the correct one.
6. After the URL is verified, click Close.
7. Locate the row in the table that contains the new Cloudera Runtime parcel, i.e. SPARK3, and click the Download button.
8. After the SPARK3 parcel is downloaded, click the Distribute button to distribute the parcel to all the cluster nodes. Wait for the parcel to be distributed. Cloudera Manager displays the status of the parcel distribution. By default, the Spark3 parcel is downloaded and distributed to the /opt/cloudera/parcels/ location.
9. After the SPARK3 parcel is distributed, click the Activate button to activate the parcel on all the cluster nodes. When prompted, click OK.

The SPARK3 parcel status now shows Distributed, Activated, and a symbolic link named SPARK3 has been created in the /opt/cloudera/parcels/ directory.

Step 3. Add Spark3 Service

1. Navigate to Clusters -> [Your Cluster Name] (for example, Cluster 1), click Actions or More Options (ellipsis icon), then click Add Service.
2. Select Spark 3, then click Continue.
3. Based on your requirements, select any Optional Dependencies, such as Atlas, HBase, Kafka, or Knox, or select No Optional Dependencies, then click Continue.
4. On Assign Roles, select the host where the History Server is to be added. You can add a gateway role to every host.
5. Review Changes, then click Continue.
6. On Command Details, select run options, confirm success, then click Continue.
7. On Summary, click Finish.

The Spark 3 service appears in the Cloudera Manager cluster components list. If Spark 3 was not started by the installation wizard, you can start the service by clicking Actions > Start in the Spark 3 service. Click the stale configuration icon to launch the Stale Configuration wizard, and restart all services with stale configurations.

Step 4. Verify Spark3 Installation

1. Navigate to Clusters -> [Your Cluster Name] (for example, Cluster 1).
2. Verify that the Spark 3 service appears in the list of services.
3. Verify that the Spark 3 service is started and healthy.

Step 5. Add Livy3 Service

1. Navigate to Clusters -> [Your Cluster Name] (for example, Cluster 1), click Actions or More Options (ellipsis icon), then click Add Service.
2. Select Livy for Spark 3, then click Continue.
3. Based on your requirements, select any Optional Dependencies, such as Hive, or select No Optional Dependencies, then click Continue.
4. On Assign Roles, select the host where the Livy Server for Spark 3 is to be added. Adding the Gateway role is optional but recommended.
5. Review Changes, then click Continue.
6. On Command Details, select run options, confirm success, then click Continue.
7. On Summary, click Finish.

The Livy for Spark 3 service appears in the Cloudera Manager cluster components list. If Livy for Spark 3 was not started by the installation wizard, you can start the service by clicking Actions > Start in the Livy for Spark 3 service. Click the stale configuration icon to launch the Stale Configuration wizard, and restart all services with stale configurations.

Step 6. Verify Livy3 Installation

1. Navigate to Clusters -> [Your Cluster Name] (for example, Cluster 1).
2. Verify that the Livy for Spark 3 service appears in the list of services.
3. Verify that the Livy for Spark 3 service is started and healthy.

5. Running SparkPi example using spark-examples.jar file

You can use the following sample SparkPi program to validate your Spark3 installation and explore how to run Spark3 jobs from the command line.

Running SparkPi in YARN Client Mode:

    spark3-submit \
      --master yarn \
      --deploy-mode client \
      --class org.apache.spark.examples.SparkPi \
      /opt/cloudera/parcels/SPARK3/lib/spark3/examples/jars/spark-examples_2.12.jar 10

You will see output similar to the following in the console:

    Pi is roughly 3.142279142279142

Running SparkPi in YARN Cluster Mode:

    spark3-submit \
      --master yarn \
      --deploy-mode cluster \
      --class org.apache.spark.examples.SparkPi \
      /opt/cloudera/parcels/SPARK3/lib/spark3/examples/jars/spark-examples_2.12.jar 10

6. Setting Up and Configuring PySpark

Python is mandatory for running any PySpark application. Before launching a PySpark application, ensure Python is installed and configured within the Spark environment. Python is typically required on each node where the PySpark application executes. Some operating systems come with Python pre-installed and others do not, so verify that a Spark-supported Python version is installed on each node, at a consistent location across your cluster.

The following steps can be skipped if you have already installed a Spark-supported Python version that is compatible with your operating system.

Custom Python library Configuration

Specify the Python binary to be used by the Spark driver and executors by setting the PYSPARK_PYTHON environment variable in spark-env.sh. You can also override the driver Python binary path individually using the PYSPARK_DRIVER_PYTHON environment variable. These settings apply regardless of whether you are using YARN client or cluster mode.

Make sure to set the variables using the export statement. For example:

    export PYSPARK_PYTHON=${PYSPARK_PYTHON:-<path_to_python_executable>}

Here are some example Python binary paths:

- Anaconda parcel: /opt/cloudera/parcels/Anaconda3/bin/python
- Virtual environment: /path/to/virtualenv/bin/python

If you are using YARN cluster mode, in addition to the above, set spark.yarn.appMasterEnv.PYSPARK_PYTHON and spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON in spark-defaults.conf (using the safety valve) to the same paths.

Setting the Custom Python Path steps

The following steps assume you have installed a Python version compatible with your Spark installation.

1. Navigate to Clusters -> [Your Cluster Name] (for example, Cluster 1), go to the Spark 3 service, and click the Configuration tab.
2. Search for Spark 3 Service Advanced Configuration Snippet (Safety Valve) for spark3-conf/spark-env.sh and add the following two parameters, replacing the path with your python3 location (for example, /usr/bin/python3):

    export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/usr/bin/python3}
    export PYSPARK_DRIVER_PYTHON=${PYSPARK_DRIVER_PYTHON:-/usr/bin/python3}

3. Search for Spark 3 Client Advanced Configuration Snippet (Safety Valve) for spark3-conf/spark-defaults.conf and add the following two parameters, replacing the path with your python3 location (for example, /usr/bin/python3):

    spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3
    spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=/usr/bin/python3

4. Enter a Reason for change, and click Save Changes to commit the changes.
5. Restart the Spark 3 service.
6. Deploy the client configuration.

7. Running PySpark SparkPi Example using pi.py file

You can use the following sample PySpark SparkPi program to validate your Spark3 installation and explore how to run Spark3 jobs from the command line.

Running PySpark SparkPi in YARN Client Mode:

    spark3-submit \
      --master yarn \
      --deploy-mode client \
      /opt/cloudera/parcels/SPARK3/lib/spark3/examples/src/main/python/pi.py

You will see output similar to the following in the console:

    Pi is roughly 3.132920

Running PySpark SparkPi in YARN Cluster Mode:

    spark3-submit \
      --master yarn \
      --deploy-mode cluster \
      /opt/cloudera/parcels/SPARK3/lib/spark3/examples/src/main/python/pi.py

8. Reference(s)

- Cloudera.com Downloads - Spark3
- Cloudera Documentation - Spark 3 Overview
- Cloudera Documentation - Running Spark Applications - Spark Python
- Anaconda Documentation - Working with conda - Cloudera
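As an illustration of the version label format described in section 2 (v.v.v.w.w.xxxx.y-z...z), the following is a minimal Python sketch that splits a label into its parts and derives the parcel repository URL. The names `parse_cds_version`, `CdsVersion`, and `parcel_repo_url` are hypothetical helpers made up for this example and are not part of any Cloudera tooling. The URL layout is inferred only from the rows of the table in section 2, so treat it as an assumption and verify it against the archive.

```python
import re
from typing import NamedTuple

# Pattern for v.v.v.w.w.xxxx.y-z...z, for example 3.3.2.3.3.7190.0-91
_LABEL = re.compile(
    r"^(?P<spark>\d+\.\d+\.\d+)\."  # v.v.v  upstream Apache Spark version
    r"(?P<cds>\d+\.\d+)\."          # w.w    Cloudera internal version number
    r"(?P<cdp>\d+)\."               # xxxx   CDP version number
    r"(?P<maint>\d+)-"              # y      maintenance version
    r"(?P<build>\d+)$"              # z...z  build number
)


class CdsVersion(NamedTuple):
    """Hypothetical container for the parts of a CDS version label."""
    spark: str
    cds: str
    cdp: str
    maintenance: str
    build: str

    @property
    def repo_path(self) -> str:
        # Directory name as it appears in the table, for example 3.3.7190.0
        return f"{self.cds}.{self.cdp}.{self.maintenance}"


def parse_cds_version(label: str) -> CdsVersion:
    """Split a CDS version label into its documented components."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a CDS version label: {label!r}")
    return CdsVersion(match["spark"], match["cds"], match["cdp"],
                      match["maint"], match["build"])


def parcel_repo_url(label: str) -> str:
    """Assumed URL layout, inferred from the table in section 2."""
    version = parse_cds_version(label)
    return f"https://archive.cloudera.com/p/spark3/{version.repo_path}/parcels/"


if __name__ == "__main__":
    print(parse_cds_version("3.3.2.3.3.7190.0-91"))
    print(parcel_repo_url("3.2.3.3.2.7172000.0-334"))
```

For the 7.1.9 row, the label 3.3.2.3.3.7190.0-91 yields Spark 3.3.2, CDS 3.3, CDP 7190, maintenance 0, and build 91, which matches the breakdown given in section 2.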
04-03-2024
12:17 AM
I think I've found the reason for the problem. It's not related to the Spark version. I used the Java process analysis tool Arthas to investigate and found that the AM startup process was blocked at the creation of the Timeline client. And this problem might be due to our TimeLine service using an embedded HBase service. When we configured the HBase used by the TimeLine service to our production environment's HBase, the problem disappeared.
03-06-2024
02:16 AM
1 Kudo
Hi @Sidhartha

Could you please try the following sample:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Map a Hive type name to the corresponding Spark SQL DataType.
def convertDatatype(datatype: String): DataType = datatype match {
  case "string" => StringType
  case "short" => ShortType
  case "int" => IntegerType
  case "bigint" => LongType
  case "float" => FloatType
  case "double" => DoubleType
  case "decimal" => DecimalType(38, 30)
  case "date" => DateType
  case "boolean" => BooleanType
  case "timestamp" => TimestampType
  // Fail with a clear message instead of a MatchError for unknown types.
  case other => throw new IllegalArgumentException(s"Unsupported datatype: $other")
}

// Sample rows: id, name, age, salary
val input_data = List(Row(1L, "Ranga", 27, BigDecimal(306.000000000000000000)), Row(2L, "Nishanth", 6, BigDecimal(606.000000000000000000)))
val input_rdd = spark.sparkContext.parallelize(input_data)

// Build the schema from a "name:type" list in Hive column notation.
val hiveCols = "id:bigint,name:string,age:int,salary:decimal"
val schemaList = hiveCols.split(",")
val schemaStructType = new StructType(schemaList.map(col => col.split(":")).map(e => StructField(e(0), convertDatatype(e(1)), true)))

val myDF = spark.createDataFrame(input_rdd, schemaStructType)
myDF.printSchema()
myDF.show()

// Cast the decimal column to double in a new column.
val myDF2 = myDF.withColumn("new_salary", col("salary").cast("double"))
myDF2.printSchema()
myDF2.show()
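For PySpark users, the same "name:type" column list can be turned into a DDL-style schema string. The sketch below is plain Python with no Spark dependency. `hive_cols_to_ddl` and `HIVE_TO_SPARK_DDL` are hypothetical names made up for this example; `date` is mapped to `DATE` here, and the decimal precision and scale (38,30) are taken from the Scala sample. Passing the resulting string as the `schema` argument of `spark.createDataFrame` is assumed to work in your Spark version, so please verify it in your environment.

```python
# Hive type name -> Spark SQL DDL type (assumed mapping for illustration).
HIVE_TO_SPARK_DDL = {
    "string": "STRING",
    "short": "SMALLINT",
    "int": "INT",
    "bigint": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "decimal": "DECIMAL(38,30)",
    "date": "DATE",
    "boolean": "BOOLEAN",
    "timestamp": "TIMESTAMP",
}


def hive_cols_to_ddl(hive_cols: str) -> str:
    """Convert 'id:bigint,name:string' into 'id BIGINT, name STRING'."""
    fields = []
    for spec in hive_cols.split(","):
        # Split each 'name:type' pair on the first colon.
        name, _, type_name = spec.strip().partition(":")
        key = type_name.strip().lower()
        if not name.strip() or key not in HIVE_TO_SPARK_DDL:
            # Reject unknown types instead of silently guessing.
            raise ValueError(f"unsupported column spec: {spec!r}")
        fields.append(f"{name.strip()} {HIVE_TO_SPARK_DDL[key]}")
    return ", ".join(fields)


if __name__ == "__main__":
    print(hive_cols_to_ddl("id:bigint,name:string,age:int,salary:decimal"))
```

Rejecting unknown type names up front keeps a typo in the column list from surfacing later as a harder-to-read schema error.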
02-27-2024
10:10 PM
Spark Scala Version Compatibility Matrix
1. Introduction
Apache Spark, a widely used framework for big data processing, relies heavily on Scala as its primary programming language. Keeping Spark and Scala versions compatible is essential for developers who want to use the latest features and optimizations while maintaining stability in their Spark applications.
In this article, we'll provide a comprehensive overview of the compatibility matrix between different versions of Spark and Scala, helping developers choose the right combination for their projects.
Key Considerations:
Several factors influence compatibility between Spark and Scala versions:
API changes: New features or modifications in Spark APIs might require specific Scala versions for proper compilation and execution.
Library dependencies: Third-party libraries used within your Spark application might have compatibility requirements with both Spark and Scala versions.
Community support: Older Spark versions might have limited community support and resources, impacting problem-solving and maintenance.
2. Spark Scala Version Compatibility Matrix
Here's a compatibility matrix for Spark and Scala versions:
Spark Version
Supported Scala Binary Version(s)
Cloudera Supported Binary Version(s)
Scala 2.11
Scala 2.12
Scala 2.13
3.5.0
2.12/2.13
2.12
2.12.18
2.13.8
3.4.2
2.12/2.13
2.12
2.12.17
2.13.8
3.4.1
2.12/2.13
2.12
2.12.17
2.13.8
3.4.0
2.12/2.13
2.12
2.12.17
2.13.8
3.3.4
2.12/2.13
2.12
2.12.15
2.13.8
3.3.3
2.12/2.13
2.12
2.12.15
2.13.8
3.3.2
2.12/2.13
2.12
2.12.15
2.13.8
3.3.1
2.12/2.13
2.12
2.12.15
2.13.8
3.3.0
2.12/2.13
2.12
2.12.15
2.13.8
3.2.4
2.12/2.13
2.12
2.12.15
2.13.5
3.2.3
2.12/2.13
2.12
2.12.15
2.13.5
3.2.2
2.12/2.13
2.12
2.12.15
2.13.5
3.2.1
2.12/2.13
2.12
2.12.15
2.13.5
3.2.0
2.12/2.13
2.12
2.12.15
2.13.5
3.1.3
2.12
2.12
2.12.10
3.1.2
2.12
2.12
2.12.10
3.1.1
2.12
2.12
2.12.10
3.0.3
2.12
2.12
2.12.10
3.0.2
2.12
2.12
2.12.10
3.0.1
2.12
2.12
2.12.10
3.0.0
2.12
2.12
2.12.10
2.4.8
2.11/2.12
2.11
2.11.12
2.12.10
2.4.7
2.11/2.12
2.11
2.11.12
2.12.10
2.4.6
2.11/2.12
2.11
2.11.12
2.12.10
2.4.5
2.11/2.12
2.11
2.11.12
2.12.10
2.4.4
2.11/2.12
2.11
2.11.12
2.12.8
2.4.3
2.11/2.12
2.11
2.11.12
2.12.8
2.4.2
2.11/2.12
2.11
2.11.12
2.12.8
2.4.1
2.11/2.12
2.11
2.11.12
2.12.8
2.4.0
2.11/2.12
2.11
2.11.12
2.12.7
References:
Spark Project SQL cloudera-repos
Spark Project SQL
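The matrix above can also be expressed as a small lookup, which is handy when choosing the Scala suffix of a Spark artifact (for example spark-sql_2.12). The following is a minimal Python sketch that encodes only the "Supported Scala Binary Version(s)" column of this article. The function names `supported_scala_binaries` and `artifact_id` are hypothetical, and Spark releases outside the listed 2.4.x to 3.5.x range are deliberately rejected rather than guessed.

```python
from typing import Tuple


def supported_scala_binaries(spark_version: str) -> Tuple[str, ...]:
    """Return the Scala binary versions the matrix lists for a Spark release."""
    parts = spark_version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    if (major, minor) == (2, 4):
        return ("2.11", "2.12")
    if (major, minor) in ((3, 0), (3, 1)):
        return ("2.12",)
    if major == 3 and 2 <= minor <= 5:
        return ("2.12", "2.13")
    # Anything else is not covered by the matrix in this article.
    raise ValueError(f"Spark {spark_version} is not covered by the matrix")


def artifact_id(module: str, scala_binary: str) -> str:
    """Build a Scala-suffixed artifact name such as spark-sql_2.12."""
    return f"{module}_{scala_binary}"


if __name__ == "__main__":
    for version in ("2.4.8", "3.1.3", "3.3.2"):
        binaries = supported_scala_binaries(version)
        print(version, [artifact_id("spark-sql", b) for b in binaries])
```

Note that the Cloudera-supported binary version in the matrix is narrower than the upstream list (2.11 for Spark 2.4.x, 2.12 for Spark 3.x), so a Cloudera deployment would normally pick that one.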
3. JDK & Scala compatibility
Minimum Scala versions:
JDK Version | Scala 3 | Scala 2.13 | Scala 2.12 | Scala 2.11 |
---|---|---|---|---|
22 (ea) | 3.3.2 | 2.13.12 | 2.12.19 | |
21 (LTS) | 3.3.1 | 2.13.11 | 2.12.18 | |
20 | 3.3.0 | 2.13.11 | 2.12.18 | |
19 | 3.2.0 | 2.13.9 | 2.12.16 | |
18 | 3.1.3 | 2.13.7 | 2.12.15 | |
17 (LTS) | 3.0.0 | 2.13.6 | 2.12.15 | |
11 (LTS) | 3.0.0 | 2.13.0 | 2.12.4 | 2.11.12 |
8 (LTS) | 3.0.0 | 2.13.0 | 2.12.0 | 2.11.0 |
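To show how the minimum-version table above can be applied, here is a minimal Python sketch that checks whether a given Scala version meets the listed minimum for a JDK. The dictionary simply transcribes the table, and the names `MIN_SCALA_FOR_JDK`, `scala_line`, and `is_scala_supported_on_jdk` are hypothetical. A missing entry (for example Scala 2.11 on JDK 17) is treated as "not supported", which is an assumption about how to read the empty cells.

```python
# Minimum Scala version per JDK, transcribed from the table above.
MIN_SCALA_FOR_JDK = {
    22: {"3": "3.3.2", "2.13": "2.13.12", "2.12": "2.12.19"},
    21: {"3": "3.3.1", "2.13": "2.13.11", "2.12": "2.12.18"},
    20: {"3": "3.3.0", "2.13": "2.13.11", "2.12": "2.12.18"},
    19: {"3": "3.2.0", "2.13": "2.13.9", "2.12": "2.12.16"},
    18: {"3": "3.1.3", "2.13": "2.13.7", "2.12": "2.12.15"},
    17: {"3": "3.0.0", "2.13": "2.13.6", "2.12": "2.12.15"},
    11: {"3": "3.0.0", "2.13": "2.13.0", "2.12": "2.12.4", "2.11": "2.11.12"},
    8: {"3": "3.0.0", "2.13": "2.13.0", "2.12": "2.12.0", "2.11": "2.11.0"},
}


def _as_tuple(version: str) -> tuple:
    """Turn '2.12.15' into (2, 12, 15) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def scala_line(version: str) -> str:
    """Map '2.12.15' to '2.12' and any 3.x.y version to '3'."""
    parts = version.split(".")
    return "3" if parts[0] == "3" else ".".join(parts[:2])


def is_scala_supported_on_jdk(scala_version: str, jdk: int) -> bool:
    """True if scala_version is at or above the table minimum for this JDK."""
    if jdk not in MIN_SCALA_FOR_JDK:
        raise ValueError(f"JDK {jdk} is not listed in the table")
    minimum = MIN_SCALA_FOR_JDK[jdk].get(scala_line(scala_version))
    # An empty table cell means no minimum is listed for that Scala line.
    return minimum is not None and _as_tuple(scala_version) >= _as_tuple(minimum)


if __name__ == "__main__":
    print(is_scala_supported_on_jdk("2.12.15", 17))
    print(is_scala_supported_on_jdk("2.12.10", 17))
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "2.12.9" would sort after "2.12.15".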
4. Conclusion
Ensuring the compatibility between Spark and Scala versions is crucial for the successful development and deployment of Spark applications. By referring to this compatibility matrix, developers can make informed decisions regarding the selection of Spark and Scala versions based on their project requirements and constraints.
Thank you for taking the time to read this article. We hope you found it informative and helpful in enhancing your understanding of the topic. If you have any questions or feedback, please feel free to reach out to me. Remember, your support motivates us to continue creating valuable content. If this article helped you, please consider giving it a like and providing a kudos. We appreciate your support!
02-25-2024
10:06 PM
https://community.cloudera.com/t5/Community-Articles/Spark-and-Java-versions-Supportability-Matrix/ta-p/383669