Member since: 06-02-2020
Posts: 331
Kudos Received: 66
Solutions: 49
My Accepted Solutions
Views | Posted
---|---
2266 | 07-11-2024 01:55 AM
6417 | 07-09-2024 11:18 PM
5145 | 07-09-2024 04:26 AM
4659 | 07-09-2024 03:38 AM
4680 | 06-05-2024 02:03 AM
07-11-2024
01:55 AM
Hi @MoatazNader Yes, you can create, update, and delete Iceberg table data using Impala in CDP 7.1.9.

Creating tables: https://docs.cloudera.com/cdp-private-cloud-base/7.1.9/iceberg-how-to/topics/iceberg-table-creation.html
Inserting data: https://docs.cloudera.com/cdp-private-cloud-base/7.1.9/iceberg-how-to/topics/iceberg-insert-table-data.html
Updating/deleting rows: https://docs.cloudera.com/cdp-private-cloud-base/7.1.9/iceberg-how-to/topics/iceberg-best-practice-row-modifications.html
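For example, a minimal end-to-end sketch in impala-shell (the table name and columns are illustrative; check the linked docs for the exact syntax and the operations supported by your release):

-- Iceberg V2 tables are required for row-level updates/deletes
CREATE TABLE ice_demo (id INT, name STRING)
STORED BY ICEBERG
TBLPROPERTIES ('format-version'='2');

INSERT INTO ice_demo VALUES (1, 'a'), (2, 'b');
UPDATE ice_demo SET name = 'c' WHERE id = 1;
DELETE FROM ice_demo WHERE id = 2;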
07-09-2024
11:49 PM
2 Kudos
Introduction
The Spark and Iceberg Supportability Matrix below provides a single reference for which Apache Spark versions each Apache Iceberg release supports, along with each release's date, status, and default Spark version.
Apache Iceberg History
The development of Iceberg was started in 2017 by Netflix. The project was open-sourced and donated to the Apache Software Foundation in November 2018. In May 2020, the Iceberg project graduated to become a top-level Apache project.
Apache Iceberg 0.7.0 was released on Oct 26, 2019 (Incubating).
Apache Iceberg 0.8.0 was released on May 07, 2020 (Incubating).
Apache Iceberg 0.9.0 was released on Jul 14, 2020.
Apache Iceberg 0.9.1 was released on Aug 11, 2020.
Apache Iceberg 0.10.0 was released on Nov 12, 2020.
Apache Iceberg 0.11.0 was released on Jan 27, 2021.
Apache Iceberg 0.11.1 was released on Apr 03, 2021.
Apache Iceberg 0.12.0 was released on Aug 15, 2021.
Apache Iceberg 0.12.1 was released on Nov 08, 2021.
Apache Iceberg 0.13.0 was released on Feb 04, 2022.
Apache Iceberg 0.13.1 was released on Feb 14, 2022.
Apache Iceberg 0.13.2 was released on Jun 15, 2022.
Apache Iceberg 0.14.0 was released on Jul 16, 2022.
Apache Iceberg 0.14.1 was released on Sep 12, 2022.
Apache Iceberg 1.0.0 was released on Nov 03, 2022.
Apache Iceberg 1.1.0 was released on Nov 28, 2022.
Apache Iceberg 1.2.0 was released on Mar 20, 2023.
Apache Iceberg 1.2.1 was released on Apr 11, 2023.
Apache Iceberg 1.3.0 was released on May 30, 2023.
Apache Iceberg 1.3.1 was released on Jul 25, 2023.
Apache Iceberg 1.4.0 was released on Oct 04, 2023.
Apache Iceberg 1.4.1 was released on Oct 23, 2023.
Apache Iceberg 1.4.2 was released on Nov 02, 2023.
Apache Iceberg 1.4.3 was released on Dec 27, 2023.
Apache Iceberg 1.5.0 was released on Mar 11, 2024.
Apache Iceberg 1.5.1 was released on Apr 25, 2024.
Apache Iceberg 1.5.2 was released on May 09, 2024.
Apache Spark and Iceberg Supportability Matrix Table
The following table lists each Iceberg version along with its release date, status, default Spark version, and supported Spark version(s):
Iceberg Version | Release Date | Status | Default Spark Version | Supported Spark Version(s)
---|---|---|---|---
0.7.0 | Oct 26, 2019 | Incubating | 2.4 | 2.4
0.8.0 | May 07, 2020 | Incubating | 2.4 | 2.4
0.9.0 | Jul 14, 2020 | | | 2.4, 3.0
0.9.1 | Aug 11, 2020 | | | 2.4, 3.0
0.10.0 | Nov 12, 2020 | | | 2.4, 3.0
0.11.0 | Jan 27, 2021 | | | 2.4, 3.0
0.11.1 | Apr 03, 2021 | | | 2.4, 3.0
0.12.0 | Aug 15, 2021 | | | 2.4, 3.0, 3.1
0.12.1 | Nov 08, 2021 | | | 2.4, 3.0, 3.1
0.13.0 | Feb 04, 2022 | | 3.2 | 2.4, 3.0, 3.1, 3.2
0.13.1 | Feb 14, 2022 | | 3.2 | 2.4, 3.0, 3.1, 3.2
0.13.2 | Jun 15, 2022 | | 3.2 | 2.4, 3.0, 3.1, 3.2
0.14.0 | Jul 16, 2022 | | 3.3 | 2.4, 3.0, 3.1, 3.2, 3.3
0.14.1 | Sep 12, 2022 | | 3.3 | 2.4, 3.0, 3.1, 3.2, 3.3
1.0.0 | Nov 03, 2022 | | 3.3 | 2.4, 3.0, 3.1, 3.2, 3.3
1.1.0 | Nov 28, 2022 | | 3.3 | 2.4, 3.1, 3.2, 3.3
1.2.0 | Mar 20, 2023 | | 3.3 | 2.4, 3.1, 3.2, 3.3
1.2.1 | Apr 11, 2023 | | 3.3 | 2.4, 3.1, 3.2, 3.3
1.3.0 | May 30, 2023 | | 3.4 | 3.1, 3.2, 3.3, 3.4
1.3.1 | Jul 25, 2023 | | 3.4 | 3.1, 3.2, 3.3, 3.4
1.4.0 | Oct 04, 2023 | | 3.5 | 3.2, 3.3, 3.4, 3.5
1.4.1 | Oct 23, 2023 | | 3.5 | 3.2, 3.3, 3.4, 3.5
1.4.2 | Nov 02, 2023 | | 3.5 | 3.2, 3.3, 3.4, 3.5
1.4.3 | Dec 27, 2023 | | 3.5 | 3.2, 3.3, 3.4, 3.5
1.5.0 | Mar 11, 2024 | | 3.5 | 3.3, 3.4, 3.5
1.5.1 | Apr 25, 2024 | | 3.5 | 3.3, 3.4, 3.5
1.5.2 | May 09, 2024 | | 3.5 | 3.3, 3.4, 3.5
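In practice, using a row of this matrix means picking the iceberg-spark-runtime artifact whose name matches your Spark (and Scala) version and whose version matches the Iceberg release. A minimal sketch for the last row (Spark 3.5 with Iceberg 1.5.2); adjust the coordinates to your own row:

spark-shell \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive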
References
- Iceberg Releases: https://iceberg.apache.org/releases/
- Apache Iceberg on GitHub: https://github.com/apache/iceberg
Thank you for taking the time to read this article. We hope you found it informative and helpful in deepening your understanding of the topic. If you have any questions or feedback, please feel free to contact me. Your support motivates us to continue creating valuable content, so if this article helped you, please consider giving it a like and kudos. We appreciate your support!
07-09-2024
11:26 PM
1 Kudo
Based on the event log files, you need to adjust the Spark History Server (SHS) settings. Could you please check whether SHS cleanup is enabled? When it is enabled, Spark automatically cleans up the old event log files. To load larger event log files, you also need to increase the History Server daemon memory (SPARK_DAEMON_MEMORY). You can refer to the following page for the SHS parameters: https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options
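A minimal sketch of the relevant settings (values are illustrative; the property names are documented on the page linked above):

# spark-defaults.conf on the History Server
spark.history.fs.cleaner.enabled=true
spark.history.fs.cleaner.interval=1d
spark.history.fs.cleaner.maxAge=7d

# spark-env.sh: heap size for the History Server daemon
export SPARK_DAEMON_MEMORY=4g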
07-09-2024
06:35 PM
Thanks, @RangaReddy. It solved my problem. 👏
06-17-2024
10:10 PM
Hi @RangaReddy , thank you very much for your response and suggestions. I tried the steps you recommended, and while they were helpful, I found that the issue was ultimately resolved by increasing the executor memory and setting spark.file.transferTo=false. I appreciate your assistance.
06-17-2024
07:30 AM
Hi @EFasdfSDfaSDFG From Hive, the following file formats are supported: Parquet (default), Avro, and ORC.

Create table examples:

CREATE EXTERNAL TABLE test_ice_1 (i INT, t TIMESTAMP, j BIGINT) STORED BY ICEBERG;

CREATE EXTERNAL TABLE test_ice_2 (i INT, t TIMESTAMP) PARTITIONED BY (j BIGINT) STORED BY ICEBERG;

CREATE EXTERNAL TABLE test_ice_3 (i INT) STORED AS ORC STORED BY ICEBERG LOCATION '';

CREATE EXTERNAL TABLE test_ice_4 (i INT) STORED BY ICEBERG TBLPROPERTIES ('key1'='value1', 'key2'='value2');

CREATE EXTERNAL TABLE test_ice_5 (i INT) STORED AS ORC STORED BY ICEBERG TBLPROPERTIES ('format-version' = '2');
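To confirm the file format a table actually uses after creation, a quick check with standard Hive commands (using one of the tables above):

SHOW CREATE TABLE test_ice_5;
DESCRIBE FORMATTED test_ice_5;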
05-16-2024
02:59 AM
2 Kudos
Yes, this is it. Thank you so much for the prompt response.
04-05-2024
12:12 AM
6 Kudos
Installing Spark3 and Livy3 on Cloudera Manager with CDS3 Parcel

[Image: Apache Spark logo]

1. Introduction

This guide outlines the steps for installing Apache Spark 3.x on your Cloudera cluster by leveraging Cloudera Manager (CM) and CDS3 parcels. Learn how to efficiently download, distribute, and activate the CDS3 parcel for a seamless Spark3 deployment, saving you time and effort compared to traditional methods. The guide also provides resources for troubleshooting potential issues encountered during the installation.

Note: This article mainly focuses on the scenario where the Cloudera Manager Server has Internet access.

2. CDP-Compatible CDS3 Parcel Details

CDS 3 Powered by Apache Spark is an add-on service for CDP Private Cloud Base, distributed as a parcel. The Custom Service Descriptor (CSD) files are already included in Cloudera Manager for CDP 7.1.8 and later; for CDP 7.1.7 they must be installed separately (see the table below). The CDS version label is constructed in the v.v.v.w.w.xxxx.y-z...z format and carries the following information:

- v.v.v - Apache Spark upstream version, for example, 3.3.2
- w.w - Cloudera internal version number, for example, 3.3
- xxxx - CDP version number, for example, 7190 (referring to CDP Private Cloud Base 7.1.9)
- y - maintenance version, for example, 0
- z...z - build number, for example, 91

The Spark3 base parcel location is https://archive.cloudera.com/p/spark3

CDP Version | CDS3 Version | Spark Version | Parcel Repository | CSD Installation Required?
---|---|---|---|---
7.1.9 | CDS 3.3 | 3.3.2.3.3.7190.0-91 | https://archive.cloudera.com/p/spark3/3.3.7190.0/parcels/ | No
7.1.8 | CDS 3.3 | 3.3.0.3.3.7180.0-274 | https://archive.cloudera.com/p/spark3/3.3.7180.0/parcels/ | No
7.1.7 SP2 | CDS 3.2.3 | 3.2.3.3.2.7172000.0-334 | https://archive.cloudera.com/p/spark3/3.2.7172000.0/parcels/ | Yes
7.1.7 SP1 | CDS 3.2.3 | 3.2.1.3.2.7171000.0-3 | https://archive.cloudera.com/p/spark3/3.2.7171000.0/parcels/ | Yes

Note: Ensure you install the latest parcel version, because parcel versions are updated frequently.

3. Prerequisites

- A CDP Private Cloud Base cluster, version 7.1.7 or above.
- Internet access on the Cloudera Manager server and cluster nodes for downloading the necessary dependencies.
- The Java and Python versions required by your Spark version.
- The Spark shuffle port opened in the firewall if the hosts use firewall restrictions. The shuffle port is configurable and defaults to 7447.

4. Installation Steps

The CDS 3 distribution consists of two components:

- Custom Service Descriptor (CSD) file: defines the configuration for managing a new service, typically provided as a JAR file.
- Parcel file: a binary distribution format containing the program files, along with additional metadata used by Cloudera Manager.

Installation depends on your CDP version:

- CDP versions before 7.1.8: you need to install the CSD file(s) and the parcel file separately.
- CDP versions 7.1.8 and later: the Spark3 and Livy for Spark3 CSD files are included directly within Cloudera Manager, so there is no need for separate external CSD files.

Step 1: Install the CSD (Custom Service Descriptor) files (required for CDP version 7.1.7 only)

1. Log on to the Cloudera Manager Server host and go to the location configured for service descriptor files. By default, the CSD location is /opt/cloudera/csd.

cd /opt/cloudera/csd

2. Download the CDS 3.2.3 service descriptor files. Note: replace the following values before running the wget commands:
- Replace the `username` and `password`.
- Replace the `csd_cdp_version`. For example, `3.2.7172000.0`.
- Replace the `spark3_csd_version`. For example, `3.2.3.3.2.7172000.0-334`.
- Replace the `livy3_csd_version`. For example, `0.6.3000.3.2.7172000.0-334`.

wget https://<username>:<password>@archive.cloudera.com/p/spark3/<csd_cdp_version>/csd/SPARK3_ON_YARN-<spark3_csd_version>.jar
wget https://<username>:<password>@archive.cloudera.com/p/spark3/<csd_cdp_version>/csd/LIVY_FOR_SPARK3-<livy3_csd_version>.jar

3. Set the file ownership of the service descriptors to cloudera-scm:cloudera-scm with permission 644.

chown cloudera-scm:cloudera-scm *
chmod 644 *

After changing the ownership, you should see output similar to the following:

-rw-r--r-- 1 cloudera-scm cloudera-scm 17216 Feb 10 2023 LIVY_FOR_SPARK3-0.6.3000.3.2.7172000.0-334.jar
-rw-r--r-- 1 cloudera-scm cloudera-scm 20227 Feb 10 2023 SPARK3_ON_YARN-3.2.3.3.2.7172000.0-334.jar

4. Restart the Cloudera Manager Server with the following command:

systemctl restart cloudera-scm-server

Step 2: Add the CDS Parcel Repository

1. Log in to the Cloudera Manager Admin Console and click Parcels from the left menu.
2. Click Parcel Repositories & Network Settings.
3. In the Remote Parcel Repository URLs section, click the + icon and enter the CDS3 parcel repository URL provided by Cloudera (see section 2, CDP-Compatible CDS3 Parcel Details).
4. Click Save & Verify Configuration. A message with the status of the verification appears above the Remote Parcel Repository URLs section. If the URL is not valid, check it and enter the correct URL. After the URL is verified, click Close.
5. Locate the row in the table that contains the new parcel, SPARK3, and click the Download button.
6. After the SPARK3 parcel is downloaded, click the Distribute button to distribute the parcel to all cluster nodes, and wait for the distribution to finish. Cloudera Manager displays the status of the distribution. By default, the parcel is downloaded and distributed to /opt/cloudera/parcels/.
7. After the SPARK3 parcel is distributed, click the Activate button to activate the parcel on all cluster nodes. When prompted, click OK.

The SPARK3 parcel status now shows Distributed, Activated, and a symbolic link named SPARK3 has been created in the /opt/cloudera/parcels/ directory.

Step 3: Add the Spark3 Service

1. Navigate to Clusters -> [Your Cluster Name] (for example, Cluster 1) -> click Actions or the More Options (ellipsis) icon, then click Add Service.
2. Select Spark 3, then click Continue.
3. Based on your requirements, select any optional dependencies (such as Atlas, HBase, Kafka, or Knox), or select No Optional Dependencies, then click Continue.
4. In Assign Roles, select the host where the History Server is to be added. You can add a Gateway role to every host. Review the changes, then click Continue.
5. On Command Details, select run options, confirm success, then click Continue.
6. On Summary, click Finish. The Spark 3 service appears in the Cloudera Manager cluster components list.
7. If Spark 3 was not started by the installation wizard, start the service by clicking Actions > Start in the Spark 3 service.
8. Click the stale configuration icon to launch the Stale Configuration wizard and restart all services with stale configurations.

Step 4: Verify the Spark3 Installation

1. Navigate to Clusters -> [Your Cluster Name] (for example, Cluster 1).
2. Locate the Spark 3 service in the list of services.
3. Verify that the Spark 3 service is started and healthy.
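You can also run a quick sanity check from any gateway host. A minimal sketch (assumes the default parcel location and a deployed client configuration):

# Confirm the activated parcel symlink exists
ls -ld /opt/cloudera/parcels/SPARK3

# Confirm the Spark 3 client tools resolve and report the expected version
spark3-submit --version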
Step 5: Add the Livy3 Service

1. Navigate to Clusters -> [Your Cluster Name] (for example, Cluster 1) -> click Actions or the More Options (ellipsis) icon, then click Add Service.
2. Select Livy for Spark 3, then click Continue.
3. Based on your requirements, select any optional dependencies (such as Hive), or select No Optional Dependencies, then click Continue.
4. In Assign Roles, select the host where the Livy Server for Spark 3 is to be added. Adding the Gateway role is optional but recommended. Review the changes, then click Continue.
5. On Command Details, select run options, confirm success, then click Continue.
6. On Summary, click Finish. The Livy for Spark 3 service appears in the Cloudera Manager cluster components list.
7. If Livy for Spark 3 was not started by the installation wizard, start the service by clicking Actions > Start in the Livy for Spark 3 service.
8. Click the stale configuration icon to launch the Stale Configuration wizard and restart all services with stale configurations.

Step 6: Verify the Livy3 Installation

1. Navigate to Clusters -> [Your Cluster Name] (for example, Cluster 1).
2. Locate the Livy for Spark 3 service in the list of services.
3. Verify that the Livy for Spark 3 service is started and healthy.

5. Running the SparkPi example using the spark-examples.jar file

You can use the following sample SparkPi program to validate your Spark3 installation and explore how to run Spark3 jobs from the command line.

Running SparkPi in YARN Client Mode:

spark3-submit \
--master yarn \
--deploy-mode client \
--class org.apache.spark.examples.SparkPi \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/jars/spark-examples_2.12.jar 10

You will see output similar to the following in the console:

Pi is roughly 3.142279142279142

Running SparkPi in YARN Cluster Mode:

spark3-submit \
--master yarn \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/jars/spark-examples_2.12.jar 10

6. Setting Up and Configuring PySpark

A Python installation is mandatory for running any PySpark application. Before launching a PySpark application, ensure Python is installed and configured within the Spark environment. Python is typically required on every node where the PySpark application executes. While some operating systems come with Python pre-installed, others do not. It is crucial to verify that a Spark-supported Python version is installed at a consistent location on each node of your cluster. The following steps can be skipped if you have already installed a Spark-supported Python version that is compatible with your operating system.

Custom Python Configuration

Specify the Python binary to be used by the Spark driver and executors by setting the PYSPARK_PYTHON environment variable in spark-env.sh. You can also override the driver's Python binary path individually using the PYSPARK_DRIVER_PYTHON environment variable. These settings apply regardless of whether you use YARN client or cluster mode. Make sure to set the variables using the export statement. For example:

export PYSPARK_PYTHON=${PYSPARK_PYTHON:-<path_to_python_executable>}

Here are some example Python binary paths:

- Anaconda parcel: /opt/cloudera/parcels/Anaconda3/bin/python
- Virtual environment: /path/to/virtualenv/bin/python

If you are using YARN cluster mode, in addition to the above, set spark.yarn.appMasterEnv.PYSPARK_PYTHON and spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON in spark-defaults.conf (using the safety valve) to the same paths.

Setting the Custom Python Path

The following steps assume you have installed a Python version compatible with your Spark installation.

1. Navigate to Clusters -> [Your Cluster Name] (for example, Cluster 1) -> go to the Spark 3 service -> click the Configuration tab.
2. Search for Spark 3 Service Advanced Configuration Snippet (Safety Valve) for spark3-conf/spark-env.sh and add the following two parameters, replacing the python3 path with your own (for example, /usr/bin/python3):

export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/usr/bin/python3}
export PYSPARK_DRIVER_PYTHON=${PYSPARK_DRIVER_PYTHON:-/usr/bin/python3}

3. Search for Spark 3 Client Advanced Configuration Snippet (Safety Valve) for spark3-conf/spark-defaults.conf and add the following two parameters, replacing the python3 path with your own (for example, /usr/bin/python3):

spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3
spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=/usr/bin/python3

4. Enter a Reason for change, and click Save Changes to commit the changes.
5. Restart the Spark 3 service.
6. Deploy the client configuration.
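After restarting, you can confirm that the configured Python is actually picked up. A quick, optional check from a pyspark3 shell (sc is the SparkContext the shell predefines):

import sys
# Python used by the driver
print(sys.version)
# Python used by an executor
print(sc.parallelize([0], 1).map(lambda _: sys.version).collect())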
7. Running the PySpark SparkPi example using the pi.py file

You can use the following sample PySpark SparkPi program to validate your Spark3 installation and explore how to run PySpark jobs from the command line.

Running PySpark SparkPi in YARN Client Mode:

spark3-submit \
--master yarn \
--deploy-mode client \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/src/main/python/pi.py

You will see output similar to the following in the console:

Pi is roughly 3.132920

Running PySpark SparkPi in YARN Cluster Mode:

spark3-submit \
--master yarn \
--deploy-mode cluster \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/src/main/python/pi.py

8. Reference(s)

- Cloudera Downloads - Spark3
- Cloudera Documentation - Spark 3 Overview
- Cloudera Documentation - Running Spark Applications - Spark Python
- Anaconda Documentation - Working with conda - Cloudera
04-03-2024
12:17 AM
I think I've found the cause of the problem; it's not related to the Spark version. Using the Java process analysis tool Arthas, I found that the AM startup process was blocked at the creation of the Timeline client. This appears to be because our Timeline service was using an embedded HBase service: when we configured the Timeline service to use our production environment's HBase, the problem disappeared.
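For anyone hitting the same symptom, the relevant knob is the one that points the Timeline service (ATSv2) at an external HBase. A sketch (property name per the Hadoop ATSv2 documentation; the hbase-site.xml path is illustrative):

<!-- yarn-site.xml -->
<property>
  <name>yarn.timeline-service.hbase.configuration.file</name>
  <value>file:/etc/hbase/conf/hbase-site.xml</value>
</property>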
03-06-2024
02:16 AM
1 Kudo
Hi @Sidhartha Could you please try the following sample:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Map a Hive type name to the corresponding Spark SQL DataType
def convertDatatype(datatype: String): DataType = {
  datatype match {
    case "string" => StringType
    case "short" => ShortType
    case "int" => IntegerType
    case "bigint" => LongType
    case "float" => FloatType
    case "double" => DoubleType
    case "decimal" => DecimalType(38, 30)
    case "date" => DateType
    case "boolean" => BooleanType
    case "timestamp" => TimestampType
    case other => throw new IllegalArgumentException(s"Unsupported datatype: $other")
  }
}

// Sample rows matching the schema below
val input_data = List(Row(1L, "Ranga", 27, BigDecimal(306.000000000000000000)), Row(2L, "Nishanth", 6, BigDecimal(606.000000000000000000)))
val input_rdd = spark.sparkContext.parallelize(input_data)

// Build a StructType from a comma-separated "name:type" column list
val hiveCols = "id:bigint,name:string,age:int,salary:decimal"
val schemaStructType = new StructType(hiveCols.split(",").map(_.split(":")).map(e => StructField(e(0), convertDatatype(e(1)), true)))

val myDF = spark.createDataFrame(input_rdd, schemaStructType)
myDF.printSchema()
myDF.show()

// Cast the decimal salary to double in a new column
val myDF2 = myDF.withColumn("new_salary", col("salary").cast("double"))
myDF2.printSchema()
myDF2.show()