Member since 06-02-2020 | 331 Posts | 63 Kudos Received | 49 Solutions
06-05-2024 02:03 AM | 2 Kudos
Hi @EFasdfSDfaSDFG Use the following syntax to create an Iceberg table using the Impala editor in Hue:
Impala:
CREATE TABLE IF NOT EXISTS ice_t (i int, s string, ts timestamp, d date) STORED BY ICEBERG;
INSERT INTO ice_t VALUES(1, 'Ranga', '2015-05-15 12:00:00', '2015-05-15');
select * from ice_t;
CREATE TABLE IF NOT EXISTS ice_ext (i int, s string, ts timestamp, d date)
PARTITIONED BY (state string) STORED BY ICEBERG;
INSERT INTO ice_ext SELECT 1, 'Ranga', '2015-05-15 12:00:00', '2015-05-15', 'Andhra';
select * from ice_ext;
CREATE TABLE ice_t2 ( i int, s string, ts timestamp, d date ) STORED AS ICEBERG LOCATION '/warehouse/tablespace/external/hive/ice_t2';
INSERT INTO ice_t2 VALUES(1, 'Ranga', '2015-05-15 12:00:00', '2015-05-15');
select * from ice_t2;
05-15-2024 11:53 PM
Hi @Bartlomiej You can use the livy.rsc.sql.num-rows parameter to adjust the number of rows to display. In Cloudera Manager, go to Livy for Spark3, search for "Livy Server for Spark 3 Advanced Configuration Snippet (Safety Valve) for livy3-conf/livy.conf", and add the parameter with the required value:
livy.rsc.sql.num-rows=5000
I created a test_3000 table by inserting records 1 to 2999. When you run the query from the Hue Spark3 editor, you can see the following output:
04-21-2024 11:04 PM | 1 Kudo
Hi @nagababu There are a couple of issues reported for similar behaviour: SPARK-34790, SPARK-18105, and SPARK-32658. You can try the following:
Step 1: Change the compression codec and rerun the application. For example:
spark.io.compression.codec=snappy
Step 2: If Step 1 does not resolve the issue, set spark.file.transferTo=false and rerun the application.
Step 3: Set the following parameter and rerun the application:
--conf spark.sql.adaptive.fetchShuffleBlocksInBatch=false
Step 4: If none of the above steps resolves the issue, try setting the following parameters to true or false and rerun the application:
spark.network.crypto.enabled=true
spark.authenticate=true
spark.io.encryption.enabled=true
Step 5: If none of the above steps resolves the issue, you need to tune the shuffle operation.
04-05-2024 12:12 AM | 5 Kudos
Installing Spark3 and Livy3 on Cloudera Manager with CDS3 Parcel
1. Introduction
This guide outlines the steps for installing Apache Spark 3.x on your Cloudera cluster by leveraging Cloudera Manager (CM) and CDS3 parcels. Learn how to efficiently download, distribute, and activate the CDS3 parcel for a seamless Spark3 deployment, saving you time and effort compared to traditional methods. Additionally, the guide provides resources for troubleshooting any potential issues encountered during the installation.
Note: This article mainly covers the case where the Cloudera Manager Server has Internet access.
2. CDP-compatible CDS3 Parcels Details
CDS 3 Powered by Apache Spark is an add-on service for CDP Private Cloud Base, distributed as a parcel. The Custom Service Descriptor (CSD) file is available in Cloudera Manager for CDP 7.1.X.
The CDS version label is constructed in v.v.v.w.w.xxxx.y-z...z format and carries the following information:
v.v.v - Apache Spark upstream version, for example, 3.3.2
w.w - Cloudera internal version number, for example, 3.3
xxxx - CDP version number, for example, 7190 (referring to CDP Private Cloud Base 7.1.9)
y - maintenance version, for example, 0
z...z - build number, for example, 91
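As a minimal sketch of the label format described above, the following Python snippet splits a CDS version label into its five parts. The `CdsVersion` type, the `parse_cds_version` function, and the regular expression are hypothetical names introduced for illustration; they are not part of any Cloudera tooling.

```python
import re
from typing import NamedTuple


class CdsVersion(NamedTuple):
    spark: str        # v.v.v  - Apache Spark upstream version
    cloudera: str     # w.w    - Cloudera internal version number
    cdp: str          # xxxx   - CDP version number
    maintenance: str  # y      - maintenance version
    build: str        # z...z  - build number


# Assumed pattern, derived only from the v.v.v.w.w.xxxx.y-z...z description.
_CDS_LABEL = re.compile(r"^(\d+\.\d+\.\d+)\.(\d+\.\d+)\.(\d+)\.(\d+)-(\d+)$")


def parse_cds_version(label: str) -> CdsVersion:
    """Split a CDS version label into its documented components."""
    match = _CDS_LABEL.match(label)
    if match is None:
        raise ValueError(f"not a CDS version label: {label!r}")
    return CdsVersion(*match.groups())


if __name__ == "__main__":
    print(parse_cds_version("3.3.2.3.3.7190.0-91"))
```

Keeping every field as a string avoids losing leading zeros or long CDP numbers such as 7172000.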
Spark3 Base Parcel Location is https://archive.cloudera.com/p/spark3
CDP Version | CDS3 Version | Spark Version | Parcel Repository | CSD Installation Required?
---|---|---|---|---
7.1.9 | CDS 3.3 | 3.3.2.3.3.7190.0-91 | https://archive.cloudera.com/p/spark3/3.3.7190.0/parcels/ | No
7.1.8 | CDS 3.3 | 3.3.0.3.3.7180.0-274 | https://archive.cloudera.com/p/spark3/3.3.7180.0/parcels/ | No
7.1.7 SP2 | CDS 3.2.3 | 3.2.3.3.2.7172000.0-334 | https://archive.cloudera.com/p/spark3/3.2.7172000.0/parcels/ | Yes
7.1.7 SP1 | CDS 3.2.3 | 3.2.1.3.2.7171000.0-3 | https://archive.cloudera.com/p/spark3/3.2.7171000.0/parcels/ | Yes
Note: Install the latest parcel version, because parcel versions are updated frequently.
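In the entries listed in this section, the parcel repository path appears to be the w.w.xxxx.y part of the Spark version label. The sketch below derives the repository URL on that assumption. The `parcel_repo_url` function is a hypothetical helper, and the pattern is an observation from the listed versions only, so always confirm the URL against the Cloudera archive.

```python
import re

# Base location stated in this article.
BASE_URL = "https://archive.cloudera.com/p/spark3"

# Assumption: the repository directory equals the w.w.xxxx.y part of the label.
_LABEL = re.compile(r"^\d+\.\d+\.\d+\.(\d+\.\d+\.\d+\.\d+)-\d+$")


def parcel_repo_url(spark_version_label: str) -> str:
    """Build the parcel repository URL from a full CDS version label."""
    match = _LABEL.match(spark_version_label)
    if match is None:
        raise ValueError(f"not a CDS version label: {spark_version_label!r}")
    return f"{BASE_URL}/{match.group(1)}/parcels/"


if __name__ == "__main__":
    print(parcel_repo_url("3.3.2.3.3.7190.0-91"))
```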
3. Prerequisites
CDP Private Cloud Base cluster with version 7.1.7 and above
Prepare your Cloudera Manager server and cluster nodes with internet access for downloading necessary dependencies.
Based on the Spark version, you need to install the required Java version and Python version.
4. Installation Steps
The CDS 3 parcel consists of two components:
Custom Service Descriptor (CSD) file: A CSD file defines the configuration for managing a new service and it is typically provided as a JAR file.
Parcel file: A parcel is a binary distribution format containing the program files, along with additional metadata used by Cloudera Manager.
Installation depends on your CDP version:
CDP versions before 7.1.8: You need to install the CSD file(s) and Parcel file separately.
CDP versions 7.1.8 and later: The Spark3 and Livy for Spark3 CSD files are included directly within Cloudera Manager, so no separate external CSD files are needed for these components.
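The version rule above can be expressed as a small check. This is only a sketch: `needs_external_csd` is a hypothetical helper name, and it assumes the CDP version string starts with three dot-separated numbers (a suffix such as "SP2" is ignored).

```python
import re


def needs_external_csd(cdp_version: str) -> bool:
    """Return True when the CSD files must be installed separately.

    Per the rule above, CDP versions before 7.1.8 need external CSD files.
    """
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", cdp_version.strip())
    if match is None:
        raise ValueError(f"unrecognised CDP version: {cdp_version!r}")
    # Compare as a tuple of integers so that 7.1.10 sorts after 7.1.8.
    return tuple(int(part) for part in match.groups()) < (7, 1, 8)


if __name__ == "__main__":
    for version in ("7.1.7 SP2", "7.1.8", "7.1.9"):
        print(version, needs_external_csd(version))
```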
Step 1: Install CSD (Custom Service Descriptor) files. (Required for CDP version 7.1.7 only)
Log on to the Cloudera Manager Server host and go to the location configured for service descriptor files. By default, the CSD location is /opt/cloudera/csd.
cd /opt/cloudera/csd
Download the CDS 3.2.3 service descriptor files. Note: Replace the following values before running the wget commands:
Replace the `username` and `password`.
Replace the `csd_cdp_version`. For example, `3.2.7172000.0`.
Replace the `spark3_csd_version`. For example, `3.2.3.3.2.7172000.0-334`.
Replace the `livy3_csd_version`. For example, `0.6.3000.3.2.7172000.0-334`.
wget https://<username>:<password>@archive.cloudera.com/p/spark3/<csd_cdp_version>/csd/SPARK3_ON_YARN-<spark3_csd_version>.jar
wget https://<username>:<password>@archive.cloudera.com/p/spark3/<csd_cdp_version>/csd/LIVY_FOR_SPARK3-<livy3_csd_version>.jar
Set the file ownership of the service descriptors to cloudera-scm:cloudera-scm with permission 644.
chown cloudera-scm:cloudera-scm *
chmod 644 *
After changing the ownership, you should see output similar to the following:
-rw-r--r-- 1 cloudera-scm cloudera-scm 17216 Feb 10 2023 LIVY_FOR_SPARK3-0.6.3000.3.2.7172000.0-334.jar
-rw-r--r-- 1 cloudera-scm cloudera-scm 20227 Feb 10 2023 SPARK3_ON_YARN-3.2.3.3.2.7172000.0-334.jar
Restart the Cloudera Manager Server with the following command:
systemctl restart cloudera-scm-server
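To reduce copy-and-paste mistakes when filling in the placeholders from Step 1, the download URLs can be assembled programmatically. The sketch below only builds the two URLs; `csd_urls` is a hypothetical helper, and the credentials are deliberately left out so they can be supplied separately at download time.

```python
# Base location stated in this article.
BASE_URL = "https://archive.cloudera.com/p/spark3"


def csd_urls(csd_cdp_version: str, spark3_csd_version: str, livy3_csd_version: str) -> list:
    """Return the Spark3 and Livy3 CSD jar URLs, without credentials."""
    csd_dir = f"{BASE_URL}/{csd_cdp_version}/csd"
    return [
        f"{csd_dir}/SPARK3_ON_YARN-{spark3_csd_version}.jar",
        f"{csd_dir}/LIVY_FOR_SPARK3-{livy3_csd_version}.jar",
    ]


if __name__ == "__main__":
    # Example values taken from the placeholder list in Step 1.
    for url in csd_urls("3.2.7172000.0",
                        "3.2.3.3.2.7172000.0-334",
                        "0.6.3000.3.2.7172000.0-334"):
        print(url)
```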
Step 2. Add the CDS Parcel Repository
Log in to the Cloudera Manager Admin Console and Click Parcels from the left menu.
Click Parcel Repositories & Network Settings.
In the Remote Parcel Repository URLs section, click the + icon.
Enter the CDS3 parcel repository URL provided by Cloudera (See 2. CDP compatible CDS3 Parcels Details section)
Click Save & Verify Configuration. A message with the status of the verification appears above the Remote Parcel Repository URLs section. If the URL is not valid, check the URL and enter the correct URL.
After the URL is verified, click Close.
Locate the row in the table that contains the new Cloudera Runtime parcel i.e. SPARK3 and click the Download button.
After the SPARK3 parcel is downloaded, click the Distribute button to distribute the parcel to all the cluster nodes. Wait for the parcel to be distributed. Cloudera Manager displays the status of the Cloudera Runtime parcel distribution. By default, Spark3 parcel will be downloaded and distributed to /opt/cloudera/parcels/ location.
After the SPARK3 parcel is distributed, click the Activate button to activate the parcel on all the cluster nodes.
When prompted, click on OK.
The SPARK3 parcel status is now Distributed, Activated.
A symbolic link named SPARK3 has now been created in the /opt/cloudera/parcels/ directory.
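One way to confirm the activation on a cluster node is to check that the SPARK3 symbolic link exists and see where it points. The following is a small sketch; `parcel_link_target` is a hypothetical helper, and the default directory is the /opt/cloudera/parcels location mentioned above.

```python
import os


def parcel_link_target(parcels_dir: str = "/opt/cloudera/parcels", name: str = "SPARK3"):
    """Return the resolved target of the parcel symlink, or None if it is missing."""
    link = os.path.join(parcels_dir, name)
    if not os.path.islink(link):
        return None
    return os.path.realpath(link)


if __name__ == "__main__":
    target = parcel_link_target()
    if target is None:
        print("SPARK3 symlink not found; the parcel may not be activated on this node")
    else:
        print(f"SPARK3 -> {target}")
```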
Step 3. Add Spark3 Service
Navigate to Clusters -> [Your Cluster Name]. For example Cluster 1 -> Click on Actions or More Options (ellipsis icon), then click Add Service.
Select Spark 3, then click Continue.
Based on your requirements you can select any Optional Dependencies, such as Atlas, HBase, Kafka, Knox, or select No Optional Dependencies, then click Continue.
On Assign Roles, select the host where the History Server is to be added. You can add a gateway role to every host.
Review Changes then click Continue.
On Command Details, select run options, confirm success, then click Continue.
On Summary, click Finish.
The Spark 3 service appears in the Cloudera Manager cluster components list. If Spark 3 was not started by the installation wizard, you can start the service by clicking Actions > Start in the Spark 3 service.
Click the stale configuration icon to launch the Stale Configuration wizard and restart the necessary services.
Restart all services with stale configurations.
Step 4. Verify Spark3 Installation
Navigate to Clusters -> [Your Cluster Name]. For example Cluster 1
Verify Spark 3 service from the list of services.
Verify that the Spark 3 service is started and healthy.
Step 5. Add Livy3 Service
Navigate to Clusters -> [Your Cluster Name]. For example Cluster 1 -> click on Actions or More Options (ellipsis icon), then click Add Service.
Select Livy for Spark 3, then click Continue.
Based on your requirements you can select any Optional Dependencies, such as Hive or select No Optional Dependencies, then click Continue.
On Assign Roles, select the host where Livy Server for Spark 3 is to be added. Adding the Gateway role is optional but recommended.
Review Changes then click Continue.
On Command Details, select run options, confirm success, then click Continue.
On Summary, click Finish.
The Livy for Spark 3 service appears in the Cloudera Manager cluster components list. If Livy for Spark 3 was not started by the installation wizard, you can start the service by clicking Actions > Start in the Livy for Spark 3 service.
Click the stale configuration icon to launch the Stale Configuration wizard and restart the necessary services.
Restart all services with stale configurations.
Step 6. Verify Livy3 Installation
Navigate to Clusters -> [Your Cluster Name]. For example Cluster 1
Verify Livy for Spark 3 service from the list of services.
Verify that the Livy for Spark 3 service is started and healthy.
5. Running SparkPi example using spark-examples.jar file
You can use the following sample Spark Pi program to validate your Spark3 installation and explore how to run Spark3 jobs from the command line.
Running SparkPi in YARN Client Mode:
spark3-submit \
--master yarn \
--deploy-mode client \
--class org.apache.spark.examples.SparkPi \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/jars/spark-examples_2.12.jar 10
You will see output similar to the following in the console:
Pi is roughly 3.142279142279142
Running SparkPi in YARN Cluster Mode:
spark3-submit \
--master yarn \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/jars/spark-examples_2.12.jar 10
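If you want to script this validation, the two commands above differ only in the deploy mode, so they can be generated from one function. This is a sketch under that assumption; `spark_pi_command` is a hypothetical helper that only builds the argument list and does not run anything.

```python
import shlex

# Jar path used in the commands above.
EXAMPLES_JAR = ("/opt/cloudera/parcels/SPARK3/lib/spark3/examples/jars/"
                "spark-examples_2.12.jar")


def spark_pi_command(deploy_mode: str, partitions: int = 10) -> list:
    """Build the spark3-submit argument list for the SparkPi example."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unsupported deploy mode: {deploy_mode!r}")
    return [
        "spark3-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--class", "org.apache.spark.examples.SparkPi",
        EXAMPLES_JAR,
        str(partitions),
    ]


if __name__ == "__main__":
    for mode in ("client", "cluster"):
        # Print a copy-pasteable command; pass the list to subprocess.run to execute it.
        print(shlex.join(spark_pi_command(mode)))
```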
6. Setting Up and Configuring PySpark
Python installation is mandatory for running any PySpark application. Before launching a PySpark application, ensure Python is installed and configured within the Spark environment. Python installation is typically required on each node where the PySpark application executes.
While some operating systems come with Python pre-installed, others do not. It's crucial to verify that a Spark-supported Python version is installed on each node, at a consistent location across your cluster.
The following step(s) can be skipped if you've already installed a Spark-supported Python version that's compatible with your operating system.
Custom Python library Configuration
Specify the Python binary to be used by the Spark driver and executors by setting the PYSPARK_PYTHON environment variable in spark-env.sh. We can also override the driver Python binary path individually using the PYSPARK_DRIVER_PYTHON environment variable. These settings apply regardless of whether you are using yarn client or cluster mode.
Make sure to set the variables using the export statement. For example:
export PYSPARK_PYTHON=${PYSPARK_PYTHON:-<path_to_python_executable>}
Here are some example Python binary paths:
Anaconda parcel: /opt/cloudera/parcels/Anaconda3/bin/python
Virtual environment: /path/to/virtualenv/bin/python
If you are using yarn cluster mode, in addition to the above, set spark.yarn.appMasterEnv.PYSPARK_PYTHON and spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON in spark-defaults.conf (using the safety valve) to the same paths.
Setting the Custom Python Path steps
The following steps assume you have installed a Python version compatible with your Spark installation.
Navigate to Clusters -> [Your Cluster Name]. For example Cluster 1 -> Go to the Spark 3 service -> Click the Configuration tab.
Search for Spark 3 Service Advanced Configuration Snippet (Safety Valve) for spark3-conf/spark-env.sh and add the following two parameters, replacing /usr/bin/python3 with your python3 location:
export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/usr/bin/python3}
export PYSPARK_DRIVER_PYTHON=${PYSPARK_DRIVER_PYTHON:-/usr/bin/python3}
Search for Spark 3 Client Advanced Configuration Snippet (Safety Valve) for spark3-conf/spark-defaults.conf and add the following two parameters, replacing /usr/bin/python3 with your python3 location:
spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3
spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=/usr/bin/python3
Enter a Reason for change, and click Save Changes to commit the changes.
Restart the Spark 3 service.
Deploy the client configuration.
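Because the same Python path has to be entered in two safety valves, it can help to generate both snippets from a single value. The sketch below does only that; `pyspark_python_settings` is a hypothetical helper, and it produces exactly the four settings shown in the steps above.

```python
def pyspark_python_settings(python_path: str) -> dict:
    """Return the safety-valve snippets for spark-env.sh and spark-defaults.conf."""
    spark_env = "\n".join([
        # ${VAR:-default} keeps an already exported value and otherwise uses the default.
        f"export PYSPARK_PYTHON=${{PYSPARK_PYTHON:-{python_path}}}",
        f"export PYSPARK_DRIVER_PYTHON=${{PYSPARK_DRIVER_PYTHON:-{python_path}}}",
    ])
    spark_defaults = "\n".join([
        f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={python_path}",
        f"spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON={python_path}",
    ])
    return {"spark-env.sh": spark_env, "spark-defaults.conf": spark_defaults}


if __name__ == "__main__":
    for file_name, snippet in pyspark_python_settings("/usr/bin/python3").items():
        print(f"# {file_name}")
        print(snippet)
```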
7. Running PySpark SparkPi Example using pi.py file
You can use the following sample PySpark SparkPi program to validate your Spark3 installation and explore how to run Spark3 jobs from the command line.
Running PySpark SparkPi in YARN Client Mode:
spark3-submit \
--master yarn \
--deploy-mode client \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/src/main/python/pi.py
You will see output similar to the following in the console:
Pi is roughly 3.132920
Running PySpark SparkPi in YARN Cluster Mode:
spark3-submit \
--master yarn \
--deploy-mode cluster \
/opt/cloudera/parcels/SPARK3/lib/spark3/examples/src/main/python/pi.py
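The SparkPi example estimates Pi by random sampling, which is why the printed value is only "roughly" Pi and changes between runs. The following plain-Python sketch shows the same idea without Spark; `estimate_pi` is a hypothetical function name, and the real example distributes the sampling across partitions instead of using a single loop.

```python
import random


def estimate_pi(samples: int, seed=None) -> float:
    """Estimate Pi from the share of random points that fall inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        # Pick a point uniformly in the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Circle area / square area = Pi / 4, so scale the ratio by 4.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print(f"Pi is roughly {estimate_pi(200_000):f}")
```

More samples give a tighter estimate, which is the effect of passing a larger partition count to the example.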
04-02-2024 11:03 PM
Hi @Meepoljd Sorry to inform you that we do not support Spark3 installation on HDP. To use Spark3, you need a CDP/CDE cluster.
03-26-2024 06:20 AM
You can find examples in the following GitHub repository: https://github.com/rangareddy/spark-python-test-using-conda/tree/main/python-compatibility-test/spark_component/pyspark_examples
03-25-2024 10:34 PM
Hi @Leonm We have already published the Spark-supported Python versions in the following article: https://community.cloudera.com/t5/Community-Articles/Spark-Python-Supportability-Matrix/ta-p/379144 Please let me know if you still need a PySpark UDF example for testing.
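03-06-2024 02:16 AM | 1 Kudo
Hi @Sidhartha Could you please try the following sample:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Map a Hive column type name to the corresponding Spark SQL DataType.
def convertDatatype(datatype: String): DataType = datatype match {
  case "string" => StringType
  case "short" => ShortType
  case "int" => IntegerType
  case "bigint" => LongType
  case "float" => FloatType
  case "double" => DoubleType
  case "decimal" => DecimalType(38, 30)
  case "date" => DateType // the original sample mapped "date" to TimestampType
  case "boolean" => BooleanType
  case "timestamp" => TimestampType
  // Fail with a clear message instead of a MatchError for unknown types.
  case other => throw new IllegalArgumentException(s"Unsupported datatype: $other")
}

// Sample rows matching the schema id:bigint, name:string, age:int, salary:decimal.
val input_data = List(Row(1L, "Ranga", 27, BigDecimal(306.000000000000000000)), Row(2L, "Nishanth", 6, BigDecimal(606.000000000000000000)))
val input_rdd = spark.sparkContext.parallelize(input_data)
val hiveCols = "id:bigint,name:string,age:int,salary:decimal"
val schemaList = hiveCols.split(",")
// Build the StructType from the "name:type" pairs.
val schemaStructType = new StructType(schemaList.map(col => col.split(":")).map(e => StructField(e(0), convertDatatype(e(1)), true)))
val myDF = spark.createDataFrame(input_rdd, schemaStructType)
myDF.printSchema()
myDF.show()
// Cast the decimal column to double in a new column.
val myDF2 = myDF.withColumn("new_salary", col("salary").cast("double"))
myDF2.printSchema()
myDF2.show()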
02-27-2024 10:10 PM
Spark Scala Version Compatibility Matrix
1. Introduction
Apache Spark, a widely used framework for big data processing, relies heavily on Scala as its primary programming language. Ensuring compatibility between Spark and Scala versions is essential for developers who want to use the latest features and optimizations while keeping their Spark applications stable.
In this article, we'll provide a comprehensive overview of the compatibility matrix between different versions of Spark and Scala, helping developers choose the right combination for their projects.
Key Considerations:
Several factors influence compatibility between Spark and Scala versions:
API changes: New features or modifications in Spark APIs might require specific Scala versions for proper compilation and execution.
Library dependencies: Third-party libraries used within your Spark application might have compatibility requirements with both Spark and Scala versions.
Community support: Older Spark versions might have limited community support and resources, impacting problem-solving and maintenance.
2. Spark Scala Version Compatibility Matrix
Here's a compatibility matrix for Spark and Scala versions:
Spark Version
Supported Scala Binary Version(s)
Cloudera Supported Binary Version(s)
Scala 2.11
Scala 2.12
Scala 2.13
3.5.0
2.12/2.13
2.12
2.12.18
2.13.8
3.4.2
2.12/2.13
2.12
2.12.17
2.13.8
3.4.1
2.12/2.13
2.12
2.12.17
2.13.8
3.4.0
2.12/2.13
2.12
2.12.17
2.13.8
3.3.4
2.12/2.13
2.12
2.12.15
2.13.8
3.3.3
2.12/2.13
2.12
2.12.15
2.13.8
3.3.2
2.12/2.13
2.12
2.12.15
2.13.8
3.3.1
2.12/2.13
2.12
2.12.15
2.13.8
3.3.0
2.12/2.13
2.12
2.12.15
2.13.8
3.2.4
2.12/2.13
2.12
2.12.15
2.13.5
3.2.3
2.12/2.13
2.12
2.12.15
2.13.5
3.2.2
2.12/2.13
2.12
2.12.15
2.13.5
3.2.1
2.12/2.13
2.12
2.12.15
2.13.5
3.2.0
2.12/2.13
2.12
2.12.15
2.13.5
3.1.3
2.12
2.12
2.12.10
3.1.2
2.12
2.12
2.12.10
3.1.1
2.12
2.12
2.12.10
3.0.3
2.12
2.12
2.12.10
3.0.2
2.12
2.12
2.12.10
3.0.1
2.12
2.12
2.12.10
3.0.0
2.12
2.12
2.12.10
2.4.8
2.11/2.12
2.11
2.11.12
2.12.10
2.4.7
2.11/2.12
2.11
2.11.12
2.12.10
2.4.6
2.11/2.12
2.11
2.11.12
2.12.10
2.4.5
2.11/2.12
2.11
2.11.12
2.12.10
2.4.4
2.11/2.12
2.11
2.11.12
2.12.8
2.4.3
2.11/2.12
2.11
2.11.12
2.12.8
2.4.2
2.11/2.12
2.11
2.11.12
2.12.8
2.4.1
2.11/2.12
2.11
2.11.12
2.12.8
2.4.0
2.11/2.12
2.11
2.11.12
2.12.7
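The matrix follows a simple pattern per Spark minor release, which the sketch below encodes as a lookup. `supported_scala_binaries` is a hypothetical helper that covers only the Spark 2.4.x through 3.5.x releases listed in this article; anything else raises an error rather than guessing.

```python
def supported_scala_binaries(spark_version: str) -> list:
    """Return the Scala binary versions listed for a Spark release in the matrix."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if (major, minor) == (2, 4):
        return ["2.11", "2.12"]
    if (major, minor) in ((3, 0), (3, 1)):
        return ["2.12"]
    if major == 3 and 2 <= minor <= 5:
        return ["2.12", "2.13"]
    raise ValueError(f"Spark {spark_version} is not covered by the matrix")


if __name__ == "__main__":
    for version in ("2.4.8", "3.1.3", "3.3.2", "3.5.0"):
        print(version, supported_scala_binaries(version))
```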
3. JDK & Scala compatibility
Minimum Scala versions:
JDK Version | Scala 3 | Scala 2.13 | Scala 2.12 | Scala 2.11
---|---|---|---|---
22 (ea) | 3.3.2 | 2.13.12 | 2.12.19 |
21 (LTS) | 3.3.1 | 2.13.11 | 2.12.18 |
20 | 3.3.0 | 2.13.11 | 2.12.18 |
19 | 3.2.0 | 2.13.9 | 2.12.16 |
18 | 3.1.3 | 2.13.7 | 2.12.15 |
17 (LTS) | 3.0.0 | 2.13.6 | 2.12.15 |
11 (LTS) | 3.0.0 | 2.13.0 | 2.12.4 | 2.11.12
8 (LTS) | 3.0.0 | 2.13.0 | 2.12.0 | 2.11.0
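As a rough sketch of how the minimum versions above can be used, the following check compares a Scala version against the minimum listed for a JDK. `MIN_SCALA` and `scala_supports_jdk` are hypothetical names; the data is copied from the values in this section, and a missing entry is treated as "not supported".

```python
# Minimum Scala version per JDK and Scala series, as listed in this section.
MIN_SCALA = {
    22: {"3": "3.3.2", "2.13": "2.13.12", "2.12": "2.12.19"},
    21: {"3": "3.3.1", "2.13": "2.13.11", "2.12": "2.12.18"},
    20: {"3": "3.3.0", "2.13": "2.13.11", "2.12": "2.12.18"},
    19: {"3": "3.2.0", "2.13": "2.13.9", "2.12": "2.12.16"},
    18: {"3": "3.1.3", "2.13": "2.13.7", "2.12": "2.12.15"},
    17: {"3": "3.0.0", "2.13": "2.13.6", "2.12": "2.12.15"},
    11: {"3": "3.0.0", "2.13": "2.13.0", "2.12": "2.12.4", "2.11": "2.11.12"},
    8: {"3": "3.0.0", "2.13": "2.13.0", "2.12": "2.12.0", "2.11": "2.11.0"},
}


def _as_tuple(version: str) -> tuple:
    """Turn '2.12.15' into (2, 12, 15) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def scala_supports_jdk(scala_version: str, jdk: int) -> bool:
    """Return True if scala_version meets the listed minimum for the given JDK."""
    parts = scala_version.split(".")
    series = "3" if parts[0] == "3" else ".".join(parts[:2])
    minimum = MIN_SCALA[jdk].get(series)
    return minimum is not None and _as_tuple(scala_version) >= _as_tuple(minimum)


if __name__ == "__main__":
    print(scala_supports_jdk("2.12.15", 17))
    print(scala_supports_jdk("2.12.15", 21))
```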
4. Conclusion
Ensuring the compatibility between Spark and Scala versions is crucial for the successful development and deployment of Spark applications. By referring to this compatibility matrix, developers can make informed decisions regarding the selection of Spark and Scala versions based on their project requirements and constraints.
Thank you for taking the time to read this article. We hope you found it informative and helpful in enhancing your understanding of the topic. If you have any questions or feedback, please feel free to reach out to me. Remember, your support motivates us to continue creating valuable content. If this article helped you, please consider giving it a like and providing a kudos. We appreciate your support!
02-25-2024 10:06 PM
https://community.cloudera.com/t5/Community-Articles/Spark-and-Java-versions-Supportability-Matrix/ta-p/383669