Member since
06-02-2020
292
Posts
47
Kudos Received
42
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 484 | 02-05-2024 08:58 PM
 | 267 | 02-04-2024 07:46 AM
 | 124 | 02-04-2024 07:44 AM
 | 351 | 01-02-2024 03:38 AM
 | 387 | 12-19-2023 03:23 AM
03-06-2024
02:16 AM
1 Kudo
Hi @Sidhartha Could you please try the following sample?

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Map a Hive type name to the corresponding Spark SQL DataType.
def convertDatatype(datatype: String): DataType = {
  datatype match {
    case "string"    => StringType
    case "short"     => ShortType
    case "int"       => IntegerType
    case "bigint"    => LongType
    case "float"     => FloatType
    case "double"    => DoubleType
    case "decimal"   => DecimalType(38, 30)
    case "date"      => DateType
    case "boolean"   => BooleanType
    case "timestamp" => TimestampType
    case other       => throw new IllegalArgumentException(s"Unsupported datatype: $other")
  }
}

val input_data = List(
  Row(1L, "Ranga", 27, BigDecimal(306.000000000000000000)),
  Row(2L, "Nishanth", 6, BigDecimal(606.000000000000000000))
)
val input_rdd = spark.sparkContext.parallelize(input_data)

// Build a StructType from a "name:type" column specification.
val hiveCols = "id:bigint,name:string,age:int,salary:decimal"
val schemaList = hiveCols.split(",")
val schemaStructType = new StructType(
  schemaList.map(_.split(":")).map(e => StructField(e(0), convertDatatype(e(1)), nullable = true))
)

val myDF = spark.createDataFrame(input_rdd, schemaStructType)
myDF.printSchema()
myDF.show()

val myDF2 = myDF.withColumn("new_salary", col("salary").cast("double"))
myDF2.printSchema()
myDF2.show()
02-27-2024
10:10 PM
Spark Scala Version Compatibility Matrix
1. Introduction
Apache Spark, a widely used framework for big data processing, relies heavily on Scala as its primary programming language. Ensuring compatibility between Spark and Scala versions is essential for developers who want to leverage the latest features and optimizations while maintaining stability in their Spark applications.
In this article, we'll provide a comprehensive overview of the compatibility matrix between different versions of Spark and Scala, helping developers choose the right combination for their projects.
Key Considerations:
Several factors influence compatibility between Spark and Scala versions:
API changes: New features or modifications in Spark APIs might require specific Scala versions for proper compilation and execution.
Library dependencies: Third-party libraries used within your Spark application might have compatibility requirements with both Spark and Scala versions (see the build definition sketch after this list).
Community support: Older Spark versions might have limited community support and resources, impacting problem-solving and maintenance.
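As an illustration only (the versions shown are examples, not prescriptions), a minimal sbt build definition that keeps the Scala binary version and the Spark artifacts aligned might look like this:

// build.sbt -- pick the Scala and Spark versions from the matrix below.
ThisBuild / scalaVersion := "2.12.18"   // Scala binary version 2.12

val sparkVersion = "3.5.0"

libraryDependencies ++= Seq(
  // %% appends the Scala binary suffix (_2.12), so Spark and any third-party
  // libraries resolve against the same Scala binary version.
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)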
2. Spark Scala Version Compatibility Matrix
Here's a compatibility matrix for Spark and Scala versions:
Spark Version | Supported Scala Binary Version(s) | Cloudera Supported Binary Version(s) | Scala 2.11 | Scala 2.12 | Scala 2.13
---|---|---|---|---|---
3.5.0 | 2.12/2.13 | 2.12 | - | 2.12.18 | 2.13.8
3.4.2 | 2.12/2.13 | 2.12 | - | 2.12.17 | 2.13.8
3.4.1 | 2.12/2.13 | 2.12 | - | 2.12.17 | 2.13.8
3.4.0 | 2.12/2.13 | 2.12 | - | 2.12.17 | 2.13.8
3.3.4 | 2.12/2.13 | 2.12 | - | 2.12.15 | 2.13.8
3.3.3 | 2.12/2.13 | 2.12 | - | 2.12.15 | 2.13.8
3.3.2 | 2.12/2.13 | 2.12 | - | 2.12.15 | 2.13.8
3.3.1 | 2.12/2.13 | 2.12 | - | 2.12.15 | 2.13.8
3.3.0 | 2.12/2.13 | 2.12 | - | 2.12.15 | 2.13.8
3.2.4 | 2.12/2.13 | 2.12 | - | 2.12.15 | 2.13.5
3.2.3 | 2.12/2.13 | 2.12 | - | 2.12.15 | 2.13.5
3.2.2 | 2.12/2.13 | 2.12 | - | 2.12.15 | 2.13.5
3.2.1 | 2.12/2.13 | 2.12 | - | 2.12.15 | 2.13.5
3.2.0 | 2.12/2.13 | 2.12 | - | 2.12.15 | 2.13.5
3.1.3 | 2.12 | 2.12 | - | 2.12.10 | -
3.1.2 | 2.12 | 2.12 | - | 2.12.10 | -
3.1.1 | 2.12 | 2.12 | - | 2.12.10 | -
3.0.3 | 2.12 | 2.12 | - | 2.12.10 | -
3.0.2 | 2.12 | 2.12 | - | 2.12.10 | -
3.0.1 | 2.12 | 2.12 | - | 2.12.10 | -
3.0.0 | 2.12 | 2.12 | - | 2.12.10 | -
2.4.8 | 2.11/2.12 | 2.11 | 2.11.12 | 2.12.10 | -
2.4.7 | 2.11/2.12 | 2.11 | 2.11.12 | 2.12.10 | -
2.4.6 | 2.11/2.12 | 2.11 | 2.11.12 | 2.12.10 | -
2.4.5 | 2.11/2.12 | 2.11 | 2.11.12 | 2.12.10 | -
2.4.4 | 2.11/2.12 | 2.11 | 2.11.12 | 2.12.8 | -
2.4.3 | 2.11/2.12 | 2.11 | 2.11.12 | 2.12.8 | -
2.4.2 | 2.11/2.12 | 2.11 | 2.11.12 | 2.12.8 | -
2.4.1 | 2.11/2.12 | 2.11 | 2.11.12 | 2.12.8 | -
2.4.0 | 2.11/2.12 | 2.11 | 2.11.12 | 2.12.7 | -
References:
Spark Project SQL cloudera-repos
Spark Project SQL
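As a quick, illustrative check (run inside spark-shell), you can print the Spark version and the Scala version the distribution was built with, and compare them against the matrix above:

// Both values should line up with one row of the compatibility matrix.
println(s"Spark version: ${spark.version}")
println(s"Scala version: ${scala.util.Properties.versionNumberString}")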
3. JDK & Scala compatibility
Minimum Scala versions:
JDK Version | Scala 3 | Scala 2.13 | Scala 2.12 | Scala 2.11
---|---|---|---|---
22 (ea) | 3.3.2 | 2.13.12 | 2.12.19 | -
21 (LTS) | 3.3.1 | 2.13.11 | 2.12.18 | -
20 | 3.3.0 | 2.13.11 | 2.12.18 | -
19 | 3.2.0 | 2.13.9 | 2.12.16 | -
18 | 3.1.3 | 2.13.7 | 2.12.15 | -
17 (LTS) | 3.0.0 | 2.13.6 | 2.12.15 | -
11 (LTS) | 3.0.0 | 2.13.0 | 2.12.4 | 2.11.12
8 (LTS) | 3.0.0 | 2.13.0 | 2.12.0 | 2.11.0
4. Conclusion
Ensuring compatibility between Spark and Scala versions is crucial for the successful development and deployment of Spark applications. By referring to this compatibility matrix, developers can make informed decisions when selecting Spark and Scala versions based on their project requirements and constraints.
Thank you for taking the time to read this article. We hope you found it informative and helpful in enhancing your understanding of the topic. If you have any questions or feedback, please feel free to reach out to me. Remember, your support motivates us to continue creating valuable content. If this article helped you, please consider giving it a like and providing a kudos. We appreciate your support!
02-25-2024
10:06 PM
https://community.cloudera.com/t5/Community-Articles/Spark-and-Java-versions-Supportability-Matrix/ta-p/383669
02-22-2024
10:51 PM
Spark and Java versions Supportability Matrix
1. Introduction:
Apache Spark is a powerful open-source distributed computing system widely used for big data processing and analytics. However, choosing the right Java version for your Spark application is crucial for optimal performance, security, and compatibility.
This article dives deep into the officially supported Java versions for Spark, along with helpful advice on choosing the right one for your project.
2. Matrix Table
# | Spark Version | Supported Java Version(s) | Java 8 | Java 11 | Java 17 | Java 21 | Deprecated Java Version(s)
---|---|---|---|---|---|---|---
1 | 3.5.0 | Java 8*/11/17 | Yes | Yes | Yes | No | Java 8 prior to version 8u371 support is deprecated
2 | 3.4.2 | Java 8*/11/17 | Yes | Yes | Yes | No | Java 8 prior to version 8u362 support is deprecated
3 | 3.4.1 | Java 8*/11/17 | Yes | Yes | Yes | No | Java 8 prior to version 8u362 support is deprecated
4 | 3.4.0 | Java 8*/11/17 | Yes | Yes | Yes | No | Java 8 prior to version 8u362 support is deprecated
5 | 3.3.3 | Java 8*/11/17 | Yes | Yes | Yes | No | Java 8 prior to version 8u201 support is deprecated
6 | 3.3.2 | Java 8*/11/17 | Yes | Yes | Yes | No | Java 8 prior to version 8u201 support is deprecated
7 | 3.3.1 | Java 8*/11/17 | Yes | Yes | Yes | No | Java 8 prior to version 8u201 support is deprecated
8 | 3.3.0 | Java 8*/11/17^ | Yes | Yes | No | No | Java 8 prior to version 8u201 support is deprecated
9 | 3.2.4 | Java 8*/11 | Yes | Yes | No | No | Java 8 prior to version 8u201 support is deprecated
10 | 3.2.3 | Java 8*/11 | Yes | Yes | No | No | Java 8 prior to version 8u201 support is deprecated
11 | 3.2.2 | Java 8*/11 | Yes | Yes | No | No | Java 8 prior to version 8u201 support is deprecated
12 | 3.2.1 | Java 8*/11 | Yes | Yes | No | No | Java 8 prior to version 8u201 support is deprecated
13 | 3.2.0 | Java 8*/11 | Yes | Yes | No | No | Java 8 prior to version 8u201 support is deprecated
14 | 3.1.3 | Java 8*/11 | Yes | Yes | No | No | Java 8 prior to version 8u92 support is deprecated
15 | 3.1.2 | Java 8*/11 | Yes | Yes | No | No | Java 8 prior to version 8u92 support is deprecated
16 | 3.1.1 | Java 8*/11 | Yes | Yes | No | No | Java 8 prior to version 8u92 support is deprecated
17 | 3.0.3 | Java 8*/11 | Yes | Yes | No | No | Java 8 prior to version 8u92 support is deprecated
18 | 3.0.2 | Java 8*/11 | Yes | Yes | No | No | Java 8 prior to version 8u92 support is deprecated
19 | 3.0.1 | Java 8*/11 | Yes | Yes | No | No | Java 8 prior to version 8u92 support is deprecated
20 | 3.0.0 | Java 8*/11 | Yes | Yes | No | No | Java 8 prior to version 8u92 support is deprecated
21 | 2.4.8 | Java 8* | Yes | No | No | No | -
22 | 2.4.7 | Java 8* | Yes | No | No | No | -
23 | 2.4.6 | Java 8* | Yes | No | No | No | -
24 | 2.4.5 | Java 8* | Yes | No | No | No | -
25 | 2.4.4 | Java 8* | Yes | No | No | No | -
26 | 2.4.3 | Java 8* | Yes | No | No | No | -
27 | 2.4.2 | Java 8* | Yes | No | No | No | -
28 | 2.4.1 | Java 8* | Yes | No | No | No | -
29 | 2.4.0 | Java 8* | Yes | No | No | No | -
* means Cloudera recommended Java version.
^ means Upstream Spark is supported.
Note: According to the Cloudera documentation, Spark 3.3.0 only supports Java 8 and 11. However, the official Spark documentation lists Java 8, 11, and 17 as compatible versions.
References:
Apache Spark - A Unified engine for large-scale data analytics
CDS 3.3 Powered by Apache Spark Requirements
SPARK-24417
Support Matrix Cloudera
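As a quick, illustrative check (run inside spark-shell), you can confirm which Java and Spark versions your session is actually running on and compare them against the matrix above:

// Compare the output with the supportability matrix.
println(s"Spark version: ${spark.version}")
println(s"Java version:  ${System.getProperty("java.version")}")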
3. Problems Arising from Unsupported Spark & Java Versions
Utilizing incompatible or unsupported versions of Spark and Java can introduce various challenges and impediments in the operation of your Spark environment.
Performance Degradation: The usage of an incompatible Java version could lead to performance degradation or inefficiencies. This is attributable to the inability to leverage the latest optimizations and features provided by newer Java releases, resulting in the suboptimal performance of Spark jobs.
Compatibility Issues: Spark's functionality may be compromised or rendered unstable when interfacing with specific versions of Java. This can manifest as unexpected errors or failures during runtime, hindering the smooth execution of Spark applications.
Feature Limitations: Newer iterations of Spark may rely on features or enhancements exclusive to certain Java versions. Employing outdated or unsupported Java versions may curtail your ability to exploit these advanced features, constraining the capabilities and functionalities of your Spark applications.
4. End of Life (EOL) dates for Java versions:
# | Java Version | EOL Date
---|---|---
1 | Java 8 | December 31, 2020 (Public Updates); still supported with Long Term Support (LTS) until December 2030
2 | Java 11 | September 30, 2023 (Public Updates); still supported with Long Term Support (LTS) until January 2032
3 | Java 17 | September 30, 2026 (Public Updates); Long Term Support (LTS) until September 2029
4 | Java 21 | September 30, 2028 (Public Updates); Long Term Support (LTS) until September 2031
Reference(s):
Oracle Java SE Support Roadmap
Java version history
5. JDK & Scala compatibility
Minimum Scala versions:
JDK Version | Scala 3 | Scala 2.13 | Scala 2.12 | Scala 2.11
---|---|---|---|---
22 (ea) | 3.3.2 | 2.13.12 | 2.12.19 | -
21 (LTS) | 3.3.1 | 2.13.11 | 2.12.18 | -
20 | 3.3.0 | 2.13.11 | 2.12.18 | -
19 | 3.2.0 | 2.13.9 | 2.12.16 | -
18 | 3.1.3 | 2.13.7 | 2.12.15 | -
17 (LTS) | 3.0.0 | 2.13.6 | 2.12.15 | -
11 (LTS) | 3.0.0 | 2.13.0 | 2.12.4 | 2.11.12
8 (LTS) | 3.0.0 | 2.13.0 | 2.12.0 | 2.11.0
`*` = forthcoming; support available in nightly builds
Thank you for taking the time to read this article. We hope you found it informative and helpful in enhancing your understanding of the topic. If you have any questions or feedback, please feel free to reach out to me. Remember, your support motivates us to continue creating valuable content. If this article helped you, please consider giving it a like and providing a kudos. We appreciate your support!
02-05-2024
08:59 PM
If the above answers helped you, please accept one as the solution. It will be helpful for others.
02-05-2024
08:58 PM
Hi @sonnh The way Spark and Hive handle writing data back to the same table differs. Spark typically clears the target path before writing the new data, while Hive first writes to a temporary directory and only replaces the target path with the result data once the task completes. When working with file formats such as ORC or Parquet through the Hive metastore, consider adjusting these Spark settings as needed:
--conf spark.sql.hive.convertMetastoreParquet=false
--conf spark.sql.hive.convertMetastoreOrc=false
Reference:
https://community.cloudera.com/t5/Support-Questions/Insert-overwrite-with-in-the-same-table-in-spark/m-p/242780
https://www.baifachuan.com/posts/da7bb348.html
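For illustration only (not part of the original reply, and the application name is hypothetical), the same settings can also be applied when building the session in code; a minimal sketch:

import org.apache.spark.sql.SparkSession

// Disable Spark's built-in Parquet/ORC converters so Hive SerDes (and Hive's
// temporary-directory write path) are used for metastore tables.
val spark = SparkSession.builder()
  .appName("insert-overwrite-example")
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  .getOrCreate()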
02-04-2024
08:05 PM
1 Kudo
Hi @Meepoljd Please let me know if you still need any help on this issue. If any of the above solutions helped, please mark it as the accepted solution.
02-04-2024
08:11 AM
Hi @zhuw.bigdata To locate Spark logs, follow these steps:
1. Access the Spark UI: Open the Spark UI in your web browser.
2. Identify nodes: Navigate to the Executors tab to view information about the driver and executor nodes involved in the Spark application.
3. Determine the log directory: Within the Spark UI, find the Hadoop settings section and locate the value of the yarn.nodemanager.log-dirs property. This is the base directory for Spark container logs on the cluster.
4. Access the log location: Using a terminal or SSH, log in to the relevant node (driver or executor) where the logs you need are located.
5. Navigate to the application log directory: Within the yarn.nodemanager.log-dirs directory, open the subdirectory for the specific application, named application_${appid}, where ${appid} is the unique application ID of the Spark job.
6. Find the container logs: Within the application directory, locate the individual container log directories named container_${contid}, where ${contid} is the container ID.
7. Review the log files: Each container directory contains the log files generated by that container: stderr (standard error output), stdout (standard output), and syslog (system-level logs).
02-04-2024
08:02 AM
Hi @zenaskun001 Could you please provide more details so we can check your issue? In the meantime, check the following things:
1. By default, the Spark interpreter is installed with Zeppelin. Check whether the Spark interpreter is present under your $ZEPPELIN_HOME/interpreters location.
2. After a proper installation, restart Zeppelin and its related components, such as Spark.
3. After restarting the Zeppelin service, log in again and check whether the interpreter is now listed.
4. As a last step, check the Zeppelin logs under the /var/log/zeppelin path.
02-04-2024
07:58 AM
Hi @sonnh Generally, it is not advisable to read from and write to the same table at the same time; in case of failure it can result in anything from data corruption to complete data loss. As a workaround, first create a temporary view by reading the table data, then build your result from that view, and finally save the result to the destination table (a sketch follows below).
Reference:
https://stackoverflow.com/questions/38746773/read-from-a-hive-table-and-write-back-to-it-using-spark-sql
https://issues.apache.org/jira/browse/SPARK-27030
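A rough, illustrative sketch of that idea (the table and column names are hypothetical, and the intermediate result is materialized in a staging table rather than kept only as a lazy temporary view):

// Assumes a Hive-enabled SparkSession named `spark` (e.g. spark-shell).
// Read the source table and register a temporary view over it.
val source = spark.table("db.events")                 // hypothetical table name
source.createOrReplaceTempView("events_view")

// Materialize the intermediate result in a staging table so the final write
// does not read from the table it is overwriting.
spark.sql("SELECT * FROM events_view WHERE event_date >= '2024-01-01'")
  .write.mode("overwrite").saveAsTable("db.events_staging")

// Overwrite the original table from the staged copy.
spark.table("db.events_staging")
  .write.mode("overwrite").insertInto("db.events")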