Member since 01-10-2021 | 8 Posts | 2 Kudos Received | 1 Solution
03-26-2025 09:57 PM
Hi All,

I have been using Spark 2.2 for a long time in CDSW and have recently started working with Spark 3 in CDP. One of my queries is failing in Spark 3 with the following error:

    Py4JJavaError: An error occurred while calling o96.sql.
    : org.apache.spark.SparkException: Cannot broadcast the table over 512000000 rows: 1235668051 rows
        at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotBroadcastTableOverMaxTableRowsError(QueryExecutionErrors.scala:1824)

The same query runs fine in Spark 2.2 in CDSW. My Spark session configuration is as follows:

    # SET GENERAL SPARK PROPERTIES #
    print(" Configuring General Spark Properties")
    spark_session_builder = spark_session_builder.appName(name="Wrangler-Routine")
    spark_session_builder = spark_session_builder.master(master="yarn")
    spark_session_builder = spark_session_builder.enableHiveSupport()
    spark_session_builder = spark_session_builder.config("spark.yarn.queue", "root.project")
    spark_session_builder = spark_session_builder.config("spark.kryoserializer.buffer", "128m")
    spark_session_builder = spark_session_builder.config("spark.kryoserializer.buffer.max", "2024m")

    # SET SPARK DRIVER PROPERTIES #
    print(" Configuring Spark Driver Properties")
    spark_session_builder = spark_session_builder.config("spark.driver.cores", "16")
    spark_session_builder = spark_session_builder.config("spark.driver.memory", "64g")
    spark_session_builder = spark_session_builder.config("spark.driver.memoryOverhead", "8g")
    spark_session_builder = spark_session_builder.config("spark.driver.maxResultSize", "16g")

    # SET SPARK EXECUTOR PROPERTIES #
    print(" Configuring Spark Executor Properties")
    spark_session_builder = spark_session_builder.config("spark.executors.instances", "16")
    spark_session_builder = spark_session_builder.config("spark.executor.cores", "8")
    spark_session_builder = spark_session_builder.config("spark.executor.memory", "8g")
    spark_session_builder = spark_session_builder.config("spark.executor.memoryOverhead", "8g")

    # SET SPARK SQL PROPERTIES #
    print(" Configuring Spark SQL Properties")
    spark_session_builder = spark_session_builder.config("spark.sql.crossJoin.enabled", "true")
    spark_session_builder = spark_session_builder.config("spark.sql.autoBroadcastJoinThreshold", "-1")
    spark_session_builder = spark_session_builder.config("spark.sql.adaptive.autoBroadcastJoinThreshold", "-1")

    # INSTANTIATE SPARK SESSION #
    print("Instantiating Spark Session")
    spark_session = spark_session_builder.getOrCreate()
    spark_session.sql("""my sql here""")

What am I missing here?
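One way to narrow this down is to check whether the physical plan still contains a broadcast operator even though both thresholds are set to -1. As far as I know, those thresholds only turn off size-based automatic broadcasting. Spark can still plan a broadcast for an explicit broadcast hint, or when it falls back to a broadcast nested loop join for a join without an equality condition. The sketch below is only a text scan and not a confirmed diagnosis of this query. `find_broadcast_operators` is a hypothetical helper, and `sample_plan` is hand-written for illustration, not real Spark output.

```python
import re

# Physical operators that broadcast one side of a join.
BROADCAST_OPERATORS = ("BroadcastExchange", "BroadcastHashJoin", "BroadcastNestedLoopJoin")


def find_broadcast_operators(plan_text):
    """Return (line_number, operator_name) pairs for broadcast operators in a plan dump."""
    pattern = re.compile("|".join(BROADCAST_OPERATORS))
    hits = []
    for line_number, line in enumerate(plan_text.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


# Hypothetical plan text, hand-written for illustration (NOT real Spark output).
# With a live session, the text to scan would be whatever
# spark_session.sql("""my sql here""").explain() prints.
sample_plan = """\
== Physical Plan ==
*(3) Project [id#1, name#7]
+- BroadcastNestedLoopJoin BuildRight, LeftOuter, (id#1 > other_id#5)
   :- *(1) Scan hive db.big_table [id#1]
   +- BroadcastExchange IdentityBroadcastMode
      +- *(2) Scan hive db.other_table [other_id#5, name#7]
"""

for line_number, operator in find_broadcast_operators(sample_plan):
    print(f"line {line_number}: {operator}")
```

If a broadcast operator shows up in the real plan, the join or subquery feeding it is the place to look, for example a leftover broadcast hint or a join condition with no equality predicate.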
03-10-2025 09:49 PM
2 Kudos
Sorry for my late response. I tried the JDK version mentioned in the driver documentation, but it didn't work. However, I am now using the keytab method to connect, and I am fine with it. @asish thanks a ton for all the support.
01-12-2021 05:10 AM
@Mamun_Shaheed CDP doesn’t support Python 3 and higher for CDH services. Here is the Software Dependency Note for reference:

Python - CDP Private Cloud Base, with the exceptions of Hue and Spark, is supported on the Python version that is included in the operating system by default, as well as higher versions, but is not compatible with Python 3.0 or higher. For example, CDP Private Cloud Base requires Python 2.7 or higher on RHEL 7 compatible operating systems. Spark 2 requires Python 2.7 or higher, and supports Python 3. If the right level of Python is not picked up by default, set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables to point to the correct Python executable before running the pyspark command.

I am assuming you want to use Python 3 for CDSW and similar workloads. For that you can use a custom engine with the required Python version, which is independent of the CDH services. A reference doc is below.

https://docs.cloudera.com/documentation/data-science-workbench/1-8-x/topics/cdsw_extensible_engines.html

In short, you can install a separate Python on the hosts, as long as the Cloudera services use only the supported Python version.
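The environment-variable step from the dependency note can also be scripted. This is a minimal sketch, assuming the interpreter you want is the one running the script (`sys.executable`). On a real cluster you would substitute the actual path and make sure it exists on every node, which this code does not check. `pyspark_env` is a hypothetical helper name.

```python
import os
import sys


def pyspark_env(python_path, base_env=None):
    """Return a copy of the environment with both PySpark interpreter variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = python_path         # interpreter for the Python workers
    env["PYSPARK_DRIVER_PYTHON"] = python_path  # interpreter for the driver
    return env


# Assumption: reuse the interpreter running this script; substitute the real path.
env = pyspark_env(sys.executable)
print(env["PYSPARK_PYTHON"], env["PYSPARK_DRIVER_PYTHON"])

# The environment could then be handed to the launcher, for example:
# subprocess.run(["pyspark"], env=env)
```

Building a copy instead of changing `os.environ` in place keeps the setting limited to the process you launch with it.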