About Mamun_Shaheed

Mamun_Shaheed · ‎03-26-2025

Hi all,

I have been using Spark 2.2 in CDSW for a long time and recently started working with Spark 3 in CDP. One of my queries fails in Spark 3 with the following error:

    Py4JJavaError: An error occurred while calling o96.sql.
    : org.apache.spark.SparkException: Cannot broadcast the table over 512000000 rows: 1235668051 rows
        at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotBroadcastTableOverMaxTableRowsError(QueryExecutionErrors.scala:1824)

The same query runs fine in Spark 2.2 in CDSW. My Spark session configuration is as follows:

    # SET GENERAL SPARK PROPERTIES #
    print(" Configuring General Spark Properties")
    spark_session_builder = spark_session_builder.appName(name="Wrangler-Routine")
    spark_session_builder = spark_session_builder.master(master="yarn")
    spark_session_builder = spark_session_builder.enableHiveSupport()
    spark_session_builder = spark_session_builder.config("spark.yarn.queue", "root.project")
    spark_session_builder = spark_session_builder.config("spark.kryoserializer.buffer", "128m")
    spark_session_builder = spark_session_builder.config("spark.kryoserializer.buffer.max", "2024m")

    # SET SPARK DRIVER PROPERTIES #
    print(" Configuring Spark Driver Properties")
    spark_session_builder = spark_session_builder.config("spark.driver.cores", "16")
    spark_session_builder = spark_session_builder.config("spark.driver.memory", "64g")
    spark_session_builder = spark_session_builder.config("spark.driver.memoryOverhead", "8g")
    spark_session_builder = spark_session_builder.config("spark.driver.maxResultSize", "16g")

    # SET SPARK EXECUTOR PROPERTIES #
    print(" Configuring Spark Executor Properties")
    spark_session_builder = spark_session_builder.config("spark.executors.instances", "16")
    spark_session_builder = spark_session_builder.config("spark.executor.cores", "8")
    spark_session_builder = spark_session_builder.config("spark.executor.memory", "8g")
    spark_session_builder = spark_session_builder.config("spark.executor.memoryOverhead", "8g")

    # SET SPARK SQL PROPERTIES #
    print(" Configuring Spark SQL Properties")
    spark_session_builder = spark_session_builder.config("spark.sql.crossJoin.enabled", "true")
    spark_session_builder = spark_session_builder.config("spark.sql.autoBroadcastJoinThreshold", "-1")
    spark_session_builder = spark_session_builder.config("spark.sql.adaptive.autoBroadcastJoinThreshold", "-1")

    # INSTANTIATE SPARK SESSION #
    print("Instantiating Spark Session")
    spark_session = spark_session_builder.getOrCreate()
    spark_session.sql("""my sql here""")

What am I missing here?
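One way to narrow this down is to check whether the physical plan still contains a broadcast operator even though both thresholds are set to -1. As far as I know, Spark can still plan a broadcast when the SQL carries a broadcast hint, or when a join has no equality condition and is planned as a BroadcastNestedLoopJoin. Whether either applies to the query above is not known from the post. The helper below is a minimal sketch: `find_broadcast_operators` is a hypothetical name, and it only scans plan text for operator names.

```python
import re

# Physical-plan operator names that involve broadcasting one side of a join.
BROADCAST_OPERATORS = (
    "BroadcastExchange",
    "BroadcastHashJoin",
    "BroadcastNestedLoopJoin",
)

_PATTERN = re.compile(r"\b(" + "|".join(BROADCAST_OPERATORS) + r")")


def find_broadcast_operators(plan_text):
    """Return broadcast operator names found in plan_text, in first-seen order."""
    found = []
    for match in _PATTERN.finditer(plan_text):
        name = match.group(1)
        if name not in found:
            found.append(name)
    return found


# Usage sketch (assumes an existing session named spark_session and a
# query string named query; EXPLAIN returns a single row holding the plan):
#
#   plan_text = spark_session.sql("EXPLAIN " + query).collect()[0][0]
#   print(find_broadcast_operators(plan_text))
```

If the list is not empty, the join that produces the operator is the place to look: removing a hint or adding an equality condition lets Spark choose a non-broadcast join. Separately, the executor count property is spelled `spark.executor.instances`; the `spark.executors.instances` key in the configuration above is most likely ignored.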

Mamun_Shaheed · ‎03-10-2025

Sorry for my late response. I tried the JDK version mentioned in the driver documentation, but it didn't work. However, I am now using the keytab method to connect, and that works fine for me. @asish thanks a ton for all the support.

GangWar · ‎01-12-2021

@Mamun_Shaheed CDP doesn't support Python 3 or higher for CDH services. Here is the Software Dependency Note for reference:

Python - CDP Private Cloud Base, with the exceptions of Hue and Spark, is supported on the Python version that is included in the operating system by default, as well as higher versions, but is not compatible with Python 3.0 or higher. For example, CDP Private Cloud Base requires Python 2.7 or higher on RHEL 7 compatible operating systems. Spark 2 requires Python 2.7 or higher, and supports Python 3. If the right level of Python is not picked up by default, set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables to point to the correct Python executable before running the pyspark command.

I am assuming you want to use Python 3 for CDSW and similar tools. For that, you can use a custom engine with the required Python version, which is independent of the CDH services. Reference doc:

https://docs.cloudera.com/documentation/data-science-workbench/1-8-x/topics/cdsw_extensible_engines.html

In short, you can install a different Python on the hosts; just make sure the Cloudera services use only the supported Python version.
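For sessions started from a Python script or notebook rather than the pyspark command, the two environment variables from the note can also be set in the process before the Spark session is created. This is a minimal sketch: `pin_pyspark_python` is a hypothetical helper, and defaulting to the current interpreter is an assumption. On YARN the chosen path must also exist on the worker hosts.

```python
import os
import sys


def pin_pyspark_python(python_path=sys.executable):
    """Point PySpark at one Python executable for both driver and workers."""
    os.environ["PYSPARK_PYTHON"] = python_path
    os.environ["PYSPARK_DRIVER_PYTHON"] = python_path
    return python_path


# Call this before building the Spark session so the values are in place
# when the session starts.
pin_pyspark_python()
```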

Last Visited	‎03-26-2025 10:52 PM

Member Since	‎01-10-2021 02:31 AM
Posts	8
Kudos received	2

Cloudera Community

Re: DBeaver connection issue with CDP Hive.

Broadcast error in spark 3

Re: python 3.6 or higher installation on CDP