Member since: 11-12-2018
Posts: 218
Kudos Received: 179
Solutions: 35

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 339 | 08-08-2025 04:22 PM
 | 410 | 07-11-2025 08:48 PM
 | 626 | 07-09-2025 09:33 PM
 | 1122 | 04-26-2024 02:20 AM
 | 1479 | 04-18-2024 12:35 PM
06-28-2022
04:11 PM
Hi @ajaybabum, Yes, you can run Spark in local mode against a Kerberized cluster. For a quick test, can you open spark-shell directly, read the CSV file from the HDFS location, and show its contents? That will tell us whether the issue is in the cluster/Spark configuration or in your application code. >> Will it be possible in local mode without running the kinit command before spark-submit? -- Yes: by passing --keytab and --principal in your spark-submit, you don't need to run kinit beforehand. Thanks
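For illustration, a minimal spark-submit sketch of the keytab approach (the principal, keytab path, and application file below are placeholders, not values from this thread):

spark-submit \
  --master local[*] \
  --principal etl_user@EXAMPLE.COM \
  --keytab /etc/security/keytabs/etl_user.keytab \
  my_app.py

And for the quick spark-shell test, something like spark.read.option("header", "true").csv("hdfs:///tmp/sample.csv").show() (path assumed) confirms HDFS access works before you debug application code.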
06-28-2022
04:05 PM
Hi @NaniSK, Could you please reach out to the Cloudera Certification Team at certification@cloudera.com with any feedback or concerns about your certificate and license. Thanks.
06-28-2022
03:31 PM
Hi @dfdf, I tried this in my cluster with both Spark2 and Spark3 on the same versions you tried, and I get the results without any issues. Spark2: 2.4.7.7.1.7.1000-141 Spark3: 3.2.1.3.2.7171000.1-1 Are you still seeing this issue? Could you share the steps to reproduce, so I can try them in my cluster? Thanks
06-28-2022
02:07 PM
1 Kudo
Hi @NicolasMarcos, Thank you for your interest in downloading the Cloudera Quickstart VM. Unfortunately, the Cloudera Quickstart VM has been discontinued. You can try the Cloudera docker image, available publicly at https://hub.docker.com/r/cloudera/quickstart, or simply run the command below on a docker-enabled system to download it:

docker pull cloudera/quickstart

Please note that Cloudera doesn't officially support the QuickStart VM and it's deprecated. The up-to-date product is Cloudera Data Platform, and you can download a trial version to install on-premises here.
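Once pulled, a typical way to start the container (this invocation follows the image's public README; the port mapping exposes Hue on 8888, adjust to taste):

docker run --hostname=quickstart.cloudera --privileged=true -t -i \
  -p 8888:8888 cloudera/quickstart /usr/bin/docker-quickstart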
06-27-2022
03:16 PM
Hi @sss123, this seems to be a bug. Please refer to https://issues.cloudera.org/browse/LIVY-3. Kindly note that the Spark Notebook is not currently supported. Also, please review the discussion at https://github.com/cloudera/hue/issues/254
06-27-2022
07:46 AM
Hi @ds_explorer, it seems the edit log is too big and cannot be completely read by the NameNode within the default/configured timeout.

2022-06-25 08:32:24,872 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 554705629. Expected transaction ID was 60366342312
Recent opcode offsets: 554704754 554705115 554705361 554705629
.....
Caused by: java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:203)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LengthPrefixedReader.decodeOpFrame(FSEditLogOp.java:4488)

To fix this, add the parameter and value below (if you already have it, increase the value):
HDFS > Configuration > JournalNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml
hadoop.http.idle_timeout.ms=180000
Then restart the required services.
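For reference, that safety valve should render into the JournalNode's hdfs-site.xml roughly as the standard Hadoop property form below (a sketch; verify the generated file on the JournalNode hosts):

<property>
  <name>hadoop.http.idle_timeout.ms</name>
  <value>180000</value>
</property>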
06-25-2022
03:05 PM
It seems like your Spark workers are pointing to the default/system installation of Python rather than your virtual environment. By setting the environment variables below, you can tell Spark to use your virtual environment. Set the following two configs in <spark_home_dir>/conf/spark-env.sh:

export PYSPARK_PYTHON=<Python_binaries_Path>
export PYSPARK_DRIVER_PYTHON=<Python_binaries_Path>
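Alternatively, if you'd rather not edit spark-env.sh, the same interpreter can be set per job through Spark's standard spark.pyspark.python / spark.pyspark.driver.python configs (the venv path and app file below are placeholders):

spark-submit \
  --conf spark.pyspark.python=/path/to/venv/bin/python \
  --conf spark.pyspark.driver.python=/path/to/venv/bin/python \
  my_app.py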
06-15-2022
10:41 AM
2 Kudos
Hi @suri789 Can you try the below and share your feedback?

>>> df.show()
+----------------+
|           value|
+----------------+
|   N. Plainfield|
|North Plainfield|
|  West Home Land|
|         NEWYORK|
|         newyork|
|  So. Plainfield|
|   S. Plaindield|
|    s Plaindield|
|North Plainfield|
+----------------+

>>> from pyspark.sql.functions import regexp_replace, lower
>>> df_tmp = df.withColumn('value', regexp_replace('value', r'\.', ''))
>>> df_tmp.withColumn('value', lower(df_tmp.value)).distinct().show()
+----------------+
|           value|
+----------------+
|    s plaindield|
|    n plainfield|
|  west home land|
|         newyork|
|   so plainfield|
|north plainfield|
+----------------+
06-15-2022
09:40 AM
1 Kudo
Hi @dfdf, I am not able to reproduce this issue; I can get the table details when running the queries in a Spark3 session. Could you share the exact Spark3 and Hive versions running in your environment? For example, you can get the Spark version by running spark3-shell --version. Please verify whether you are seeing any errors or alerts related to the Hive service. Also, can you run similar queries directly from Hive and check whether you get results?
06-15-2022
08:35 AM
1 Kudo
Hi @haze5736, You need to use the Hive Warehouse Connector (HWC) to query Apache Hive managed tables from Apache Spark. Using the HWC API, you can read and write Apache Hive tables from Apache Spark. For example, to write to a managed table:

df.write.format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option("table", <tableName>).option("partition", <partition_spec>).save()

Ref: https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/integrating-hive-and-bi/topics/hive-read-write-operations.html
For more details you can refer to the documentation below:
https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/integrating-hive-and-bi/topics/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/integrating-hive-and-bi/topics/hive_submit_a_hivewarehouseconnector_python.html
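To make that concrete, a minimal PySpark sketch following the HWC docs linked above (pyspark_llap is the Python module HWC ships per the Cloudera documentation; the database and table names are made up for illustration):

from pyspark_llap import HiveWarehouseSession

# Build an HWC session on top of the existing SparkSession
hive = HiveWarehouseSession.session(spark).build()

# Read a managed Hive table through HWC (names assumed)
df = hive.executeQuery("SELECT * FROM sales_db.transactions")

# Write the DataFrame back to a managed table via the connector
df.write.format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR) \
    .option("table", "sales_db.transactions_copy") \
    .save()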