Member since: 11-12-2018
Posts: 218
Kudos Received: 179
Solutions: 35

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 339 | 08-08-2025 04:22 PM
 | 410 | 07-11-2025 08:48 PM
 | 626 | 07-09-2025 09:33 PM
 | 1122 | 04-26-2024 02:20 AM
 | 1479 | 04-18-2024 12:35 PM
06-28-2022
04:11 PM
Hi @ajaybabum, Yes, you can run Spark in local mode against a Kerberized cluster. For a quick test, can you open spark-shell directly, read the CSV file from the HDFS location, and show its contents? That will tell us whether the issue is in the cluster/Spark configuration or in your application code. >> Will it be possible in local mode without running the kinit command before spark-submit? -- Yes: by passing --keytab and --principal in your spark-submit, you don't need to run kinit beforehand. Thanks
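For illustration, a minimal spark-submit sketch of the keytab approach (the principal, keytab path, and application file below are placeholders, not values from this thread):

spark-submit \
  --master local[*] \
  --principal etl_user@EXAMPLE.COM \
  --keytab /etc/security/keytabs/etl_user.keytab \
  my_app.py

And for the quick spark-shell test, something like spark.read.option("header", "true").csv("hdfs:///tmp/sample.csv").show() (path assumed) confirms HDFS access works before you debug application code.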
06-28-2022
04:05 PM
Hi @NaniSK, Could you please reach out to the Cloudera Certification Team at certification@cloudera.com with any feedback or concerns about your certificate and license. Thanks.
06-28-2022
03:31 PM
Hi @dfdf, I tried this in my cluster with both Spark2 and Spark3 on the same versions you tried, and I get the results without any issues. Spark2: 2.4.7.7.1.7.1000-141 Spark3: 3.2.1.3.2.7171000.1-1 Are you still seeing this issue? Could you share the steps to reproduce, so I can try them in my cluster? Thanks
06-28-2022
02:07 PM
1 Kudo
Hi @NicolasMarcos, Thank you for your interest in downloading the Cloudera Quickstart VM. Unfortunately, the Cloudera Quickstart VM has been discontinued. You can try the Cloudera docker image, available publicly at https://hub.docker.com/r/cloudera/quickstart, or simply run the command below on a docker-enabled system to download it:

docker pull cloudera/quickstart

Please note that Cloudera doesn't officially support the QuickStart VM and it's deprecated. The up-to-date product is Cloudera Data Platform, and you can download a trial version to install on-premises here.
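Once pulled, a typical way to start the container (this invocation follows the image's public README; the port mapping exposes Hue on 8888, adjust to taste):

docker run --hostname=quickstart.cloudera --privileged=true -t -i \
  -p 8888:8888 cloudera/quickstart /usr/bin/docker-quickstart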
06-27-2022
03:16 PM
Hi @sss123, this seems to be a bug. Please refer to https://issues.cloudera.org/browse/LIVY-3. Kindly note that the Spark Notebook is not currently supported. Also, please review the discussion at https://github.com/cloudera/hue/issues/254
06-27-2022
07:46 AM
Hi @ds_explorer, it seems the edit log is too big and cannot be completely read by the NameNode within the default/configured timeout.

2022-06-25 08:32:24,872 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 554705629. Expected transaction ID was 60366342312
Recent opcode offsets: 554704754 554705115 554705361 554705629
.....
Caused by: java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:203)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LengthPrefixedReader.decodeOpFrame(FSEditLogOp.java:4488)

To fix this, add the parameter and value below (if you already have it, increase the value):
HDFS > Configuration > JournalNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml
hadoop.http.idle_timeout.ms=180000
Then restart the required services.
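For reference, that safety valve should render into the JournalNode's hdfs-site.xml roughly as the standard Hadoop property form below (a sketch; verify the generated file on the JournalNode hosts):

<property>
  <name>hadoop.http.idle_timeout.ms</name>
  <value>180000</value>
</property>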
06-25-2022
03:05 PM
It seems like your Spark workers are pointing to the default/system installation of Python rather than your virtual environment. By setting the environment variables below, you can tell Spark to use your virtual environment. Set the following two configs in <spark_home_dir>/conf/spark-env.sh:

export PYSPARK_PYTHON=<Python_binaries_Path>
export PYSPARK_DRIVER_PYTHON=<Python_binaries_Path>
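Alternatively, if you'd rather not edit spark-env.sh, the same interpreter can be set per job through Spark's standard spark.pyspark.python / spark.pyspark.driver.python configs (the venv path and app file below are placeholders):

spark-submit \
  --conf spark.pyspark.python=/path/to/venv/bin/python \
  --conf spark.pyspark.driver.python=/path/to/venv/bin/python \
  my_app.py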
06-15-2022
10:41 AM
2 Kudos
Hi @suri789 Can you try the below and share your feedback?

>>> df.show()
+----------------+
|           value|
+----------------+
|   N. Plainfield|
|North Plainfield|
|  West Home Land|
|         NEWYORK|
|         newyork|
|  So. Plainfield|
|   S. Plaindield|
|    s Plaindield|
|North Plainfield|
+----------------+

>>> from pyspark.sql.functions import regexp_replace, lower
>>> df_tmp = df.withColumn('value', regexp_replace('value', r'\.', ''))
>>> df_tmp.withColumn('value', lower(df_tmp.value)).distinct().show()
+----------------+
|           value|
+----------------+
|    s plaindield|
|    n plainfield|
|  west home land|
|         newyork|
|   so plainfield|
|north plainfield|
+----------------+
06-15-2022
09:40 AM
1 Kudo
Hi @dfdf, I am not able to reproduce this issue; I can get the table details when running the queries in a Spark3 session. Could you share the exact Spark3 and Hive versions running in your environment? For example, you can get the Spark version by running spark3-shell --version. Please verify whether you are seeing any errors or alerts related to the Hive service. Also, can you run similar queries directly from Hive and check whether you get results?
06-15-2022
08:35 AM
1 Kudo
Hi @haze5736, You need to use the Hive Warehouse Connector (HWC) to query Apache Hive managed tables from Apache Spark. Using the HWC API, you can read and write Apache Hive tables from Apache Spark. For example, to write to a managed table:

df.write.format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option("table", <tableName>).option("partition", <partition_spec>).save()

Ref: https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/integrating-hive-and-bi/topics/hive-read-write-operations.html
For more details you can refer to the documentation below:
https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/integrating-hive-and-bi/topics/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/integrating-hive-and-bi/topics/hive_submit_a_hivewarehouseconnector_python.html
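To make that concrete, a minimal PySpark sketch following the HWC docs linked above (pyspark_llap is the Python module HWC ships per the Cloudera documentation; the database and table names are made up for illustration):

from pyspark_llap import HiveWarehouseSession

# Build an HWC session on top of the existing SparkSession
hive = HiveWarehouseSession.session(spark).build()

# Read a managed Hive table through HWC (names assumed)
df = hive.executeQuery("SELECT * FROM sales_db.transactions")

# Write the DataFrame back to a managed table via the connector
df.write.format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR) \
    .option("table", "sales_db.transactions_copy") \
    .save()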