Member since: 06-02-2020
Posts: 331
Kudos Received: 67
Solutions: 49
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2806 | 07-11-2024 01:55 AM |
|  | 7876 | 07-09-2024 11:18 PM |
|  | 6579 | 07-09-2024 04:26 AM |
|  | 5914 | 07-09-2024 03:38 AM |
|  | 5624 | 06-05-2024 02:03 AM |
10-11-2022
02:13 AM
Hi @Ploeplse If you are still facing the issue, could you please share the requested information (i.e. the code and the Impala table creation script)?
09-30-2022
02:01 AM
Hi @imule Add the following parameters to your spark-submit command:

--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=<python3_path>
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=<python3_path>

Note:
1. Ensure the python3_path exists on all nodes.
2. Ensure the required modules are installed on every node.
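If you build the session from Python instead of passing flags to spark-submit, the same properties can be set on the SparkSession builder. A minimal sketch, assuming a hypothetical /usr/bin/python3 path (replace it with your actual python3_path); note that in YARN cluster mode these usually have to be supplied at submit time, since the Application Master is already running before this code executes:

```python
from pyspark.sql import SparkSession

# Sketch: mirror the --conf flags above from Python.
# "/usr/bin/python3" is an assumed path; use the python3 path present on all nodes.
spark = (
    SparkSession.builder
    .appName("python3-env-example")
    .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "/usr/bin/python3")
    .config("spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON", "/usr/bin/python3")
    .getOrCreate()
)
```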
09-21-2022
10:22 PM
Hi @Boron Could you please set the SPARK_HOME environment variable as shown below before creating the Spark session:

import os
os.environ['SPARK_HOME'] = '/usr/hdp/current/spark-client'

References:
https://stackoverflow.com/questions/55569985/pyspark-could-not-find-valid-spark-home
https://stackoverflow.com/questions/40087188/cant-find-spark-submit-when-typing-spark-shell
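For completeness, a minimal sketch of the full flow, assuming the HDP path above (adjust it to your installation) and that the pyspark package itself is importable on the machine:

```python
import os

# SPARK_HOME has to be set before the Spark session is created.
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-home-check").getOrCreate()
print(spark.version)  # quick sanity check that the session started
```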
09-16-2022
05:19 AM
Hi @poorva Please check the application logs of the failed application from the Resource Manager UI; the exception message is printed there. Fix the exception and resubmit the job.
09-14-2022
12:54 AM
Hi @Ploeplse Could you please share reproducible sample code and the Impala table creation script?
08-31-2022
10:53 PM
Hi @Yosieam Please avoid calling the read_file_log.collect() method. It brings the whole dataset to the driver, so the driver would need enough memory to hold all of that data. Please check the modified code (note that split() returns a list, so flatMap is used here to keep each element a single string):

import re

move_to_rdd = sc.textFile("datalog2.log").flatMap(lambda row: row.split("time=")).filter(lambda x: x != "")
ReSymbol = move_to_rdd.map(lambda x: re.sub(r'\t', ' ', x)) \
    .map(lambda x: re.sub(r'\n', ' ', x)) \
    .map(lambda x: re.sub(r' +', ' ', x))
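If you only need to inspect a handful of records on the driver, a bounded action such as take() avoids pulling the full dataset back. A small sketch (the sample size of 20 is just an illustration):

```python
# Bring back only a small sample instead of collect()-ing everything.
sample = ReSymbol.take(20)
for record in sample:
    print(record)
```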
08-31-2022
10:48 PM
Hi @mmk I think you have shared the following information:

7 nodes, each with 250 GB memory and 32 vCPUs

spark-defaults.conf
spark.executor.memory = 100g
spark.executor.memoryOverhead = 49g
spark.driver.memoryOverhead = 200g
spark.driver.memory = 500g

Each node has a maximum of 250 GB, yet you have specified 500 GB of driver memory plus 200 GB of overhead. How can the driver get 700 GB? Generally, you should not let driver/executor memory exceed the physical memory available to YARN.

Coming to the actual problem: please avoid using show() to print 8,000,000 records. If you need to print all the values, implement pagination logic that fetches 1000 records in one iteration and the next 1000 in the following one, as sketched below. https://stackoverflow.com/questions/29227949/how-to-implement-spark-sql-pagination-query
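A sketch of that kind of paging using a row_number window, where df and its ordering column id are placeholders for your own DataFrame:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

page_size = 1000

# Assign a stable row number so the data can be read back page by page.
# "df" and the ordering column "id" are placeholders for your own data.
numbered = df.withColumn("row_num", F.row_number().over(Window.orderBy("id")))

page = 1  # first page; increment in a loop to walk through all the rows
(
    numbered
    .filter((F.col("row_num") > (page - 1) * page_size) &
            (F.col("row_num") <= page * page_size))
    .show(page_size, truncate=False)
)
```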
08-31-2022
09:45 PM
Hi @mmk By default, Hive loads all SerDe jars under the hive/lib location, so the create/insert/select operations work from Hive. In order to read a Hive table created with a custom or external SerDe from Spark, we need to provide that SerDe to Spark as well, so Spark can load those libraries internally and read the Hive table data. If the SerDe is not provided, you will see the following exception:

org.apache.hadoop.hive.serde2.SerDeException

Please add the following library to the spark-submit command: json-serde-<version>.jar
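Equivalently, the jar can be attached when the Spark session is created. A minimal sketch, where the jar path and the table name (your_db.your_json_table) are placeholders:

```python
from pyspark.sql import SparkSession

# Ship the custom SerDe jar so Spark can deserialize the Hive table.
# The jar path/version and the table name below are placeholders.
spark = (
    SparkSession.builder
    .appName("hive-json-serde-read")
    .config("spark.jars", "/path/to/json-serde-<version>.jar")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SELECT * FROM your_db.your_json_table LIMIT 10").show()
```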
08-31-2022
09:36 PM
Hi @suri789 I think you haven't shared the full code, sample data, and expected output needed to provide a solution. Please share the code in a proper format.
08-31-2022
09:33 PM
Hi @AZIMKBC Please try to run the SparkPi example and check whether there are any errors in the logs: https://rangareddy.github.io/SparkPiExample/ If the issue is still not resolved and you are a Cloudera customer, please raise a support case and we will work on it internally.
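If a PySpark smoke test is more convenient than the Scala SparkPi jar, a rough equivalent is sketched below (the sample count of 100000 is arbitrary); if this job also fails, its YARN application logs should show the same underlying error:

```python
import random
from operator import add

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-pi-smoke-test").getOrCreate()
sc = spark.sparkContext

n = 100000  # arbitrary number of random samples

def inside(_):
    # Sample a point in the unit square and test whether it falls
    # inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1 else 0

count = sc.parallelize(range(n), 10).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()
```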