Member since: 06-02-2023
Posts: 12
Kudos Received: 0
Solutions: 0
04-14-2024 11:58 PM
1 Kudo
Hello @cirrus, thanks for using Cloudera Community. Kindly share a screenshot of the UI to help explain your team's observation, and confirm the versioning (Public vs. Private Cloud, CML version) for us to review internally and get back to you accordingly. - Smarak
08-04-2023 04:21 AM
In other words: why does this option not work? Web session timeouts (cloudera.com)
06-23-2023 01:23 AM
Thank you @haridjh! It worked! I am even more confused, though, because the underlying Parquet file is already partitioned, yet the code works once a repartition() is inserted!
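For anyone following along, a minimal sketch of the change described, assuming a Parquet source; the paths and the partition count of 470 (2,350,000 rows / 5,000 rows per file, from the post below) are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Repartition Sketch').getOrCreate()

# Read the already-partitioned Parquet data (path is illustrative).
df = spark.read.parquet('/tmp/test')

# Even though the files on disk are partitioned, an explicit repartition()
# redistributes the in-memory partitions before the wide transformation and write.
df = df.repartition(470)

df.write.format('parquet').mode('overwrite').save('/tmp/test_out')
spark.stop()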
06-14-2023 01:25 AM
1 Kudo
Hi @cirrus, you can find the optimized code below.

/tmp/test_pyspark.py:

from pyspark.sql.functions import expr
from pyspark.sql import SparkSession
from datetime import datetime
import math

spark = SparkSession.builder \
    .appName('Test App') \
    .getOrCreate()

num_rows = 2350000
num_columns = 2500
records_per_file = 5000

# Size the shuffle so each output file holds roughly records_per_file rows.
num_partitions = int(math.ceil(num_rows / records_per_file))
data = spark.range(num_rows).repartition(num_partitions)
print("Number of Partitions: " + str(data.rdd.getNumPartitions()))

# Generate num_columns random columns with values in [-1, 1).
start_time = datetime.now()
data = data.select(*[expr('rand() * 2 - 1 as col' + str(i)) for i in range(num_columns)])
#data = data.select("*", *[expr('rand() * 2 - 1 as col' + str(i)) for i in range(num_columns)])
end_time = datetime.now()
delta = end_time - start_time
# time difference in seconds
print("Time difference to select the columns is " + str(delta.total_seconds()) + " seconds")

start_time = datetime.now()
data.write.format("parquet").mode("overwrite").save("/tmp/test")
end_time = datetime.now()
delta = end_time - start_time
# time difference in seconds
print("Time difference for writing the data to HDFS is " + str(delta.total_seconds()) + " seconds")

spark.stop()

Spark-submit command:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.driver.memory=16G \
  --conf spark.driver.memoryOverhead=1g \
  --conf spark.executor.memory=16G \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.memory.fraction=0.8 \
  --conf spark.memory.storageFraction=0.4 \
  --conf spark.executor.cores=5 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.driver.extraJavaOptions="-Xss1024m" \
  --conf spark.executor.extraJavaOptions="-Xss1024m" \
  /tmp/test_pyspark.py
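If useful, a minimal sketch to confirm the -Xss options actually reached the job (run inside the submitted application; the conf keys are standard Spark settings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()

# Print the JVM options Spark recorded for the driver and executors.
print("driver extraJavaOptions:   " + conf.get("spark.driver.extraJavaOptions", "not set"))
print("executor extraJavaOptions: " + conf.get("spark.executor.extraJavaOptions", "not set"))

spark.stop()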
06-14-2023 01:06 AM
Fun fact for those interested: in order to have 8 cores running, this example needs a minimum of 64m for the -Xss option. If you choose 32m, it will not throw a StackOverflowError, but only 4 cores will be running 😲
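For illustration, a sketch of the submit variant this refers to; the 64m figure is the observation above, not a documented threshold, and the remaining settings mirror the earlier command:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.cores=8 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.driver.extraJavaOptions="-Xss64m" \
  --conf spark.executor.extraJavaOptions="-Xss64m" \
  /tmp/test_pyspark.py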