Member since: 06-02-2023
Posts: 12
Kudos Received: 0
Solutions: 0
04-14-2024 11:58 PM
1 Kudo
Hello @cirrus, thanks for using Cloudera Community. Kindly share a screenshot of the UI to help explain your team's observation, and confirm the versioning (Public vs. Private Cloud, CML version) for us to review internally and get back to you accordingly. - Smarak
08-04-2023 04:21 AM
In other words: why does this option not work? Web session timeouts (cloudera.com)
06-23-2023 01:23 AM
Thank you @haridjh! It worked! I am even more confused, though, because the underlying Parquet file is already partitioned, yet the code works once a repartition() is inserted!
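For anyone following along, a minimal sketch of the change described, assuming a Parquet source; the paths and the partition count of 470 (2,350,000 rows / 5,000 rows per file, from the post below) are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Repartition Sketch').getOrCreate()

# Read the already-partitioned Parquet data (path is illustrative).
df = spark.read.parquet('/tmp/test')

# Even though the files on disk are partitioned, an explicit repartition()
# redistributes the in-memory partitions before the wide transformation and write.
df = df.repartition(470)

df.write.format('parquet').mode('overwrite').save('/tmp/test_out')
spark.stop()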
06-14-2023 01:25 AM
1 Kudo
Hi @cirrus, you can find the optimized code below.

/tmp/test_pyspark.py:

from pyspark.sql.functions import expr
from pyspark.sql import SparkSession
from datetime import datetime
import math

spark = SparkSession.builder \
    .appName('Test App') \
    .getOrCreate()

num_rows = 2350000
num_columns = 2500
records_per_file = 5000

# Size the shuffle so each output file holds roughly records_per_file rows.
num_partitions = int(math.ceil(num_rows / records_per_file))
data = spark.range(num_rows).repartition(num_partitions)
print("Number of Partitions: " + str(data.rdd.getNumPartitions()))

# Generate num_columns random columns with values in [-1, 1).
start_time = datetime.now()
data = data.select(*[expr('rand() * 2 - 1 as col' + str(i)) for i in range(num_columns)])
#data = data.select("*", *[expr('rand() * 2 - 1 as col' + str(i)) for i in range(num_columns)])
end_time = datetime.now()
delta = end_time - start_time
# time difference in seconds
print("Time difference to select the columns is " + str(delta.total_seconds()) + " seconds")

start_time = datetime.now()
data.write.format("parquet").mode("overwrite").save("/tmp/test")
end_time = datetime.now()
delta = end_time - start_time
# time difference in seconds
print("Time difference for writing the data to HDFS is " + str(delta.total_seconds()) + " seconds")

spark.stop()

Spark-submit command:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.driver.memory=16G \
  --conf spark.driver.memoryOverhead=1g \
  --conf spark.executor.memory=16G \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.memory.fraction=0.8 \
  --conf spark.memory.storageFraction=0.4 \
  --conf spark.executor.cores=5 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.driver.extraJavaOptions="-Xss1024m" \
  --conf spark.executor.extraJavaOptions="-Xss1024m" \
  /tmp/test_pyspark.py
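If useful, a minimal sketch to confirm the -Xss options actually reached the job (run inside the submitted application; the conf keys are standard Spark settings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()

# Print the JVM options Spark recorded for the driver and executors.
print("driver extraJavaOptions:   " + conf.get("spark.driver.extraJavaOptions", "not set"))
print("executor extraJavaOptions: " + conf.get("spark.executor.extraJavaOptions", "not set"))

spark.stop()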
06-14-2023 01:06 AM
Fun fact for those interested: in order to have 8 cores running, this example needs a minimum of 64m for the -Xss option. If you choose 32m, it will not throw a StackOverflowError, but only 4 cores will be running 😲
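For illustration, a sketch of the submit variant this refers to; the 64m figure is the observation above, not a documented threshold, and the remaining settings mirror the earlier command:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.cores=8 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.driver.extraJavaOptions="-Xss64m" \
  --conf spark.executor.extraJavaOptions="-Xss64m" \
  /tmp/test_pyspark.py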