
Performance issue with simple reproducible case

Explorer

Hi,

This is a simple, reproducible case where the performance is surprisingly bad. It is a follow-up to the case under this link, where initially a stack overflow issue occurred.

 

The script below ran for 26 hours with 8 cores at full utilization, as seen in the hardware statistics.

 

Of course the object is "quite large", but similar operations on an object of this size do not take nearly as long. The generated output was 42.5 GB across the 8 Parquet files on HDFS.

 

Here is the code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand


ss = SparkSession.builder.appName("test_replication") \
    .config("spark.kryoserializer.buffer.max.mb", "2047") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.driver.maxResultSize", "16G") \
    .config("spark.driver.memory", "4G") \
    .config("spark.executor.memory", "16G") \
    .config("spark.dynamicAllocation.maxExecutors", "8") \
    .config("spark.executor.instances", "2") \
    .config("spark.executor.cores", "4") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.driver.extraJavaOptions", "-Xss1024m") \
    .config("spark.executor.extraJavaOptions", "-Xss1024m") \
    .config("spark.yarn.tags", "dev") \
    .getOrCreate()

rows = 2350000
cols = 2500

hdfs_dir = "/destination/on/hdfs"

data = ss.range(rows)
for i in range(cols):
    data = data.withColumn(f'col{i}', rand() * 2 - 1)

data.write.format("parquet").mode("overwrite").save(f"{hdfs_dir}/test.parquet")

 

Am I doing something wrong?

 

Edit: In the applicationHistory log I see the following element, which surprises me - is this normal?

"Storage Level":{"Use Disk":false,"Use Memory":false,"Deserialized":false,"Replication":1}, 

Edit2: Is this because there is no cache() or persist() in place?
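
What I mean concretely is something along these lines (untested, and I am not sure which storage level would be appropriate):

from pyspark import StorageLevel

data = data.cache()                                  # default storage level for DataFrames (typically memory + disk)
# or, with an explicit storage level:
data = data.persist(StorageLevel.MEMORY_AND_DISK)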

1 ACCEPTED SOLUTION

Master Collaborator

Hi @cirrus 

 

You can find the optimized code below.

 

/tmp/test_pyspark.py

from pyspark.sql.functions import col, expr
from pyspark.sql import SparkSession
from datetime import datetime
import math

spark = SparkSession.builder \
    .appName('Test App') \
    .getOrCreate()

num_rows = 2350000
num_columns = 2500
records_per_file = 5000
num_partitions = int(math.ceil(num_rows / records_per_file))
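# 2,350,000 rows / 5,000 rows per file = 470 partitions, i.e. roughly 470 output files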

data = spark.range(num_rows).repartition(num_partitions)
print("Number of Partitions: " + str(data.rdd.getNumPartitions()))

start_time = datetime.now()

data = data.select(*[expr('rand() * 2 - 1 as col'+str(i)) for i in range(num_columns)])
#data = data.select("*",*[expr('rand() * 2 - 1 as col'+str(i)) for i in range(num_columns)])

end_time = datetime.now()
delta = end_time - start_time

# time difference in seconds
print("Time difference to select the columns is "+ str(delta.total_seconds()) +" seconds")

start_time = datetime.now()
data.write.format("parquet").mode("overwrite").save("/tmp/test")
end_time = datetime.now()
delta = end_time - start_time

# time difference in seconds
print("Time difference for writing the data to HDFS is "+ str(delta.total_seconds()) +" seconds")

spark.stop()
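
The main change compared to the original script is that all 2,500 columns are generated in a single select() projection instead of 2,500 chained withColumn() calls; the Spark documentation warns that calling withColumn() in a loop builds a very large logical plan, which can cause performance problems and even a StackOverflowError. Roughly, the difference is (sketch only, reusing num_columns, rand and expr from the script above):

# slow: each withColumn() returns a new DataFrame and grows the logical plan
for i in range(num_columns):
    data = data.withColumn(f'col{i}', rand() * 2 - 1)

# faster: one projection with all columns, the plan stays flat
data = data.select(*[expr(f'rand() * 2 - 1 as col{i}') for i in range(num_columns)])

The repartition(num_partitions) call additionally controls how many Parquet files are written (about 470 here) instead of leaving it at the 8 partitions the original run ended up with.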

 

Spark-submit command:

spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.driver.memory=16G \
--conf spark.driver.memoryOverhead=1g \
--conf spark.executor.memory=16G \
--conf spark.executor.memoryOverhead=1g \
--conf spark.memory.fraction=0.8 \
--conf spark.memory.storageFraction=0.4 \
--conf spark.executor.cores=5 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.driver.extraJavaOptions="-Xss1024m" \
--conf spark.executor.extraJavaOptions="-Xss1024m" /tmp/test_pyspark.py

 



Explorer

I tried including a cache(), but it still takes just as long:

data = ss.range(rows).cache()

 

I also reduced the following to 32m, thinking that I may have overdone the Java stack size, but the effect is still the same:

.config("spark.driver.extraJavaOptions", "-Xss32m") \
.config("spark.executor.extraJavaOptions", "-Xss32m") \

 

@RangaReddy do you have an idea what I am doing wrong?

 

Edit: I see the following in the logs - is the cache actually working, or does the log only show the state from the beginning?

"Storage Level":{"Use Disk":false,"Use Memory":false,"Deserialized":false,"Replication":1},
