PySpark job running slow when reading data from Cassandra and writing into Snowflake

I am reading data from a Cassandra table (table size: 7 GB) and writing it to AWS S3 as Parquet files.

The job takes 20 minutes and writes 3 GB of data to AWS S3 in Parquet format.

Below are the configurations (an equivalent session setup is sketched after the list):

total vcores in the cluster - 28
total memory in the cluster - 52 GB
number of nodes in the cluster - 3
per-node capacity - 8 vcores and 15 GB RAM
cores per executor - 2
memory per executor - 2 GB
driver cores - 4
driver memory - 4 GB
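
For reference, these settings could be expressed when building the session as in the sketch below. The executor count of 12 is an assumption derived from the numbers above (28 total vcores minus 4 driver cores, divided by 2 cores per executor), and the app name is a placeholder:

from pyspark.sql import SparkSession

# A sketch, not the original job's exact setup. 12 executor instances
# are assumed: (28 total vcores - 4 driver cores) / 2 cores per executor;
# 12 x 2 GB = 24 GB, which stays within the 52 GB cluster memory.
# Note: driver settings normally have to be passed at launch time
# (e.g. via spark-submit); they appear here only to mirror the list above.
spark_session = SparkSession.builder \
    .appName('cassandra-to-s3') \
    .config('spark.executor.instances', '12') \
    .config('spark.executor.cores', '2') \
    .config('spark.executor.memory', '2g') \
    .config('spark.driver.cores', '4') \
    .config('spark.driver.memory', '4g') \
    .getOrCreate()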


Spark Cassandra read:

data = spark_session.read \
    .format('org.apache.spark.sql.cassandra') \
    .options(table=table, keyspace=keyspace) \
    .load()
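
One common cause of a slow Cassandra read is too few input partitions. With the DataStax connector (which provides the org.apache.spark.sql.cassandra format), the split size can be lowered so the 7 GB table fans out across more tasks. The sketch below assumes a recent connector version where the option is named spark.cassandra.input.split.sizeInMB, and 64 MB is only an illustrative value:

# Smaller splits -> more Spark partitions -> more parallel read tasks.
# 7 GB / 64 MB gives roughly 110 partitions, enough to keep all
# 28 vcores busy instead of a handful of large tasks.
data = spark_session.read \
    .format('org.apache.spark.sql.cassandra') \
    .options(table=table, keyspace=keyspace) \
    .option('spark.cassandra.input.split.sizeInMB', '64') \
    .load()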


Data write into AWS S3:

data.write.parquet(aws_path, mode="overwrite")
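
If the write stage is the bottleneck, controlling the number of output partitions before writing is worth trying. The partition count of 28 below is an assumption matched to the cluster's vcores, not a measured optimum:

# A sketch: one output partition per vcore, so ~3 GB of output becomes
# ~28 Parquet files of roughly 100 MB each, written in parallel,
# rather than very many small files or a few huge ones.
data.repartition(28) \
    .write \
    .parquet(aws_path, mode="overwrite")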