Created 07-13-2017 10:49 PM
I am looping over a dataset of 1000 partitions and running an operation on each partition as I go.
I'm using Spark 2.0 and doing an expensive join for each partition. The join takes less than a second when I call .show, but when I try to save the data, which is around 59 million rows, it takes 5 minutes (I tried repartitioning too).
5 minutes * 1000 partitions is 5000 minutes. I cannot wait that long. Any ideas on optimizing the saveAsTextFile performance?
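Roughly what the loop looks like (table names, paths, and the join key below are simplified placeholders, not my real code):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-and-save").getOrCreate()

val lookup = spark.read.parquet("/data/lookup")              // the expensive side of the join

(1 to 1000).foreach { p =>
  val part   = spark.read.parquet(s"/data/input/part=$p")    // one of the 1000 partitions
  val joined = part.join(lookup, Seq("id"))
  joined.show()                                              // fast: only a handful of rows are computed
  joined.rdd.map(_.mkString(","))                            // slow: the full join runs when saving
        .saveAsTextFile(s"/data/output/part=$p")
}
```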
Created 07-14-2017 01:24 AM
Where is the data located? Hive, HDFS? Can you share your cluster specs: how many nodes (# of cores / RAM)?
Created 07-14-2017 09:39 AM
Hi @Adnan Alvee, in order to parallelize the write to HDFS, you just need to increase the number of partitions for your data and/or increase the number of executors.
To increase the number of executors, specify the option --num-executors x when you submit your Spark job, where x is the number of executors you want. The more you have, the more parallelism you get.
To increase the number of partitions, call repartition(x) on the RDD or Dataset in your code. It will spread the data over x nodes (containers), and each node will write in parallel.
One last thing: don't increase the partition count too much, or you will end up creating too many small files. So I advise: size of full data (in MB) / 128 = repartition number.
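For example, a minimal sketch of both options, assuming your joined result is called joined (the numbers and path are placeholders, adjust them for your data size and cluster):

```scala
// Submit with more executors (values are placeholders):
//   spark-submit --num-executors 20 --executor-cores 4 --executor-memory 8g your_job.jar

// In the code, repartition before writing so x tasks write to HDFS in parallel.
// Rule of thumb from above: x = size of full data in MB / 128 (e.g. ~8 GB -> ~64 partitions).
val out = joined.repartition(64)           // 64 is an assumed value for x
out.rdd.map(_.mkString(","))
   .saveAsTextFile("/data/output")         // one part file per partition, written in parallel
```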
Michel
Created 07-14-2017 03:46 PM
NOTES:
Tried different numbers of executors, from 10 to 60, but performance doesn't improve.
Saving in Parquet format takes 1 minute, but I don't want Parquet.
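Roughly, the two writes I'm comparing (paths are placeholders):

```scala
joined.write.parquet("/data/output_parquet")                         // finishes in ~1 minute; columnar, Snappy-compressed by default
joined.rdd.map(_.mkString(",")).saveAsTextFile("/data/output_text")  // the slow one: uncompressed text, far more I/O
```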
Created 07-14-2017 04:23 PM
You can add compression when you write your data. This will speed up the save because the amount of data written will be smaller. Also increase the number of partitions.
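For example, something like this for compressed text output (the path and partition count are placeholders, and joined stands for your result Dataset):

```scala
// Write Gzip-compressed text part files; each partition is written by its own task.
import org.apache.hadoop.io.compress.GzipCodec

joined.repartition(64)                                      // placeholder; derive from size of data in MB / 128
      .rdd
      .map(_.mkString(","))
      .saveAsTextFile("/data/output", classOf[GzipCodec])   // compressed output -> less data to write to HDFS
```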