Created 07-13-2017 10:49 PM
I am looping over a dataset of 1000 partitions and running an operation on each partition as I go.
I'm using Spark 2.0 and doing an expensive join for each partition. The join takes less than a second when I call .show, but when I try to save the data, which is around 59 million rows, it takes 5 minutes (I tried repartitioning too).
5 minutes * 1000 partitions is 5000 minutes. I cannot wait that long. Any ideas on optimizing the saveAsTextFile performance?
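Roughly what the loop looks like (table names, paths, and the join key below are simplified placeholders, not my real code):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-and-save").getOrCreate()

val lookup = spark.read.parquet("/data/lookup")              // the expensive side of the join

(1 to 1000).foreach { p =>
  val part   = spark.read.parquet(s"/data/input/part=$p")    // one of the 1000 partitions
  val joined = part.join(lookup, Seq("id"))
  joined.show()                                              // fast: only a handful of rows are computed
  joined.rdd.map(_.mkString(","))                            // slow: the full join runs when saving
        .saveAsTextFile(s"/data/output/part=$p")
}
```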
Created 07-14-2017 01:24 AM
Where is the data located? Hive, HDFS? Can you share your cluster specs: how many nodes (# of cores / RAM)?
Created 07-14-2017 09:39 AM
Hi @Adnan Alvee, in order to parallelize the write to HDFS, you just need to increase the number of partitions for your data and/or increase the number of executors.
To increase the number of executors, specify the option --num-executors x when you submit your Spark job, where x is the number of executors you want. The more you have, the more parallelism you get.
To increase the number of partitions, call repartition(x) on the RDD or Dataset in your code. It will spread the data over x nodes (containers), and each node will write in parallel.
One last thing: don't increase the partition count too much, or you will end up creating too many small files. So I advise: size of full data (in MB) / 128 = repartition number.
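For example, a minimal sketch of both options, assuming your joined result is called joined (the numbers and path are placeholders, adjust them for your data size and cluster):

```scala
// Submit with more executors (values are placeholders):
//   spark-submit --num-executors 20 --executor-cores 4 --executor-memory 8g your_job.jar

// In the code, repartition before writing so x tasks write to HDFS in parallel.
// Rule of thumb from above: x = size of full data in MB / 128 (e.g. ~8 GB -> ~64 partitions).
val out = joined.repartition(64)           // 64 is an assumed value for x
out.rdd.map(_.mkString(","))
   .saveAsTextFile("/data/output")         // one part file per partition, written in parallel
```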
Michel
Created 07-14-2017 03:46 PM
NOTES:
Tried different numbers of executors, from 10 to 60, but performance doesn't improve.
Saving in Parquet format takes 1 minute, but I don't want Parquet.
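Roughly, the two writes I'm comparing (paths are placeholders):

```scala
joined.write.parquet("/data/output_parquet")                         // finishes in ~1 minute; columnar, Snappy-compressed by default
joined.rdd.map(_.mkString(",")).saveAsTextFile("/data/output_text")  // the slow one: uncompressed text, far more I/O
```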
Created 07-14-2017 04:23 PM
You can add compression when you write your data. This will speed up the save because the amount of data written will be smaller. Also increase the number of partitions.
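For example, something like this for compressed text output (the path and partition count are placeholders, and joined stands for your result Dataset):

```scala
// Write Gzip-compressed text part files; each partition is written by its own task.
import org.apache.hadoop.io.compress.GzipCodec

joined.repartition(64)                                      // placeholder; derive from size of data in MB / 128
      .rdd
      .map(_.mkString(","))
      .saveAsTextFile("/data/output", classOf[GzipCodec])   // compressed output -> less data to write to HDFS
```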