Spark : write ordered Dataframe to CSV

Explorer

I'm trying to write an ordered DataFrame/Dataset into multiple CSV files while preserving both the global and the local sort order.


I have the following code:

df
   .orderBy("date")
   .coalesce(100)
   .write
   .csv(...)


Does this code guarantee that:

- I will have 100 output files?
- Each CSV file is locally sorted, i.e. by the "date" column ascending?
- The files are globally sorted, i.e. all "date" values in part-0000 are less than those in part-0001, all "date" values in part-0001 are less than those in part-0002, and so on?


Thanks


Super Guru
@Plop564

I am not an expert in Spark, but my understanding is below:

1. I will have 100 output files
>>> This depends on how many partitions your original DataFrame has. coalesce can only reduce the number of partitions, so if you have fewer than 100 partitions to begin with it won't do anything, because coalesce does not shuffle the data.
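You can check this behavior yourself in spark-shell. A minimal sketch, assuming a SparkSession named spark is in scope and the partition counts are purely illustrative:

```scala
// Build a DataFrame with a known partition count.
val df = spark.range(1000).toDF("id").repartition(10)

// coalesce cannot increase the partition count: asking for 100
// when only 10 exist still leaves 10 partitions.
df.coalesce(100).rdd.getNumPartitions   // still 10

// Reducing the count works, and does so without a shuffle.
df.coalesce(4).rdd.getNumPartitions     // 4

// repartition does guarantee the requested count, at the cost
// of a full shuffle.
df.repartition(100).rdd.getNumPartitions // 100
```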

If you want to guarantee the number of output files, I believe the repartition function is a better fit.
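Note, however, that a plain repartition(100) hashes rows across partitions and would destroy the ordering. A minimal sketch of one way to get a fixed file count while keeping both the local and the global order, assuming Spark 2.3+ (where repartitionByRange is available) and a hypothetical output path:

```scala
import org.apache.spark.sql.functions.col

// Range-partition into exactly 100 partitions by "date" (so partition
// boundaries follow the global order), then sort each partition
// locally. The resulting 100 files are each sorted by "date", and
// part-00000 holds the smallest dates, part-00001 the next range, etc.
df.repartitionByRange(100, col("date"))
  .sortWithinPartitions("date")
  .write
  .csv("/tmp/output")  // hypothetical path
```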

2. Each CSV file is locally sorted by the "date" column ascending
>>> Yes.

3. The files are globally sorted: all "date" values in part-0000 are less than those in part-0001, and so on
>>> I believe the answer is also yes, but I will wait for other Spark experts to confirm.

Cheers
Eric