Spark : write ordered Dataframe to CSV

I'm trying to write an ordered DataFrame/Dataset to multiple CSV files while preserving both the global and the local sort order.

I have the following code:

df
   .orderBy("date")
   .coalesce(100)
   .write
   .csv(...)

Does this code guarantee that:

- I will have 100 output files
- Each single CSV file is locally sorted, I mean by the "date" column ascending
- Files are globally sorted, I mean CSV part-0000 has "date" values lower than CSV part-0001, CSV part-0001 has "date" values lower than CSV part-0002, and so on

Thanks

Re: Spark : write ordered Dataframe to CSV

@Plop564

I am not an expert in Spark, but my understanding is as follows:

1. I will have 100 output files
>>> This depends on how many partitions the DataFrame has when "coalesce" is called. "coalesce" can only reduce the number of partitions, because it does not shuffle. If there are fewer than 100 partitions at that point, it does nothing and you get fewer than 100 files.
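
One way to see this for yourself is to compare the partition counts before and after the coalesce. This is only a rough sketch: `partitionCounts` is a name I made up, and it assumes the DataFrame has a "date" column as in the question. As far as I know, the count after `orderBy` is governed by `spark.sql.shuffle.partitions` (200 by default) rather than by the original DataFrame, and adaptive execution may lower it.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: returns (partitions after orderBy, partitions after coalesce).
// Assumes df has a "date" column, as in the question.
def partitionCounts(df: DataFrame, target: Int): (Int, Int) = {
  val sorted = df.orderBy("date")
  val before = sorted.rdd.getNumPartitions
  // coalesce never increases the count, so this is at most min(before, target)
  val after = sorted.coalesce(target).rdd.getNumPartitions
  (before, after)
}
```

If the second number is below 100, the write will produce fewer than 100 files.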

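If you want to guarantee the number of output files, I believe the "repartition" function is a better fit, though since it does a full shuffle I would not expect the order from "orderBy" to survive it if it is called afterwards.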
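
A variant that may be worth trying is `repartitionByRange` followed by `sortWithinPartitions`, both of which exist on Dataset (the former since Spark 2.3). This is a swapped-in technique, not the code from the question, and the sketch below is untested on a real cluster. `writeSortedCsv`, `numFiles` and `outputPath` are names I made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: range-partition on "date", then sort each partition.
// Assumes df has a "date" column; numFiles and outputPath are illustrative.
def writeSortedCsv(df: DataFrame, numFiles: Int, outputPath: String): Unit =
  df.repartitionByRange(numFiles, col("date")) // rows split into date ranges
    .sortWithinPartitions("date")              // local sort inside each range
    .write
    .csv(outputPath)
```

The range boundaries are estimated by sampling, so the files will not be exactly the same size, and I would expect at most `numFiles` files rather than exactly that many, since a DataFrame with few distinct dates can leave some ranges empty.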

2. Each single CSV file is locally sorted, I mean by the "date" column ascending
>>> Yes

3. Files are globally sorted, I mean CSV part-0000 has "date" values lower than CSV part-0001, CSV part-0001 has "date" values lower than CSV part-0002, and so on
>>> I believe the answer is also yes, but I will wait for other Spark experts to confirm.
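
Until someone confirms, you could check the global order on the written output yourself. The sketch below reads the files back, takes the minimum and maximum "date" per file, and checks that the ranges do not overlap when the files are ordered by name. It rests on several assumptions of mine: `filesGloballySorted` is a made-up name, "date" is the first CSV column, there is no header row, there are no null dates, the dates are strings that sort lexicographically (such as yyyy-MM-dd), and the directory holds the output of a single write.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{input_file_name, max, min}

// Hypothetical check: true if each file's max date <= the next file's min date.
// Only verifies the order across files, not the order of rows inside a file.
def filesGloballySorted(spark: SparkSession, path: String): Boolean = {
  val ranges = spark.read.csv(path)
    .withColumnRenamed("_c0", "date") // assumption: date is the first column
    .groupBy(input_file_name().as("file"))
    .agg(min("date").as("lo"), max("date").as("hi"))
    .orderBy("file")                  // part-00000, part-00001, ...
    .collect()
    .map(r => (r.getString(1), r.getString(2)))
  // Compare every file with its successor
  ranges.zip(ranges.drop(1)).forall { case ((_, hi), (lo, _)) => hi <= lo }
}
```

If this returns false for the output of the coalesce version, that would answer question 3 for your data and Spark version.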

Cheers
Eric