Support Questions

kunni_prudhvi · ‎05-01-2018

When Spark uses the Hadoop writer to write a part-file (using saveAsTextFile()), "part-NNNNN" is the general format of the file name. How can I retrieve this suffix "NNNNN" in Spark at runtime?

P.S. I do not want to list the files and then retrieve the suffix.

kunni_prudhvi · ‎05-01-2018

Any suggestions?

br_gilvan · ‎05-01-2018

Hi @Prudhvi Rao Shedimbi,

What exactly is your need?

I ask because, if you simply want to read the saved files, you only need to point to the folder and all of its content will be read.

sc.textFile("foldername/*")

If what you want is to write a single file from a previously processed DataFrame, you can do this using "df.repartition(1).rdd.saveAsTextFile('HDFSFolder/FileName')" (saveAsTextFile is an RDD method, so go through df.rdd), and only one file, "part-00000", will be generated.

If you are using a library like Databricks, you can do this:

df.write.format("csv").save("/HDFSFolder/FileName.csv")

Is that what you need?

kunni_prudhvi · ‎05-01-2018

I'm not trying to read it, I just want to know the complete name of the part-file at runtime in Spark, once a reducer saves it.

br_gilvan · ‎05-02-2018

Whenever you perform a save operation, the files are created according to the number of partitions of your DF, and the process generates file names of the form "part-xxxxx", so that is the complete file name.

The file name will never differ from this pattern. What varies is how many files are generated.

Sorry if I misunderstood what you need.
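
Since the files follow the partitions, one possible way to get the suffix at runtime without listing files is to derive it from the partition index. The sketch below is only an illustration and rests on an assumption: that for RDD saveAsTextFile() the suffix is the partition index zero-padded to five digits (the DataFrame writer usually adds more to the name, so this would not match there). The helper names part_file_name and tag_with_part_name are hypothetical, not part of Spark.

```python
def part_file_name(partition_id):
    """Build the expected part-file name for a partition index.

    Assumption: RDD saveAsTextFile() names each file "part-" followed by
    the partition index zero-padded to five digits.
    """
    return "part-{:05d}".format(partition_id)


def tag_with_part_name(index, rows):
    """Pair every record with the name of the part-file it should land in.

    The (index, iterator) signature matches what
    rdd.mapPartitionsWithIndex() passes to its function.
    """
    name = part_file_name(index)
    for row in rows:
        yield (name, row)


# Hypothetical usage with an existing RDD named `rdd`:
#   tagged = rdd.mapPartitionsWithIndex(tag_with_part_name)
# Inside a running task, the same index should also be available from
#   pyspark.TaskContext.get().partitionId()
```

The index only matches the file written if the RDD is not repartitioned between this step and the save.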

Cloudera Community

Spark - Get part-file suffix (part-NNNNN)