
pyspark - creating directory when trying to save RDD with saveAsTextFile

Super Collaborator

Hi Team,

I am using Python Spark 1.6.

I am trying to save an RDD to an HDFS path. When I run rdd.saveAsTextFile('/path/finename.txt'), it saves a directory with files in it.

It is creating a directory at the specified path instead of a file.

Can you suggest a way to create a file at the specified HDFS path instead of a directory with files in it?

4 REPLIES

Super Guru

@Viswa,

rdd.saveAsTextFile takes the path as input and creates part files inside that folder. If you want the output written as a single part file inside the folder, you can use

rdd.coalesce(1).saveAsTextFile('/path/finename.txt')

Thanks,

Aditya

Super Collaborator

@Aditya Sirna

Thank you for your response.

I have tried that too; it still creates a folder named 'finename.txt' with a single 'part-00000' file inside it.

Kindly let me know if there is a way to create a file directly at the specified path.

Super Guru

@Viswa,

I'm not aware of any way to create it as a file directly. The only option I can think of is to create a single part file and rename it as required.
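
For example, a rough sketch from the pyspark shell (where sc already exists), reaching the Hadoop FileSystem API through Spark's JVM gateway; the paths are just placeholders and rdd is the RDD from your post:

# Funnel all records through a single task so only part-00000 is written
rdd.coalesce(1).saveAsTextFile('/path/outputdir')

# Reach the Hadoop FileSystem API via the JVM gateway (Py4J)
# and rename the lone part file to the name you actually want
hadoop = sc._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.rename(hadoop.Path('/path/outputdir/part-00000'),
          hadoop.Path('/path/finename.txt'))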


Let's think about the basics.

The RDD being saved is distributed across machines, so if all of its tasks started writing to the same file in HDFS, they could only append, and writes would contend for a huge number of locks as multiple clients write at the same time. It's the classic case of distributed concurrent clients trying to write to one file (imagine multiple threads writing to the same log file). That's why a directory is created and each individual task writes its own file. Collectively, all the files in your output directory are the output of your job.

Solutions:

1. rdd.coalesce(1).saveAsTextFile('/path/outputdir'), and then in your driver use hdfs mv to move part-00000 to finename.txt (see the first sketch after this list).

2. Assuming the data is small (as you want to write to a single file), perform rdd.collect() and write to HDFS from the driver by getting an HDFS handle (second sketch below).
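
For option 1, a rough sketch of the rename step, assuming the driver host has the hdfs CLI on its PATH (paths are placeholders):

import subprocess

# Write a single part file, then rename it from the driver via the hdfs CLI
rdd.coalesce(1).saveAsTextFile('/path/outputdir')
subprocess.check_call(
    ['hdfs', 'dfs', '-mv', '/path/outputdir/part-00000', '/path/finename.txt'])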
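
For option 2, a minimal sketch that assumes the data fits in driver memory; instead of a dedicated HDFS client library, it stages the output as one local file and pushes it with the hdfs CLI:

import subprocess
import tempfile

lines = rdd.collect()  # only safe when the dataset is small

# Write the collected records to a single local file on the driver
with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    tmp.write('\n'.join(lines) + '\n')
    local_path = tmp.name

# Upload the single file to HDFS under the exact name wanted
subprocess.check_call(
    ['hdfs', 'dfs', '-put', '-f', local_path, '/path/finename.txt'])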