I am using PySpark with Spark 1.6.
I am trying to save an RDD to an HDFS path. When I run rdd.saveAsTextFile('/path/finename.txt'), it saves a directory with files in it.
It creates a directory at the specified path instead of a file.
Can you suggest a way to create a single file at the specified HDFS path, instead of a directory with files in it?
rdd.saveAsTextFile takes a path as input and creates part files inside that folder. If you want to write the output to a single file inside the folder, then you can use
Thank you for your response.
I have tried that too. It still creates a folder named 'finename.txt' containing a single file, 'part-00000'.
Please let me know if there is a way to create a file directly at the specified path.
Let's think about the basics.
An RDD is distributed across machines, so saving it means many tasks write at the same time. If they all wrote to the same HDFS file, they could only append, and every write would contend for locks because multiple clients would be writing concurrently. It is the classic problem of distributed concurrent clients writing to one file (imagine multiple threads writing to the same log file). That is why a directory is created and each task writes its own file. Together, the files in your output directory are the output of your job.
1. rdd.coalesce(1).saveAsTextFile('/path/outputdir'), and then in your driver use hdfs dfs -mv to move part-00000 to finename.txt.
2. Assuming the data is small (since you want a single file), call rdd.collect() and write the result to HDFS from the driver, using an HDFS handle.
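
Both options could look roughly like the sketch below. It is only an illustration under some assumptions: the helper names save_as_single_file and collect_to_single_file are made up for this example, the hdfs command-line client is assumed to be on the driver's PATH, and the part file is assumed to be named part-00000 as in the output described above. The run and popen parameters exist only so the shell calls can be swapped out when trying the code without a cluster.

```python
import subprocess


def save_as_single_file(rdd, tmp_dir, final_path, run=subprocess.check_call):
    """Option 1: write one part file, then rename it (hypothetical helper)."""
    # coalesce(1) routes all records through one task, so Spark writes
    # a single part file (assumed here to be part-00000) inside tmp_dir.
    rdd.coalesce(1).saveAsTextFile(tmp_dir)
    # Assumption: the `hdfs` CLI is available on the driver's PATH.
    run(["hdfs", "dfs", "-mv", tmp_dir + "/part-00000", final_path])
    # Remove the leftover output directory once the part file has been moved.
    run(["hdfs", "dfs", "-rm", "-r", tmp_dir])


def collect_to_single_file(rdd, final_path, popen=subprocess.Popen):
    """Option 2: collect to the driver, then stream to HDFS (hypothetical helper)."""
    # collect() pulls every record into driver memory, so small data only.
    lines = rdd.collect()
    payload = "".join("%s\n" % line for line in lines).encode("utf-8")
    # `hdfs dfs -put - <dst>` reads the file contents from stdin.
    proc = popen(["hdfs", "dfs", "-put", "-", final_path], stdin=subprocess.PIPE)
    proc.communicate(payload)
    if proc.returncode != 0:
        raise RuntimeError("hdfs dfs -put failed for %s" % final_path)
```

For example, save_as_single_file(rdd, '/path/outputdir', '/path/finename.txt') would leave a plain file at /path/finename.txt. The temporary directory must not already exist, and the final path must lie outside it, since the directory is deleted afterwards.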