I am trying to create a single output file from a query, overwritten each time the query is run. However, I keep getting multiple part-00001 files. I have tried the following code. It appears to overwrite the output, but a different filename is generated each time.
example1.coalesce(1).write.option("header", "true").mode("overwrite").csv("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput4/newresults/theresults.csv")
carl = example1.show()
Can someone show me how to write code that produces a single file that is overwritten without the filename changing?
This is not possible with the default save/csv/json functions, but we can use the Hadoop FileSystem API to rename the output file.
>>> df = spark.sql("select int(1) id, string('ll') name")  # create a dataframe
>>> df.coalesce(1).write.mode("overwrite").csv("/user/shu/test/temp_dir")  # write the df to a temp dir
>>> from py4j.java_gateway import java_import
>>> java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
>>> fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
>>> file = fs.globStatus(spark._jvm.Path('/user/shu/test/temp_dir/part*'))[0].getPath().getName()  # get the name of the part file in temp_dir (globStatus returns an array)
>>> fs.rename(spark._jvm.Path('/user/shu/test/temp_dir/' + file), spark._jvm.Path('/user/shu/test/mydata.csv'))  # rename the temp-dir file to the desired filename and directory path
>>> fs.delete(spark._jvm.Path('/user/shu/test/temp_dir'), True)  # delete the temp directory
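The same write-to-a-temp-directory-then-rename pattern can be illustrated with plain Python on the local filesystem, without a Spark session. This is only a sketch of the idea; the helper name and paths here are hypothetical, and on a real cluster you would use the Hadoop FileSystem calls above instead of `shutil`.

```python
import glob
import os
import shutil
import tempfile

def collect_single_file(temp_dir, final_path):
    """Move the single part-* file out of temp_dir to final_path, then remove temp_dir."""
    part_file = glob.glob(os.path.join(temp_dir, "part*"))[0]  # locate the part file
    shutil.move(part_file, final_path)                         # rename to the desired, stable name
    shutil.rmtree(temp_dir)                                    # clean up the temp directory

# Simulate what df.coalesce(1).write.csv(temp_dir) leaves behind:
temp_dir = tempfile.mkdtemp()
with open(os.path.join(temp_dir, "part-00000-abc123.csv"), "w") as f:
    f.write("1,ll\n")

final_path = os.path.join(tempfile.gettempdir(), "mydata.csv")
collect_single_file(temp_dir, final_path)
```

Because `final_path` is fixed, rerunning the job overwrites the same file instead of producing a new part-XXXXX name each time.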
If the answer helped resolve your issue, click the Accept button below to accept it; that helps community users find solutions to these kinds of issues quickly.
Hi Shu, thanks for responding.
The solution you provided appears a little difficult for something that I thought would be relatively simple.
I will try your solution and let you know how I get on.
In the meantime, have you seen the solution provided here: