
Unable to create a single file from a PySpark query

Explorer

Hello Community,

I am trying to create a single file from a query's output, overwritten each time the query is run. However, I keep getting multiple part-00001 files. I have tried the following code. It appears to overwrite the file, but a different filename is generated each time.

example1.coalesce(1).write.option("header","true").mode("overwrite").csv("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput4/newresults") 
example1.coalesce(1).write.option("header","true").mode("overwrite").csv("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput4/newresults/theresults.csv")
carl = example1.show() 
example1.coalesce(1).write.mode("append").json("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/myresults.json")
example1.repartition(1).write.format("csv").mode("overwrite").save("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/thefile.csv")

Can someone show me how to write code that produces a single file, overwritten on each run without the filename changing?

2 REPLIES

Master Guru

@Carlton Patterson

This is not possible with the default save/csv/json writers, because Spark writes one part file per partition and generates the part filenames internally on every run. However, using the Hadoop FileSystem API we can rename the output file afterwards.

Example:

>>> df = spark.sql("select int(1) id, string('ll') name")   # create a sample dataframe
>>> df.coalesce(1).write.mode("overwrite").csv("/user/shu/test/temp_dir")   # write the df to a temp directory
>>> from py4j.java_gateway import java_import
>>> java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
>>> fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
>>> file = fs.globStatus(spark._jvm.Path('/user/shu/test/temp_dir/part*'))[0].getPath().getName()   # get the part file's name in temp_dir
>>> fs.rename(spark._jvm.Path('/user/shu/test/temp_dir/' + file), spark._jvm.Path('/user/shu/test/mydata.csv'))   # move the part file to the desired directory and filename
>>> fs.delete(spark._jvm.Path('/user/shu/test/temp_dir'), True)   # delete the temp directory
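
For convenience, the same write-then-rename steps can be wrapped in a small helper. This is only a minimal sketch, not part of the original answer: the function name write_single_csv and the paths in the usage line are hypothetical, and it assumes the paths live on the cluster's default filesystem (the one returned by FileSystem.get).

def write_single_csv(spark, df, temp_dir, final_path):
    # write the dataframe as one part file into a scratch directory
    df.coalesce(1).write.option("header", "true").mode("overwrite").csv(temp_dir)
    # use the Hadoop FileSystem API to move that part file to a stable name
    Path = spark._jvm.org.apache.hadoop.fs.Path
    fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
    part_file = fs.globStatus(Path(temp_dir + "/part*"))[0].getPath()
    fs.delete(Path(final_path), True)      # drop any previous output so rename can succeed
    fs.rename(part_file, Path(final_path))
    fs.delete(Path(temp_dir), True)        # clean up the scratch directory

# e.g. write_single_csv(spark, example1, '/user/shu/test/temp_dir', '/user/shu/test/mydata.csv')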


If the answer helped to resolve your issue, click the Accept button below to accept it; that helps other community users quickly find solutions to these kinds of issues.

Explorer

Hi Shu, thanks for responding.

The solution you provided appears a little difficult for something that I thought would be relatively simple.

I will try your solution and let you know how I get on.

In the meantime, have you seen the solution provided here:

https://forums.databricks.com/questions/2848/how-do-i-create-a-single-csv-file-from-multiple-pa.html...
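
For reference, one approach that often comes up in threads like that one (not necessarily what the linked page describes, and only workable when the result fits in driver memory) is to collect the result to the driver and write it with pandas, which sidesteps part files entirely. The output path here is hypothetical and is written to the driver's local filesystem, not the data lake:

example1.toPandas().to_csv("/tmp/theresults.csv", header=True, index=False)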