Unable to Create a single file with PySpark query


Hello Community,

I am trying to create a single file from the output of a query, overwritten each time the query is run. However, I keep getting multiple part-00001 files. I have tried the following code. It appears to overwrite the file, but a different filename is generated each time.

carl = 

Can someone show me how to write code that will result in a single file that is overwritten without changing the filename?


@Carlton Patterson

This is not possible with the default save/csv/json functions, but we can use the Hadoop API to rename the output file.


>>> df = spark.sql("select int(1) id, string('ll') name")  # create a dataframe
>>> df.coalesce(1).write.mode("overwrite").csv("/user/shu/test/temp_dir")  # write the df to a temp directory
>>> from py4j.java_gateway import java_import
>>> java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
>>> fs =
>>> file = fs.globStatus(sc._jvm.Path('/user/shu/test/temp_dir/part*'))[0].getPath().getName()  # get the name of the part file in temp_dir
>>> fs.rename(sc._jvm.Path('/user/shu/test/temp_dir/' + file), sc._jvm.Path('/user/shu/test/mydata.csv'))  # move the part file to the desired directory and filename
>>> fs.delete(sc._jvm.Path('/user/shu/test/temp_dir'), True)  # delete the temp directory
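The steps above could be collected into a small helper along these lines. This is only a sketch under stated assumptions: the `fs =` line in the answer is cut off, so obtaining the file system through `FileSystem.get(...)` is a guess at what was intended, and the function names (`hadoop_fs_and_path`, `promote_part_file`, `write_single_csv`) are hypothetical. The sketch also removes an existing target file before renaming, because `rename` on HDFS generally does not replace an existing destination, which matters when the same filename has to be overwritten on every run.

```python
def hadoop_fs_and_path(spark):
    """Return (FileSystem, Path class) through the session's py4j gateway.

    Assumption: the answer's truncated `fs =` line did something similar;
    the original text does not show it.
    """
    sc = spark.sparkContext
    hadoop_fs = sc._jvm.org.apache.hadoop.fs
    fs = hadoop_fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    return fs, hadoop_fs.Path


def promote_part_file(fs, Path, temp_dir, target):
    """Move the single part file in temp_dir to target, then drop temp_dir."""
    parts = fs.globStatus(Path(temp_dir + "/part*"))
    if parts is None or len(parts) != 1:
        raise RuntimeError("expected exactly one part file in " + temp_dir)
    src = parts[0].getPath()
    dst = Path(target)
    # Remove the previous output so the rename is not blocked by it.
    if fs.exists(dst):
        fs.delete(dst, False)
    if not fs.rename(src, dst):
        raise RuntimeError("rename to " + target + " failed")
    # Recursive delete also removes marker files such as _SUCCESS.
    fs.delete(Path(temp_dir), True)


def write_single_csv(df, fs, Path, temp_dir, target):
    """Write df as one CSV part file, then give it a fixed name."""
    # coalesce(1) collapses the data to one partition, hence one part file.
    df.coalesce(1).write.mode("overwrite").csv(temp_dir)
    promote_part_file(fs, Path, temp_dir, target)
```

In a PySpark shell this might be used as `fs, Path = hadoop_fs_and_path(spark)` followed by `write_single_csv(df, fs, Path, '/user/shu/test/temp_dir', '/user/shu/test/mydata.csv')`. Keeping the rename logic in its own function, with the file system passed in, makes it possible to check that part without a running cluster.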


If the answer helped to resolve your issue, click the Accept button below to accept it. That helps other community users find solutions quickly for these kinds of issues.


Hi Shu, thanks for responding.

The solution you provided appears a little difficult for something that I thought would be relatively simple.

I will try your solution and let you know how I get on.

In the meantime, have you seen the solution provided here:
