Unable to Create a single file with PySpark query


New Contributor

Hello Community,

I am trying to create a single output file from a query, overwritten each time the query is run. However, I keep getting multiple part-00001 files. I have tried the following code; it appears to overwrite the file, but a different filename is generated each time.

example1.coalesce(1).write.option("header","true").mode("overwrite").csv("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput4/newresults") 
example1.coalesce(1).write.option("header","true").mode("overwrite").csv("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput4/newresults/theresults.csv")
carl = example1.show() 
example1.coalesce(1).write.mode("append").json("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/myresults.json")
example1.repartition(1).write.format("csv").mode("overwrite").save("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/thefile.csv")

Can someone show me how to write code that produces a single file that is overwritten on each run without the filename changing?

2 REPLIES

Re: Unable to Create a single file with PySpark query

Super Guru

@Carlton Patterson

This is not possible with the default save/csv/json writers, but we can rename the output file using the Hadoop FileSystem API.

Example:

>>> df = spark.sql("select int(1) id, string('ll') name")  # create a sample dataframe
>>> df.coalesce(1).write.mode("overwrite").csv("/user/shu/test/temp_dir")  # write the df to a temp dir
>>> from py4j.java_gateway import java_import
>>> java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
>>> fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
>>> file = fs.globStatus(spark._jvm.Path('/user/shu/test/temp_dir/part*'))[0].getPath().getName()  # get the part filename in temp_dir
>>> fs.rename(spark._jvm.Path('/user/shu/test/temp_dir/' + file), spark._jvm.Path('/user/shu/test/mydata.csv'))  # rename to the desired filename and path
>>> fs.delete(spark._jvm.Path('/user/shu/test/temp_dir'), True)  # delete the temp directory
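For anyone who wants to try the same "write to a temp dir, then rename the single part file" pattern without an HDFS cluster handy, here is a minimal local-filesystem sketch using only the Python standard library. It only illustrates the rename/cleanup step (the `temp_dir`, `part-00000`, and `mydata.csv` names mirror the Spark example above; the part file is faked rather than written by Spark):

```python
import csv
import glob
import os
import shutil
import tempfile

def promote_single_part(temp_dir, target_path):
    """Move the single part-* file out of temp_dir to target_path, then remove temp_dir."""
    part_files = glob.glob(os.path.join(temp_dir, "part*"))
    if len(part_files) != 1:
        raise RuntimeError("expected exactly one part file, found %d" % len(part_files))
    os.replace(part_files[0], target_path)  # overwrites target_path if it already exists
    shutil.rmtree(temp_dir)                 # delete the temp directory

# Simulate Spark's coalesce(1) output: a temp dir containing one part file
workdir = tempfile.mkdtemp()
temp_dir = os.path.join(workdir, "temp_dir")
os.makedirs(temp_dir)
with open(os.path.join(temp_dir, "part-00000"), "w", newline="") as f:
    csv.writer(f).writerows([["id", "name"], [1, "ll"]])

target = os.path.join(workdir, "mydata.csv")
promote_single_part(temp_dir, target)
```

Because `os.replace` overwrites an existing target, re-running this after each query gives a stable filename, which is exactly what the Hadoop `fs.rename` call achieves on HDFS (note that Hadoop's `rename` does not overwrite, so on HDFS you may need to delete the old file first).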

-

If the answer helped resolve your issue, click the Accept button below to accept it; that helps other community users find solutions to these kinds of issues quickly.

Re: Unable to Create a single file with PySpark query

New Contributor

Hi Shu, thanks for responding.

The solution you provided appears a little difficult for something that I thought would be relatively simple.

I will try your solution and let you know how I get on.

In the meantime, have you seen the solution provided here:

https://forums.databricks.com/questions/2848/how-do-i-create-a-single-csv-file-from-multiple-pa.html...
