Support Questions

barlow · 10-20-2018

Hello Community,

I am trying to create a single file from a query's output that is overwritten each time the query is run. However, I keep getting multiple part-00001 files. I have tried the following code. Each attempt appears to overwrite the output, but a different filename is generated each time.

example1.coalesce(1).write.option("header","true").mode("overwrite").csv("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput4/newresults")

example1.coalesce(1).write.option("header","true").mode("overwrite").csv("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput4/newresults/theresults.csv")
carl = example1.show()

example1.coalesce(1).write.mode("append").json("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/myresults.json")

example1.repartition(1).write.format("csv").mode("overwrite").save("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/thefile.csv")

Can someone show me how to write code that produces a single file that is overwritten without the filename changing?

Shu_ashu · 10-20-2018

@Carlton Patterson

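This is not possible with the default save/csv/json writers, because Spark always writes a directory of part files with generated names. However, we can use the Hadoop FileSystem API to rename the part file afterwards.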

Example:

>>> df = spark.sql("select int(1) id, string('ll') name")  # create a dataframe
>>> df.coalesce(1).write.mode("overwrite").csv("/user/shu/test/temp_dir")  # write the df to a temp directory
>>> from py4j.java_gateway import java_import
>>> java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
>>> fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
>>> part_name = fs.globStatus(spark._jvm.Path('/user/shu/test/temp_dir/part*'))[0].getPath().getName()  # get the name of the part file in temp_dir
>>> fs.rename(spark._jvm.Path('/user/shu/test/temp_dir/' + part_name), spark._jvm.Path('/user/shu/test/mydata.csv'))  # move the part file to the desired path and filename
>>> fs.delete(spark._jvm.Path('/user/shu/test/temp_dir'), True)  # delete the temp directory
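
The same steps can be wrapped in a small helper so that each run replaces the previous file. This is only a sketch under a few assumptions: the function name `write_single_csv` and its parameters are made up for illustration, the existing target is deleted first because a Hadoop rename is generally not expected to replace an existing file, and the file system is looked up from the target path so that a non-default store is used when the path carries a scheme.

```python
def write_single_csv(spark, df, tmp_dir, target_file):
    """Write df as a single CSV file at target_file, replacing any earlier copy.

    Hypothetical helper: tmp_dir is a scratch directory that is deleted at the end.
    """
    Path = spark._jvm.org.apache.hadoop.fs.Path
    target = Path(target_file)
    # Look up the file system from the target path rather than the cluster default.
    fs = target.getFileSystem(spark._jsc.hadoopConfiguration())

    # 1. Let Spark write one part file into the scratch directory.
    df.coalesce(1).write.option("header", "true").mode("overwrite").csv(tmp_dir)

    # 2. Find that part file; its name is generated by Spark on every run.
    parts = fs.globStatus(Path(tmp_dir + "/part*"))
    if parts is None or len(parts) != 1:
        raise RuntimeError("expected exactly one part file in " + tmp_dir)

    # 3. Remove the previous output (assumption: rename will not overwrite it),
    #    then move the part file to the fixed name.
    if fs.exists(target):
        fs.delete(target, False)
    if not fs.rename(parts[0].getPath(), target):
        raise RuntimeError("could not rename part file to " + target_file)

    # 4. Remove the scratch directory, including Spark's _SUCCESS marker.
    fs.delete(Path(tmp_dir), True)
```

With the paths from the example above, the call would be `write_single_csv(spark, df, "/user/shu/test/temp_dir", "/user/shu/test/mydata.csv")`. Note that `coalesce(1)` funnels all rows through a single task, so this is only practical for output that fits comfortably on one executor.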

If this answer helped resolve your issue, click the Accept button below. That helps other community users find solutions to this kind of issue quickly.

barlow · 10-21-2018

Hi Shu, thanks for responding.

The solution you provided appears a little difficult for something that I thought would be relatively simple.

I will try your solution and let you know how I get on.

In the meantime, have you seen the solution provided here:

https://forums.databricks.com/questions/2848/how-do-i-create-a-single-csv-file-from-multiple-pa.html...
