Member since
08-05-2018
73
Posts
0
Kudos Received
0
Solutions
10-21-2018
06:28 AM
Hi Shu, thanks for responding. The solution you provided appears a little difficult for something that I thought would be relatively simple. I will try your solution and let you know how I get on. In the meantime, have you seen the solution provided here: https://forums.databricks.com/questions/2848/how-do-i-create-a-single-csv-file-from-multiple-pa.html?childToView=12091
... View more
10-20-2018
02:46 PM
Hello Community, I'm trying to create a single file from an output query that is overwritten each time the query is run. However, I keep getting multiple part-00001 files. I have tried the following code; it appears to overwrite the file, but a different filename is generated each time:
example1.coalesce(1).write.option("header","true").mode("overwrite").csv("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput4/newresults")
example1.coalesce(1).write.option("header","true").mode("overwrite").csv("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput4/newresults/theresults.csv")
carl = example1.show()
example1.coalesce(1).write.mode("append").json("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/myresults.json")
example1.repartition(1).write.format("csv").mode("overwrite").save("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/thefile.csv")
Can someone show me how to write code that will result in a single file that is overwritten without changing the filename?
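A note on why the filename keeps changing: Spark writes output as a directory of part-* files, and the writer API offers no way to choose the final filename. A common workaround (a sketch, not Spark's own API — the staging path and helper name below are hypothetical) is to coalesce to one partition, write to a staging directory, then rename the single part file yourself. On ADLS the rename would go through the Hadoop FileSystem API; this sketch shows the rename step with only the standard library, as it would work on a local path:

```python
import glob
import os
import shutil

def publish_single_csv(staging_dir: str, final_path: str) -> None:
    """Move the lone part-* CSV that Spark wrote into staging_dir
    to a stable filename, overwriting any previous run's output."""
    part_files = glob.glob(os.path.join(staging_dir, "part-*.csv"))
    assert len(part_files) == 1, "expected exactly one part file (use coalesce(1))"
    if os.path.exists(final_path):
        os.remove(final_path)               # overwrite the previous run's file
    shutil.move(part_files[0], final_path)  # stable name from now on
    shutil.rmtree(staging_dir)              # drop the staging directory

# The Spark write that would precede the rename (needs a live session):
# example1.coalesce(1).write.option("header", "true") \
#     .mode("overwrite").csv("/tmp/newresults_staging")
# publish_single_csv("/tmp/newresults_staging", "/tmp/theresults.csv")
```

The commented lines show where the ordinary coalesce(1) write would run before the rename.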
... View more
10-19-2018
10:13 AM
Hi guys, I'm sorry if the question seems a little confusing. Basically, I would just like to be able to save to a single file and the file to be overwritten each time it is saved. Thanks
... View more
10-19-2018
09:49 AM
Sorry guys, I forgot to add the code: example1 = spark.sql("""SELECT
CF.CountryName AS CountryCarsSold
,COUNT(CF.CountryName) AS NumberCountry
,MAX(CB.SalesDetailsID) AS TotalSold
FROM Data_SalesDetails CB
INNER JOIN Data_Sales CD
ON CB.SalesID = CD.SalesID
INNER JOIN Data_Customer CG
ON CD.CustomerID = CG.CustomerID
INNER JOIN Data_Country CF
ON CG.Country = CF.CountryISO2
GROUP BY CF.CountryName""")
example1.coalesce(1).write.mode("append").json("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/myresults.json")
... View more
10-18-2018
10:45 PM
Sorry, I forgot to add the query: example1 = spark.sql("""SELECT
CF.CountryName AS CountryCarsSold
,COUNT(CF.CountryName) AS NumberCountry
,MAX(CB.SalesDetailsID) AS TotalSold
FROM Data_SalesDetails CB
INNER JOIN Data_Sales CD
ON CB.SalesID = CD.SalesID
INNER JOIN Data_Customer CG
ON CD.CustomerID = CG.CustomerID
INNER JOIN Data_Country CF
ON CG.Country = CF.CountryISO2
GROUP BY CF.CountryName""")
example1.coalesce(1).write.mode("append").json("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/myresults.json")
... View more
10-18-2018
10:19 PM
Hello community, I'm using the following script to output the results of a Spark SQL query to a file in Azure Data Lake Store. However, instead of creating a file called myresults.json and publishing the results to it, the script publishes the results to a randomly named file like part-0000-tid ... see image. Can someone let me know how to make sure the file is created and overwritten each time the pyspark query is run? Thanks
... View more
08-13-2018
04:11 PM
Hi Sandeep, thanks. It works very well. Thank you
... View more
08-13-2018
12:19 PM
Hi Sandeep, I should be clear about what I'm trying to achieve. I would like the output to include only the delta change. I thought that having the current date would be sufficient, but I just realized that having just the current date won't let me know if there has been a change to the data. Therefore, while you're helping me, could you also help me figure out how to include the current date and the delta change in the data? Much appreciated. Cheers
... View more
08-13-2018
12:08 PM
I'm using Python version 3, and print(currentdate) worked. Thanks. However, when I run the full query I get the following error:
<ipython-input-22-8c743396e037> in <module>()
     18 FROM HumanResources_vEmployeeDepartment
     19 ORDER BY FirstName, LastName DESC""")
---> 20 counts.coalesce(1).write.csvCONCAT("/home/packt/Downloads/myresults7-"+currentdate+".csv")
AttributeError: 'DataFrameWriter' object has no attribute 'csvCONCAT'
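On the csvCONCAT error: DataFrameWriter has no such method, which is exactly what the AttributeError says; the concatenation belongs in the path string before it is handed to the ordinary .csv() writer. A minimal sketch (the Spark call is left as a comment since it needs a live session; the path reuses the one from the traceback):

```python
import datetime

# Build the dated output path first, then hand it to the normal writer.
currentdate = datetime.datetime.now().strftime("%Y-%m-%d")
output_path = "/home/packt/Downloads/myresults7-" + currentdate + ".csv"

# counts.coalesce(1).write.csv(output_path)  # the ordinary DataFrameWriter.csv call
```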
... View more
08-13-2018
11:20 AM
I now get the following error:
File "<ipython-input-13-588f4561c3f0>", line 7
    print currentdate()
          ^
SyntaxError: invalid syntax
The invalid syntax is currentdate(). Without the parentheses I get the following error:
File "<ipython-input-14-8d268659919b>", line 1
    print currentdate
          ^
SyntaxError: Missing parentheses in call to 'print'
... View more
08-13-2018
11:05 AM
The syntax error is with 'currentdate'
... View more
08-13-2018
11:04 AM
Sandeep, thanks for reaching out. I'm getting the following error from the import statement:
File "<ipython-input-7-3dab170099f6>", line 3
    import datetime currentdate = datetime.datetime.now().strftime("%Y-%m-%d")
                    ^
SyntaxError: invalid syntax
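The caret points at the second statement sharing the import's line; each statement needs its own line, which clears the SyntaxError. A minimal sketch:

```python
import datetime

# The import and the assignment on separate lines:
currentdate = datetime.datetime.now().strftime("%Y-%m-%d")
print(currentdate)  # print needs parentheses in Python 3
```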
... View more
08-13-2018
09:52 AM
Hello community,
I have created the following pyspark query: from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_vEmployeeDepartment.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('HumanResources_vEmployeeDepartment')
counts = spark.sql("""SELECT
FirstName
,LastName
,JobTitle
FROM HumanResources_vEmployeeDepartment
ORDER BY FirstName, LastName DESC""")
counts.coalesce(1).write.csv("/home/packt/Downloads/myresults3.csv")
I would like to add the current date and time to the file called myresults3. I think the code would look something like the following:
counts.coalesce(1).write.csvCONCAT("/home/packt/Downloads/'myresults3'-CURRENTDATE.csv")
I'm sure I'm way off the mark with the above attempt, but I'm sure you can see what I'm trying to achieve.
Any help will be appreciated. Cheers
Carlton
... View more
Labels:
- Apache Spark
08-13-2018
09:45 AM
Hello community, I have created the following pyspark query: from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_vEmployeeDepartment.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('HumanResources_vEmployeeDepartment')
counts = spark.sql("""SELECT
FirstName
,LastName
,JobTitle
FROM HumanResources_vEmployeeDepartment
ORDER BY FirstName, LastName DESC""")
counts.coalesce(1).write.csv("/home/packt/Downloads/myresults3.csv")
I would like to add the current date and time to the file called myresults3. I think the code would look something like the following:
counts.coalesce(1).write.csvCONCAT("/home/packt/Downloads/'myresults3'-CURRENTDATE.csv")
I'm sure I'm way off the mark with the above attempt, but I'm sure you can see what I'm trying to achieve. Any help will be appreciated. Cheers Carlton
... View more
Labels:
- Apache Spark
08-13-2018
09:42 AM
Hello community, I have created the following pyspark query: from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_vEmployeeDepartment.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('HumanResources_vEmployeeDepartment')
counts = spark.sql("""SELECT
FirstName
,LastName
,JobTitle
FROM HumanResources_vEmployeeDepartment
ORDER BY FirstName, LastName DESC""")
counts.coalesce(1).write.csv("/home/packt/Downloads/myresults3.csv")
I would like to add the current date and time to the file called myresults3. I think the code would look something like the following:
counts.coalesce(1).write.csvCONCAT("/home/packt/Downloads/'myresults3'-CURRENTDATE.csv")
I'm sure I'm way off the mark with the above attempt, but I'm sure you can see what I'm trying to achieve. Any help will be appreciated. Cheers Carlton
... View more
Labels:
- Apache Spark
08-06-2018
09:02 PM
Is there a way to get the results with the header info?
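A hedged pointer for the header question: Spark's CSV writer takes a header option (left as a comment below since it needs a live SparkSession); the csv-module lines just illustrate the extra first line it produces, with a hypothetical data row:

```python
# Spark sketch (assumed session; path from the earlier post):
# myresults.coalesce(1).write.option("header", "true") \
#     .csv("/home/packt/Downloads/myresults3.csv")

import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["FirstName", "LastName", "JobTitle"])  # the header row
writer.writerow(["Aaron", "Alonso", "Technician"])      # hypothetical data row
header_line = buf.getvalue().splitlines()[0]
```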
... View more
08-06-2018
08:56 PM
Felix, thank you so much. It worked like a dream
... View more
08-06-2018
11:32 AM
Hello community, The pyspark query below produces the output shown in the image. The query is as follows: #%%
import findspark
findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/HumanResources_vEmployeeDepartment.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('HumanResources_vEmployeeDepartment')
myresults = spark.sql("""SELECT
FirstName
,LastName
,JobTitle
FROM HumanResources_vEmployeeDepartment
ORDER BY FirstName, LastName DESC""")
myresults.show()
Can someone show me how to save the results to a text/CSV file (or any file, please)? Thanks, Carlton
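One angle on saving: show() only prints to the console; persisting goes through the DataFrameWriter. A hedged sketch — the Spark call is a comment because it needs a live session and writes a directory of part files, while the standard-library lines below show the target file shape with a hypothetical row:

```python
import csv
import os
import tempfile

# Spark sketch (assumed session; note the path names a directory):
# myresults.coalesce(1).write.csv("/home/packt/Downloads/myresults_out")

def save_rows_as_csv(rows, path):
    """Write data rows to a CSV file, one row per line."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

out_path = os.path.join(tempfile.gettempdir(), "myresults_rows.csv")
save_rows_as_csv([("Zoe", "Young", "Engineer")], out_path)  # hypothetical row
```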
... View more
Labels:
- Apache Spark
08-05-2018
05:15 PM
OK, as I'm not getting much assistance with my original question, I thought I would try to figure out the problem myself. So I rewrote the pyspark.sql query as follows: #%%
import findspark
findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/Person_Person.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('Person_Person')
myresults = spark.sql("""SELECT
PersonType
,COUNT(PersonType) AS `Person Count`
FROM Person_Person
GROUP BY PersonType""")
myresults.collect()
result = myresults.collect()
result
result.saveAsTextFile("test")
However, I'm now getting the following error message:
AttributeError: 'list' object has no attribute 'saveAsTextFile'
I think this could be an easier issue to resolve, so if someone could help resolve it, that would be most appreciated. Thanks
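On that error: collect() returns a plain Python list, so saveAsTextFile (an RDD method) is gone by that point. The collected list can be written with ordinary file I/O instead — a sketch, with the Spark part left as a comment and a hypothetical stand-in for the collected rows:

```python
import os
import tempfile

def save_rows_as_text(rows, path):
    """Write one element per line, mimicking saveAsTextFile's one-record-per-line output."""
    with open(path, "w") as f:
        for row in rows:
            f.write(str(row) + "\n")

# result = myresults.collect()       # the Spark part, needs a live session
result = [("EM", 273), ("SC", 753)]  # hypothetical stand-in for collected rows
text_path = os.path.join(tempfile.gettempdir(), "myresults_sketch.txt")
save_rows_as_text(result, text_path)
```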
... View more
08-05-2018
02:41 AM
Hello community, My first post here, so please let me know if I'm not following protocol. I have written a pyspark.sql query as shown below. I would like the query results to be sent to a text file, but I get the error: AttributeError: 'DataFrame' object has no attribute 'saveAsTextFile' Can someone take a look at the code and let me know where I'm going wrong? #%%
import findspark
findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession
def main():
spark = SparkSession.builder.appName('aggs').getOrCreate()
df = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/sales_info.csv',inferSchema=True,header=True)
df.createOrReplaceTempView('sales_info')
example8 = spark.sql("""SELECT
*
FROM sales_info
ORDER BY Sales DESC""")
example8.saveAsTextFile("juyfd")
main()
Any help would be appreciated. Carlton
... View more
Labels:
- Apache Spark
02-03-2018
08:11 AM
Hello Community, My apologies for the confusing subject question. I have created the following Hadoop HQL script and deployed it in both Hadoop on Microsoft Azure and Ambari. DROP TABLE IF EXISTS HiveSampleIn;
CREATE EXTERNAL TABLE HiveSampleIn
(
anonid int,
eprofileclass int,
fueltypes STRING,
acorn_category int,
acorn_group STRING,
acorn_type int,
nuts4 STRING,
lacode STRING,
nuts1 STRING,
gspgroup STRING,
ldz STRING,
gas_elec STRING,
gas_tout STRING
)
partitioned by ( acorn_category int, acorn_categorycount int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '10' STORED AS TEXTFILE LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/samplein/';
DROP TABLE IF EXISTS HiveSampleOut;
CREATE EXTERNAL TABLE HiveSampleOut
(
acorn_category int,
acorn_categorycount int )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '10' STORED AS TEXTFILE LOCATION 'wasb://adfgetstarted@geogstoreacct.blob.core.windows.net/sampleout/';
INSERT OVERWRITE TABLE HiveSampleOut
Select
acorn_category,
count(*) as acorn_categorycount
FROM HiveSampleIn GROUP BY acorn_category;
In Ambari the results are provided as a .csv file that looks as follows (notice the column headings in red). However, in Azure the results are provided as a text file (which is fine), but it doesn't have the column headings as shown. Can someone please let me know how to include the column headings in the text file?
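One hedged possibility for the headings (an assumption based on standard Hive behaviour, not verified on this cluster): the hive.cli.print.header setting adds column names to console query output. It does not change the contents of files written by INSERT OVERWRITE, where a common workaround is to route the output through string-typed columns and UNION in a literal header row.

```sql
-- Show column headings in Hive CLI / Beeline query output
-- (session setting; does not affect INSERT OVERWRITE file contents):
SET hive.cli.print.header=true;

SELECT acorn_category, COUNT(*) AS acorn_categorycount
FROM HiveSampleIn
GROUP BY acorn_category;
```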
... View more
Labels:
- Apache Ambari
- Apache Hadoop
02-02-2018
10:26 PM
Hello Community, Can someone please provide a link to good resources where I can learn how to write Hive queries (HQL)? Or if you could recommend some good books, that would also be great. Cheers Carlton
... View more
Labels:
- Apache Hive
02-01-2018
11:59 PM
Hi Jay, I think I'm going to call it a night. Thanks for your help tonight mate. I think the only solution left is a complete rebuild
... View more
02-01-2018
11:51 PM
Jay, I just found the following link that might explain it https://github.com/hortonworks/data-tutorials/issues/411
... View more
02-01-2018
11:34 PM
Hi Jay, I have highlighted port 2222 from the command you suggested, see image
... View more
02-01-2018
11:16 PM
Hi Jay, It's hosted on ESXi 5.5 and I access it using vSphere client
... View more
02-01-2018
11:10 PM
Hi Jay, Will a reboot allow me to enter on port 2222 ? At the moment I keep on getting the error below
... View more