Support Questions

barlow · 10-18-2018

Hello community,

I'm using the following script to write the results of a Spark SQL query to a file in Azure Data Lake Store. However, instead of creating a file called myresults.json and writing the results to it, the script writes the results to a randomly named file such as part-0000-tid ... (see image).

Can someone let me know how to make sure the file is created and overwritten each time the PySpark query is run?

Thanks

barlow · 10-18-2018

Sorry, I forgot to add the query:

example1 = spark.sql("""SELECT
  CF.CountryName AS CountryCarsSold
 ,COUNT(CF.CountryName) AS NumberCountry
 ,MAX(CB.SalesDetailsID) AS TotalSold
FROM Data_SalesDetails CB
INNER JOIN Data_Sales CD
  ON CB.SalesID = CD.SalesID
INNER JOIN Data_Customer CG
  ON CD.CustomerID = CG.CustomerID
INNER JOIN Data_Country CF
  ON CG.Country = CF.CountryISO2
GROUP BY CF.CountryName""")
example1.coalesce(1).write.mode("append").json("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/myresults.json")

barlow · 10-19-2018

Hi guys, I'm sorry if the question seems a little confusing. Basically, I would just like to save the results to a single file and have that file overwritten each time the query is run.

Thanks
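For readers with the same question, here is a minimal sketch of one common workaround, not a definitive answer. Spark treats the path passed to .json() as a directory and writes part-* files inside it, and mode("append") adds a new part file on every run. The sketch writes with mode("overwrite") to a temporary directory and then moves the single part file to the wanted name. The helper names write_single_json and promote_part_file are hypothetical, and the code assumes the target is a path that Python's os and shutil can reach (local disk or a mount), not a raw adl:// URI.

```python
import glob
import os
import shutil


def promote_part_file(tmp_dir, target_path):
    """Move the single part-* file out of tmp_dir and name it target_path."""
    parts = glob.glob(os.path.join(tmp_dir, "part-*"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    # Remove the previous result, whether it is a file or an old Spark
    # output directory that happens to carry the same name.
    if os.path.isdir(target_path):
        shutil.rmtree(target_path)
    elif os.path.exists(target_path):
        os.remove(target_path)
    shutil.move(parts[0], target_path)
    shutil.rmtree(tmp_dir)  # drop _SUCCESS and any other leftovers


def write_single_json(df, target_path, tmp_dir=None):
    """Write a Spark DataFrame as one JSON file, replacing any earlier file."""
    tmp_dir = tmp_dir or target_path + ".tmp"
    # coalesce(1) yields a single part file; mode("overwrite") replaces the
    # contents of tmp_dir on each run instead of appending to them.
    df.coalesce(1).write.mode("overwrite").json(tmp_dir)
    promote_part_file(tmp_dir, target_path)
```

With the query above, the call would look like write_single_json(example1, "/some/mounted/path/myresults.json"), where the path is a placeholder. For an adl:// URI the same rename step would have to go through the Hadoop FileSystem API or whatever file utility the platform provides, since os and shutil cannot see that store directly.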
