Support Questions

barlow · 10-18-2018

Hello community,

I'm using the following script to write the results of a Spark SQL query to a file in Azure Data Lake Store. However, instead of creating a file called myresults.json and writing the results to it, the script writes the results to a file with a generated name like part-0000-tid ...

Can someone let me know how to make sure the file is created and overwritten each time the pyspark query is run?

Thanks

barlow · 10-18-2018

Sorry, I forgot to add the query:

example1 = spark.sql("""SELECT
  CF.CountryName AS CountryCarsSold
 ,COUNT(CF.CountryName) AS NumberCountry
 ,MAX(CB.SalesDetailsID) AS TotalSold
FROM Data_SalesDetails CB
INNER JOIN Data_Sales CD
  ON CB.SalesID = CD.SalesID
INNER JOIN Data_Customer CG
  ON CD.CustomerID = CG.CustomerID
INNER JOIN Data_Country CF
  ON CG.Country = CF.CountryISO2
GROUP BY CF.CountryName""")
example1.coalesce(1).write.mode("append").json("adl://carlslake.azuredatalakestore.net/jfolder2/outputfiles/myoutput/myresults.json")

barlow · 10-19-2018

Hi guys, I'm sorry if the question seems a little confusing. Basically, I would just like to save the results to a single file, and have that file overwritten each time the query is run.

Thanks
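
For anyone landing on this thread, here is one possible approach, offered as a sketch under stated assumptions rather than a definitive answer. Spark's DataFrameWriter treats the path passed to json() as a directory and writes part-* files inside it, so the generated name is expected. Also, mode("append") adds a new part file on every run, whereas mode("overwrite") replaces the directory's contents. To end up with one file under a fixed name, a common workaround is to write to a temporary directory and then rename the single part file through the Hadoop FileSystem API. The helper names write_single_json and pick_part_file and the tmp_dir argument are hypothetical. The sketch relies on the private _jvm and _jsc attributes of the SparkContext, which are widely used but not a public API, and it has not been tested against Azure Data Lake Store.

```python
def pick_part_file(names):
    """Return the single part-* data file from a directory listing.

    Marker and checksum files such as _SUCCESS or .part-*.crc are ignored
    because they do not start with "part-".
    """
    parts = [n for n in names if n.startswith("part-")]
    if len(parts) != 1:
        raise ValueError("expected exactly one part file, found %d" % len(parts))
    return parts[0]


def write_single_json(spark, df, tmp_dir, target_file):
    """Write df as one JSON file at target_file, replacing any previous copy.

    Hypothetical helper: tmp_dir is a scratch directory on the same
    filesystem as target_file and is deleted at the end.
    """
    # Spark always writes a directory; "overwrite" replaces it on every run.
    df.coalesce(1).write.mode("overwrite").json(tmp_dir)

    # Go through the Hadoop FileSystem API (via py4j) so that the rename
    # happens on the same filesystem the data was written to.
    jvm = spark.sparkContext._jvm
    Path = jvm.org.apache.hadoop.fs.Path
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    fs = Path(tmp_dir).getFileSystem(conf)

    names = [s.getPath().getName() for s in fs.listStatus(Path(tmp_dir))]
    part = pick_part_file(names)

    target = Path(target_file)
    if fs.exists(target):
        # Remove the previous result (file or directory) before renaming.
        fs.delete(target, True)
    fs.rename(Path(tmp_dir, part), target)
    fs.delete(Path(tmp_dir), True)
```

With the DataFrame from the query above, the call would look something like write_single_json(spark, example1, tmp_dir, target_file), where tmp_dir is a scratch directory next to the desired output and target_file is the full path ending in myresults.json. If a directory containing a single part file is acceptable, simply changing mode("append") to mode("overwrite") in the original script should be enough to stop files accumulating between runs.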
