Created 03-16-2017 10:13 PM
Is it possible to append to a destination file when using writeStream in Spark 2?
Example:
I've got a Kafka topic and a stream running and consuming data as it is written to the topic. I want to perform some transformations and append to an existing CSV file (this can be local for now, but eventually I'd want it to be on HDFS).
query = csv_select \
    .writeStream \
    .format("csv") \
    .option("format", "append") \
    .option("path", "/destination_path/") \
    .option("checkpointLocation", "/checkpoint_path") \
    .outputMode("append") \
    .start()
This is the code I'm using, and when I produce new data to the topic, it outputs as a new CSV file in /destination_path. Any advice on how to append to an existing file rather than create a new one every time?
NOTE: I'm using CSV for simplicity here, but if you need to show an example in Parquet or some other format, that would be helpful too.
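For reference, here is a rough sketch of the full pipeline I'm running (the broker address, topic name, and schema below are placeholders, not my actual values):

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType

# Placeholder schema for the JSON payload on the topic
schema = StructType().add("id", StringType()).add("value", StringType())

# Subscribe to the Kafka topic (broker and topic name are placeholders)
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "my_topic") \
    .load()

# Parse the Kafka value bytes into columns; this is the DataFrame I write out as CSV
csv_select = kafka_df \
    .select(from_json(col("value").cast("string"), schema).alias("data")) \
    .select("data.*")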
EDIT: The example I'm working from is here: https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1...
Created 03-17-2017 01:27 AM
Hi @Sean Byrne
I had a similar question, but it's common in distributed systems to see many "part" file outputs. This is because you typically have many partitions, across multiple nodes, writing into the same output directory, so the writers don't interfere with one another.
However, you can run a Spark job against this directory in order to create one single CSV file. Here's the code:
# Use PySpark to read in all "part" files
allfiles = spark.read.option("header", "false").csv("/destination_path/part-*.csv")

# Output as a single CSV file
allfiles.coalesce(1).write.format("csv").option("header", "false").save("/destination_path/single_csv_file/")
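Note that coalesce(1) funnels all of the data through a single task before writing, so this is only practical while the combined output is small enough for one executor to handle. Spark will also still write the result as a single part-* file inside /destination_path/single_csv_file/ rather than as a bare file at that exact path.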
Another option would be to use format("memory") and then you could execute periodic in-memory queries against the Spark Stream. These queries could save the in-memory table to a single CSV (or other format).
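As a rough sketch of that approach (the query name and output path here are just placeholders), assuming csv_select is the streaming DataFrame from your question:

# Write the stream into an in-memory table named by queryName
memory_query = csv_select.writeStream \
    .format("memory") \
    .queryName("csv_snapshot") \
    .outputMode("append") \
    .start()

# Periodically dump the in-memory table to a single CSV snapshot
snapshot = spark.sql("SELECT * FROM csv_snapshot")
snapshot.coalesce(1).write \
    .format("csv") \
    .option("header", "false") \
    .mode("overwrite") \
    .save("/destination_path/single_csv_file/")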
If I come across any way to output to a single CSV from Structured Streaming, I will be sure to post it. Hope this is helpful.