Explorer
Posts: 10
Registered: ‎08-23-2017

Storing Spark Streaming Data in HDFS and/or Kudu

Hi,

How do I store Spark Streaming data into:

1. HDFS

2. Kudu

I am following the example below:

https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py

I am using Spark 2.2 (I also have Spark 1.6 installed). I am using Spark Streaming with Kafka, where Spark Streaming acts as the consumer.

Can you please tell me how to store Spark Streaming data in HDFS using:

1. Spark Streaming

2. Structured Streaming

I am using PySpark.

Thank you.

Explorer
Posts: 24
Registered: ‎06-13-2017

Re: Storing Spark Streaming Data in HDFS and/or Kudu

I don't think there is Kudu support in PySpark yet. See KUDU-1603.

Explorer
Posts: 10
Registered: ‎08-23-2017

Re: Storing Spark Streaming Data in HDFS and/or Kudu

How do I store Spark Structured Streaming data in HDFS?

Explorer
Posts: 8
Registered: ‎04-26-2017

Re: Storing Spark Streaming Data in HDFS and/or Kudu

Hi,

With the Spark Streaming (DStream) API, each batch arrives as an RDD, which you can write to HDFS inside foreachRDD. Sample code in Java is as follows:

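// Assumes `records` is a JavaDStream<String> and `outputPath` is an HDFS directory path.
// Needs: org.apache.spark.api.java.JavaRDD,
//        org.apache.spark.api.java.function.VoidFunction2,
//        org.apache.spark.streaming.Time
records.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
    private static final long serialVersionUID = 1L;

    @Override
    public void call(JavaRDD<String> rdd, Time time) throws Exception {
        // Skip empty batches so no empty output directories are created.
        if (!rdd.isEmpty()) {
            // One output directory per batch, named by the batch time in milliseconds.
            rdd.saveAsTextFile(outputPath + "/" + time.milliseconds());
        }
    }
});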
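
Since the question is about PySpark, here is a rough Python equivalent of the Java snippet above. Treat it as a sketch under assumptions rather than a tested job: OUTPUT_PATH, the consumer group name, and the helper names batch_dir, save_batch and main are made up for illustration, and the Kafka receiver call follows the kafka_wordcount.py example linked in the question (pyspark.streaming.kafka, available in Spark 1.6 and 2.2 but removed in later releases). In PySpark the two-argument foreachRDD callback receives the batch time as a datetime, so the directory name uses a formatted timestamp instead of milliseconds.

```python
OUTPUT_PATH = "hdfs:///tmp/streaming_out"  # hypothetical HDFS directory, adjust as needed


def batch_dir(base, name):
    """Build a per-batch output directory such as <base>/20170823-120005."""
    return "{}/{}".format(base, name)


def save_batch(time, rdd):
    """Write one micro-batch to its own directory, skipping empty batches.

    Returns None on purpose: PySpark treats a truthy return value from a
    foreachRDD callback as an RDD.
    """
    if rdd.isEmpty():
        return
    # `time` is a datetime.datetime; second resolution is enough as long as
    # the batch interval is at least one second.
    rdd.saveAsTextFile(batch_dir(OUTPUT_PATH, time.strftime("%Y%m%d-%H%M%S")))


def main(zk_quorum, topic):
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="KafkaToHdfs")
    ssc = StreamingContext(sc, 10)  # 10-second batches (arbitrary choice)

    # Same receiver-based call as in kafka_wordcount.py; the group id is a placeholder.
    stream = KafkaUtils.createStream(ssc, zk_quorum, "hdfs-writer-group", {topic: 1})
    lines = stream.map(lambda kv: kv[1])  # keep only the message value

    lines.foreachRDD(save_batch)

    ssc.start()
    ssc.awaitTermination()
```

To try it, call main with your ZooKeeper quorum and topic name from a script submitted with spark-submit, with the Kafka streaming package for your Spark version on the classpath. If empty batches are not a concern, DStream.saveAsTextFiles(prefix) is a shorter alternative that writes every batch.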

Hope this helps.
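
For the Structured Streaming part of the question, a minimal PySpark sketch could look like the one below. It assumes Spark 2.2 with the spark-sql-kafka-0-10 package on the classpath and Kafka 0.10 or newer brokers; the function names and all paths are placeholders for illustration. The file sink only supports append mode and needs a checkpoint location.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for the Structured Streaming Kafka source."""
    return {"kafka.bootstrap.servers": bootstrap_servers, "subscribe": topic}


def file_sink_options(out_dir, checkpoint_dir):
    """Options for the file sink: output directory plus checkpoint directory."""
    return {"path": out_dir, "checkpointLocation": checkpoint_dir}


def kafka_to_hdfs(bootstrap_servers, topic, out_dir, checkpoint_dir):
    # Imported here so the option helpers can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("StructuredKafkaToHdfs").getOrCreate()

    # Kafka delivers key and value as binary; cast the value to a string column.
    lines = (spark.readStream
             .format("kafka")
             .options(**kafka_source_options(bootstrap_servers, topic))
             .load()
             .selectExpr("CAST(value AS STRING) AS value"))

    # Append each micro-batch as Parquet files under out_dir on HDFS.
    query = (lines.writeStream
             .format("parquet")
             .options(**file_sink_options(out_dir, checkpoint_dir))
             .outputMode("append")
             .start())

    query.awaitTermination()
```

Both out_dir and checkpoint_dir would be HDFS paths of your choosing. Switching format("parquet") to format("text") should also work here, because the DataFrame has a single string column.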

 

Thanks,

Ravi
