Storing Spark Streaming Data in HDFS and/or Kudu

Explorer

Hi,

How do I store Spark Streaming data in:

1. HDFS

2. Kudu

I am following this example:

https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py

I am using Spark 2.2 (Spark 1.6 is also installed). I am using Spark Streaming with Kafka, with Spark Streaming acting as the consumer.

Could you please tell me how to store the streaming data in HDFS using:

1. Spark Streaming

2. Structured Streaming

I am using PySpark.

Thank you.

4 REPLIES

Re: Storing Spark Streaming Data in HDFS and/or Kudu

Explorer

I don't think PySpark has Kudu support yet; see KUDU-1603.

 

Re: Storing Spark Streaming Data in HDFS and/or Kudu

Explorer

How do I store Spark Structured Streaming data in HDFS?
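
For reference, one way this is commonly done is to read from Kafka with `readStream` and write with the built-in file sink. The following is only a minimal sketch, not a tested job: the function names, the Parquet output format, and all paths are assumptions for illustration, and it assumes the `spark-sql-kafka-0-10` package is available to the Spark 2.2 application.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for the Structured Streaming Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
    }


def file_sink_options(output_path, checkpoint_path):
    """Options for the file sink; it needs a checkpoint location."""
    return {
        "path": output_path,
        "checkpointLocation": checkpoint_path,
    }


def start_hdfs_sink(spark, bootstrap_servers, topic, output_path, checkpoint_path):
    """Start a streaming query that appends Kafka message values to HDFS.

    `spark` is an existing SparkSession; all other arguments are
    hypothetical values supplied by the caller.
    """
    events = (
        spark.readStream
        .format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
        # Kafka values arrive as binary; cast them to strings.
        .selectExpr("CAST(value AS STRING) AS value")
    )
    return (
        events.writeStream
        .format("parquet")  # assumed format; the file sink supports others
        .options(**file_sink_options(output_path, checkpoint_path))
        .outputMode("append")  # the file sink only supports append mode
        .start()
    )
```

A call might look like `start_hdfs_sink(spark, "broker1:9092", "events", "hdfs:///data/events", "hdfs:///checkpoints/events").awaitTermination()`, where the broker address, topic, and paths are placeholders.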

Re: Storing Spark Streaming Data in HDFS and/or Kudu

Explorer

Hi,

 

You need to generate an RDD of structured data and write it to HDFS. Sample code in Java is as follows:

 

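import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.streaming.Time;

// records is assumed to be a JavaDStream<String>; outputPath is the HDFS base directory.
records.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
    private static final long serialVersionUID = 1L;

    @Override
    public void call(JavaRDD<String> rdd, Time time) throws Exception {
        // Skip empty batches; isEmpty() avoids counting every record.
        if (!rdd.isEmpty()) {
            // One output directory per batch, named by the batch time in ms.
            rdd.saveAsTextFile(outputPath + "/" + time.milliseconds());
        }
    }
});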
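
Since the question mentions PySpark, the same idea can be sketched in Python. This is a rough sketch under a few assumptions: the function names, application name, consumer group, and batch interval are made up for illustration, and the source is the receiver-based `KafkaUtils.createStream` used in kafka_wordcount.py (the `pyspark.streaming.kafka` module that ships with Spark 2.2).

```python
def batch_output_path(base_path, batch_time):
    """Return a per-batch directory under base_path, named by the batch time."""
    stamp = batch_time.strftime("%Y%m%d-%H%M%S-%f")
    return "{}/{}".format(base_path.rstrip("/"), stamp)


def make_batch_saver(base_path):
    """Build a foreachRDD callback that writes each non-empty batch as text."""
    def save_batch(batch_time, rdd):
        # With a two-argument callback, PySpark passes the batch time
        # (a datetime.datetime) followed by the batch RDD.
        if not rdd.isEmpty():
            rdd.saveAsTextFile(batch_output_path(base_path, batch_time))
    return save_batch


def main(zk_quorum, topic, output_path, batch_seconds=10):
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="KafkaToHdfs")  # hypothetical app name
    ssc = StreamingContext(sc, batch_seconds)

    # Same receiver-based source as kafka_wordcount.py.
    kvs = KafkaUtils.createStream(
        ssc, zk_quorum, "spark-streaming-consumer", {topic: 1})
    lines = kvs.map(lambda kv: kv[1])  # keep only the message value

    lines.foreachRDD(make_batch_saver(output_path))

    ssc.start()
    ssc.awaitTermination()
```

Calling `main("zkhost:2181", "events", "hdfs:///data/events")` (placeholder values) would write one text directory per non-empty batch, much like the Java version.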

Hope this helps.

 

Thanks,

Ravi

Re: Storing Spark Streaming Data in HDFS and/or Kudu

New Contributor

God bless you, munna143