
Spark Streaming + Kafka + Multiple database tables for output


New Contributor

Hi all,

 

I am just starting to work with Spark Streaming and Kafka, and I am trying to find the best approach for this kind of scenario:

 

- a Spark application is running and consuming Kafka messages

- the Kafka messages are nested JSON

- after processing each message in Spark, several DataFrames are created to handle attributes that are objects

 

And here is my doubt: how do I write these RDDs to several database tables, keeping in mind that if for some reason one of the writes fails, we should roll back all the information?

 

Should I use staging tables and, at the end of processing, mark the data as valid?
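To make the staging-table idea concrete, here is a minimal sketch (using sqlite3 in place of my real JDBC target; the table and column names are invented for illustration). Rows land in staging with `valid = 0`, and only after every insert for the batch has succeeded is the whole batch flipped to valid in one statement; readers simply ignore rows where `valid = 0`, so a half-written batch is never visible.

```python
import sqlite3

# sqlite3 stands in for the real JDBC destination here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events_staging "
    "(batch_id TEXT, payload TEXT, valid INTEGER DEFAULT 0)"
)
conn.commit()

def load_batch(conn, batch_id, rows):
    # 1. Load everything into the staging table first (valid defaults to 0).
    conn.executemany(
        "INSERT INTO events_staging (batch_id, payload) VALUES (?, ?)",
        [(batch_id, r) for r in rows],
    )
    # 2. Only after every insert succeeded, mark the whole batch valid
    #    in a single statement.
    conn.execute(
        "UPDATE events_staging SET valid = 1 WHERE batch_id = ?", (batch_id,)
    )
    conn.commit()

load_batch(conn, "batch-001", ['{"a": 1}', '{"a": 2}'])
print(conn.execute(
    "SELECT COUNT(*) FROM events_staging WHERE valid = 1"
).fetchone()[0])  # 2
```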

 

What do you think is the best approach?

 

Best regards.

5 REPLIES

Re: Spark Streaming + Kafka + Multiple database tables for output

Explorer

Hi, you might be able to do this if your destination database and its JDBC driver support transactions (commits and rollbacks).

 

 

Re: Spark Streaming + Kafka + Multiple database tables for output

New Contributor

Hi vanhalen,

 

I think JDBC allows transactions; the problem I see is whether I can broadcast the transaction to the Spark workers. What I am trying to do is abort everything inserted if something fails.


Re: Spark Streaming + Kafka + Multiple database tables for output

Explorer

I don't think this can be done purely in Spark. You have to use the JDBC API directly (import java.sql.*) and wrap your DML statements inside a transaction, i.e. setAutoCommit(false), commit if everything is OK, and roll back everything if any one of the DML statements fails.
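For illustration, here is that commit/rollback pattern in Python, with sqlite3 standing in for the JDBC connection (sqlite3 opens a transaction implicitly on the first DML statement, which plays the role of setAutoCommit(false); the tables are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE logins (user_id INTEGER NOT NULL, ts TEXT)")
conn.commit()

def write_both_tables(conn, user, login):
    try:
        conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", user)
        conn.execute("INSERT INTO logins (user_id, ts) VALUES (?, ?)", login)
        conn.commit()    # everything OK: make both inserts visible
    except sqlite3.Error:
        conn.rollback()  # any failure: undo every insert in the transaction
        raise

write_both_tables(conn, (1, "alice"), (1, "2020-01-01"))
try:
    # The second insert violates NOT NULL, so the rollback also undoes
    # the users insert of "bob" from the same transaction.
    write_both_tables(conn, (2, "bob"), (None, "2020-01-02"))
except sqlite3.Error:
    pass

print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])   # 1
print(conn.execute("SELECT COUNT(*) FROM logins").fetchone()[0])  # 1
```

The catch for Spark is that a java.sql connection lives on one JVM, so a single transaction cannot span multiple executors; each worker can only get all-or-nothing behavior for the statements it runs itself.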

Re: Spark Streaming + Kafka + Multiple database tables for output

New Contributor

I am trying a different approach, but in the end it is similar to what you said.

I will flatten the JSON into a single DataFrame, then dump that DataFrame to the database and execute a single statement using transactions.
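Roughly, this is what each worker would run inside foreachPartition once everything is in one flattened DataFrame (a sketch only: sqlite3 stands in for the JDBC connection each task would open, and the `events` table is an assumption). Each partition is written in one transaction, so a failure anywhere in the partition aborts all of that partition's rows, though atomicity is still per partition, not per micro-batch:

```python
import sqlite3

def write_partition(rows, conn):
    """Write one partition's rows in a single transaction: if any insert
    fails, roll back every row of the partition."""
    try:
        conn.executemany(
            "INSERT INTO events (id, payload) VALUES (?, ?)", rows
        )
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise

# Simulating two partitions of the flattened DataFrame; in real Spark,
# each task would open its own connection inside foreachPartition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.commit()
for partition in [[(1, "a"), (2, "b")], [(3, "c")]]:
    write_partition(partition, conn)
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 3
```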

 

What do you think?

 

Thanks.

Best regards.

Re: Spark Streaming + Kafka + Multiple database tables for output

Explorer

Sounds good to me.