I am just starting to work with Spark Streaming and Kafka, and I am trying to find the best approach for this kind of scenario:
- a Spark application is running, consuming Kafka messages
- the Kafka messages are nested JSON
- after processing a message in Spark, several DataFrames are created to handle attributes that are objects
And here is my doubt: how do I dump these RDDs into several database tables, keeping in mind that if one of the RDDs fails for some reason, we should roll back all of the information?
Should I use staging tables and, at the end of processing, mark the data as valid?
What do you think is the best approach?
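To make the staging-table idea concrete, here is a minimal sketch of the pattern: load each batch into a staging table first, and flip a "valid" flag only as the last step of the transaction, so readers never see a half-loaded batch. This uses Python's sqlite3 as a stand-in for the real JDBC destination, and the table and column names (`events_staging`, `batch_id`, `valid`) are made up for illustration.

```python
import sqlite3

# sqlite3 stands in for the real JDBC destination; isolation_level=None
# puts the connection in autocommit mode so we control BEGIN/COMMIT
# explicitly, mirroring manual JDBC transaction handling.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE events_staging "
    "(batch_id TEXT, payload TEXT, valid INTEGER DEFAULT 0)"
)

def load_batch(conn, batch_id, rows):
    conn.execute("BEGIN")
    try:
        conn.executemany(
            "INSERT INTO events_staging (batch_id, payload) VALUES (?, ?)",
            [(batch_id, r) for r in rows],
        )
        # Mark the whole batch valid only after every insert succeeded;
        # downstream readers ignore rows where valid = 0.
        conn.execute(
            "UPDATE events_staging SET valid = 1 WHERE batch_id = ?",
            (batch_id,),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # the half-loaded batch disappears
        raise

load_batch(conn, "batch-001", ['{"a": 1}', '{"b": 2}'])
```

With a real warehouse you would typically either flip the flag as above or move rows from staging into the final tables in that last transactional step; either way the "mark as valid" write is the commit point.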
Hi, you might be able to do this if your destination database and its JDBC driver support transactions (rollbacks and commits).
I think JDBC allows transactions; the problem I see is whether I can broadcast the transaction to the Spark workers. What I am trying to do is abort everything inserted if something fails.
I don't think this can be done by Spark itself. You have to use JDBC API style syntax (import java.sql.*) and wrap your DML statements inside a transaction, i.e. setAutoCommit(false), commit if everything is OK, and roll back everything if any one of the DML statements fails.
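A runnable illustration of that commit/rollback pattern, using Python's sqlite3 in place of a JDBC connection (the mechanics are the same: autocommit off, commit on success, roll back on any failure). In Spark you would open the connection inside foreachPartition so each worker manages its own transaction; the table names here are invented for the example.

```python
import sqlite3

# isolation_level=None = manual transactions (the sqlite3 analogue of
# setAutoCommit(false) on a JDBC connection).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE child (id INTEGER PRIMARY KEY, parent_id INTEGER)")

def write_all_or_nothing(conn, parent_rows, child_rows):
    conn.execute("BEGIN")
    try:
        conn.executemany("INSERT INTO parent (id) VALUES (?)", parent_rows)
        conn.executemany(
            "INSERT INTO child (id, parent_id) VALUES (?, ?)", child_rows
        )
        conn.execute("COMMIT")    # everything OK: make it all visible
    except Exception:
        conn.execute("ROLLBACK")  # any failure: nothing is persisted
        raise

# A duplicate primary key makes the second insert fail, and the rows
# inserted before the failure are rolled back along with it.
try:
    write_all_or_nothing(conn, [(1,), (1,)], [(10, 1)])
except sqlite3.IntegrityError:
    pass
```

After the failed call, both tables are empty: the partial work inside the transaction was discarded, which is exactly the all-or-nothing behavior being discussed.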
I am trying a different approach, but in the end it's similar to what you said.
I will flatten the nested JSON into a single DataFrame, then dump that DataFrame to the database and execute a single statement using transactions.
What do you think?
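That flatten-then-write approach could look roughly like the sketch below: expand the nested JSON into flat rows first, then insert everything in a single transaction. Again sqlite3 stands in for the JDBC destination (inside Spark this body would run per partition via foreachPartition), and the message fields (`id`, `customer.name`, `customer.city`) are hypothetical.

```python
import json
import sqlite3

def flatten(msg, prefix=""):
    """Recursively expand nested objects into flat column names,
    e.g. {"customer": {"name": ...}} -> {"customer_name": ...}."""
    out = {}
    for key, value in msg.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, name + "_"))
        else:
            out[name] = value
    return out

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE orders (id INTEGER, customer_name TEXT, customer_city TEXT)"
)

rows = [flatten(json.loads(m)) for m in [
    '{"id": 1, "customer": {"name": "Ana", "city": "Porto"}}',
    '{"id": 2, "customer": {"name": "Rui", "city": "Braga"}}',
]]

# Single transaction for the whole flattened batch: either every row
# lands, or none of them do.
conn.execute("BEGIN")
try:
    conn.executemany(
        "INSERT INTO orders (id, customer_name, customer_city) VALUES (?, ?, ?)",
        [(r["id"], r["customer_name"], r["customer_city"]) for r in rows],
    )
    conn.execute("COMMIT")
except Exception:
    conn.execute("ROLLBACK")
    raise
```

One thing to keep in mind: with one DataFrame and one transaction per partition you get all-or-nothing behavior within a partition, but a failure in one partition will not roll back partitions that already committed, which is where the staging-table idea from earlier in the thread can still help.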