Support Questions

Find answers, ask questions, and share your expertise

How to integrate kafka to pull data from RDBMS

avatar
Super Collaborator

Hi,Currently we are implementing a POC in which we require to import data from RDBMS.

Previously, we used sqoop to do the same and it was working fine. Currently, need to pull data using kafka for real time processing. How to implement the same

1 ACCEPTED SOLUTION

avatar
Master Guru

Kafka itself doesn't pull any data. It is a data persistence store. One question: Why do you need Kafka? It is a great persistence store and a great input layer for Storm/Spark Streaming because of its replay capabilities. However databases have similar characteristics. So you normally should be able to directly connect to the RDBMS with Storm/Spark as well.

But lets think how you could implement real-time streaming from a database:

1) Best way IMO: push data into Kafka at the same time you put it in the database. I.e. don't pull it OUT of the DB, push it in Kafka at the same time you put it into the DB. ( for example by adding a second hook to the web app that writes the data. ) You can then use Kafka for all analytics, use it as a source for your warehouse and realtime analytics and you do not need to do the ETL that is normally needed on the transactional db. Its also as realtime as it gets. However this is not always possible.

2) There are some log replication tools that can integrate with Kafka

http://www.oracle.com/us/products/middleware/data-integration/goldengate-for-big-data-ds-2415102.pdf

GoldenGate for Java seems to fit the bill.

Edit: The Kafka guys have an example for postgres using a log replication tool called bottled water. This is the same approach. The article also explains the problem nicely.

http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/

Also pretty realtime.

3) Use some batched service that runs every x seconds/minutes and runs a SQL command that loads data with a new timestamp/ unique id and puts it into Kafka.

This can be

- a little Java Producer with a JDBC driver

- You could use Storm:

http://storm.apache.org/documentation/storm-jdbc.html

- Spark Streaming might have one as well

Or simply a scheduled job doing the copying perhaps MapReduce

https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/lib/db/DBInputFormat.html

https://github.com/kafka-dev/kafka/blob/master/contrib/hadoop-producer/src/main/java/kafka/bridge/ha...

- pretty sure flume can do it as well

EDIT: Added sqoop2 for completion sake

Sqoop has a Kafka connector. But its only available in sqoop2 not sqoop1. Unfortunately HDP doesn't currently support sqoop2. So it would have to be manually installed.

http://sqoop2.readthedocs.org/en/latest/Connectors.html#kafka-connector

Obviously the last is not really realtime and the question is if you use storm or Spark why you would need kafka in the middle. Since the db is already a persisted store that can replay loads.

Hope that helps

View solution in original post

16 REPLIES 16

avatar
Master Guru

Kafka itself doesn't pull any data. It is a data persistence store. One question: Why do you need Kafka? It is a great persistence store and a great input layer for Storm/Spark Streaming because of its replay capabilities. However databases have similar characteristics. So you normally should be able to directly connect to the RDBMS with Storm/Spark as well.

But lets think how you could implement real-time streaming from a database:

1) Best way IMO: push data into Kafka at the same time you put it in the database. I.e. don't pull it OUT of the DB, push it in Kafka at the same time you put it into the DB. ( for example by adding a second hook to the web app that writes the data. ) You can then use Kafka for all analytics, use it as a source for your warehouse and realtime analytics and you do not need to do the ETL that is normally needed on the transactional db. Its also as realtime as it gets. However this is not always possible.

2) There are some log replication tools that can integrate with Kafka

http://www.oracle.com/us/products/middleware/data-integration/goldengate-for-big-data-ds-2415102.pdf

GoldenGate for Java seems to fit the bill.

Edit: The Kafka guys have an example for postgres using a log replication tool called bottled water. This is the same approach. The article also explains the problem nicely.

http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/

Also pretty realtime.

3) Use some batched service that runs every x seconds/minutes and runs a SQL command that loads data with a new timestamp/ unique id and puts it into Kafka.

This can be

- a little Java Producer with a JDBC driver

- You could use Storm:

http://storm.apache.org/documentation/storm-jdbc.html

- Spark Streaming might have one as well

Or simply a scheduled job doing the copying perhaps MapReduce

https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/lib/db/DBInputFormat.html

https://github.com/kafka-dev/kafka/blob/master/contrib/hadoop-producer/src/main/java/kafka/bridge/ha...

- pretty sure flume can do it as well

EDIT: Added sqoop2 for completion sake

Sqoop has a Kafka connector. But its only available in sqoop2 not sqoop1. Unfortunately HDP doesn't currently support sqoop2. So it would have to be manually installed.

http://sqoop2.readthedocs.org/en/latest/Connectors.html#kafka-connector

Obviously the last is not really realtime and the question is if you use storm or Spark why you would need kafka in the middle. Since the db is already a persisted store that can replay loads.

Hope that helps

avatar
Explorer

If you think real time stream processing can be done without streaming framework like kafka in middle why all your above use case are based on kafka 🙂 Kafka is not just persistent store its a highly scalable messaging queue that can feed data from multiple datasources to your target framework(spark streams, storm ) etc. I wonder how would you directly feed RDBMS data in real time to storm or spark without any middleware messaging system like kafka.

avatar
New Contributor

Hi all, is there any update on open source SQL Server CDC?

avatar
Master Mentor

@Krishna Srinivas What's the source system? oracle ?

avatar
Super Collaborator

Hi Neeraj,

Source System is MSSQL Server

avatar
Super Collaborator

@Neeraj SabharwalSource System is MSSQL Server

avatar
Super Collaborator

Hi Benjamin,

Thank you for the inputs, we are looking for a lambda architecture wherein we would pull the data from RDBMS into kafka and from there for batch processing we would use spark and for streaming we want to use storm.

Currently, we are using sqoop to import data from RDBMS to Hive/Hbase. But in future we wants to implement Kafka to work as the data ingestion tool.

Yeah, I have been going through a lot of forums lately about kafka but i have never read about any ingestion from DB.

Also, Can we integrate sqoop and Kafka to work together. Implementing incremental import from RDBMS using sqoop to kafka and providing the same to spark for batch processing and updating to Hive Tables from there

avatar
Master Guru

Read what I wrote I think it contains links to most possible approaches to connect a DB to Kafka.

Writing a sqoop extension that directly writes into Kafka might be possible but I suppose its more work than just to write the MapReduce job yourself using the DBInputFormat and KafkaOutputformat.

If you want to use Spark for batch I would read from the same Kafka topic I would use for realtime ( storm,spark streaming) if you use spark streaming you can just use different timeframes.

avatar
Master Guru

I take the sqoop part back. Someone is actually working on it. But its only available in sqoop2 not sqoop1. Unfortunately HDP doesn't currently support sqoop2. So you would have to manually install it.

http://sqoop2.readthedocs.org/en/latest/Connectors.html#kafka-connector