Support Questions

Find answers, ask questions, and share your expertise

How to integrate kafka to pull data from RDBMS

avatar
Super Collaborator

Hi,Currently we are implementing a POC in which we require to import data from RDBMS.

Previously, we used sqoop to do the same and it was working fine. Currently, need to pull data using kafka for real time processing. How to implement the same

1 ACCEPTED SOLUTION

avatar
Master Guru

Kafka itself doesn't pull any data. It is a data persistence store. One question: Why do you need Kafka? It is a great persistence store and a great input layer for Storm/Spark Streaming because of its replay capabilities. However databases have similar characteristics. So you normally should be able to directly connect to the RDBMS with Storm/Spark as well.

But lets think how you could implement real-time streaming from a database:

1) Best way IMO: push data into Kafka at the same time you put it in the database. I.e. don't pull it OUT of the DB, push it in Kafka at the same time you put it into the DB. ( for example by adding a second hook to the web app that writes the data. ) You can then use Kafka for all analytics, use it as a source for your warehouse and realtime analytics and you do not need to do the ETL that is normally needed on the transactional db. Its also as realtime as it gets. However this is not always possible.

2) There are some log replication tools that can integrate with Kafka

http://www.oracle.com/us/products/middleware/data-integration/goldengate-for-big-data-ds-2415102.pdf

GoldenGate for Java seems to fit the bill.

Edit: The Kafka guys have an example for postgres using a log replication tool called bottled water. This is the same approach. The article also explains the problem nicely.

http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/

Also pretty realtime.

3) Use some batched service that runs every x seconds/minutes and runs a SQL command that loads data with a new timestamp/ unique id and puts it into Kafka.

This can be

- a little Java Producer with a JDBC driver

- You could use Storm:

http://storm.apache.org/documentation/storm-jdbc.html

- Spark Streaming might have one as well

Or simply a scheduled job doing the copying perhaps MapReduce

https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/lib/db/DBInputFormat.html

https://github.com/kafka-dev/kafka/blob/master/contrib/hadoop-producer/src/main/java/kafka/bridge/ha...

- pretty sure flume can do it as well

EDIT: Added sqoop2 for completion sake

Sqoop has a Kafka connector. But its only available in sqoop2 not sqoop1. Unfortunately HDP doesn't currently support sqoop2. So it would have to be manually installed.

http://sqoop2.readthedocs.org/en/latest/Connectors.html#kafka-connector

Obviously the last is not really realtime and the question is if you use storm or Spark why you would need kafka in the middle. Since the db is already a persisted store that can replay loads.

Hope that helps

View solution in original post

16 REPLIES 16

avatar
Master Mentor
@Krishna Srinivas

I am sure have seen this

Sqoop to Kakfa - I don't think so or I have not seen this integration.

You can build data ingestion from SqlServer to Kafka as disucssed in the above link.

avatar
Master Mentor

@Krishna Srinivas

take a look at nifi, you can sqoop into a spooling dir, have Kafka pick up from there on. Spark streaming in nifi already exists and Storm is going to be included soon. Rough idea of your last inquiry

Sqoop incremental into hdfs directory > watch hdfs dir with nifi > putKafka > Stormspark

You can also split to two pipes in nifi and join into one pipe from two

avatar
Master Mentor

Look at this example Link for spark streaming an this example for Kafka Link @Krishna Srinivas

avatar
New Contributor

@Krishna Srinivas

You can use Streamsets Data Collector, which is open source, to read from JDBC Databases and put to Kafka topics.

avatar
New Contributor

Guys,

I would prefer Kafka only when data is pushing from an external system.

and another place where I use Kafka, pulled data will be used by multiple parties .so that each consumer connects to kafka topic.

when you have control to pull the data then you can go for custom receivers in Spark. pull what you can consume.

which avoids the extra overhead of maintaining Kafka cluster for balancing the load.

Regards,

@Ram.

avatar
Explorer

Currently we are implementing a POC in which we require to import real time data from RDBMS to Kafka using Attunity..How to implement the same

avatar
New Contributor

Hi all, is there any update on open source CDC tool for SQL Server?