Created 02-01-2016 09:19 AM
Hi, we are currently implementing a POC in which we need to import data from an RDBMS.
Previously we used Sqoop for this and it worked fine. Now we need to pull the data through Kafka for real-time processing. How can we implement this?
Created 02-01-2016 10:19 AM
Kafka itself doesn't pull any data; it is a data persistence store. One question: why do you need Kafka? It is a great persistence store and a great input layer for Storm/Spark Streaming because of its replay capabilities. However, databases have similar characteristics, so you should normally be able to connect Storm/Spark directly to the RDBMS as well.
But let's think about how you could implement real-time streaming from a database:
1) Best way IMO: push data into Kafka at the same time you put it in the database. I.e. don't pull it OUT of the DB, push it into Kafka at the same time you put it into the DB (for example by adding a second hook to the web app that writes the data). You can then use Kafka for all analytics, use it as a source for your warehouse and real-time analytics, and you do not need to do the ETL that is normally needed on the transactional DB. It's also as real-time as it gets. However, this is not always possible.
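As a rough illustration of that dual-write hook, here is a minimal sketch (not production code): the service writes the row to the RDBMS as before and publishes the same record to Kafka in the same call path. The topic, table, broker address and the order fields are all invented for the example.

// Minimal sketch of a service-layer "dual write": insert into the RDBMS and
// publish the same record to Kafka. Names below are assumptions, not real config.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderWriter {

    private final Connection db;                       // JDBC connection managed elsewhere
    private final Producer<String, String> producer;   // Kafka producer shared by the app

    public OrderWriter(Connection db) {
        this.db = db;
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    public void saveOrder(String orderId, String payloadJson) throws Exception {
        // 1) write to the transactional database, exactly as the app already does
        try (PreparedStatement ps =
                 db.prepareStatement("INSERT INTO orders (id, payload) VALUES (?, ?)")) {
            ps.setString(1, orderId);
            ps.setString(2, payloadJson);
            ps.executeUpdate();
        }
        // 2) publish the same record to Kafka for the streaming/analytics side
        producer.send(new ProducerRecord<>("orders", orderId, payloadJson));
    }
}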
2) There are some log replication tools that can integrate with Kafka
http://www.oracle.com/us/products/middleware/data-integration/goldengate-for-big-data-ds-2415102.pdf
GoldenGate for Java seems to fit the bill.
Edit: The Kafka folks have an example for Postgres using a log replication tool called Bottled Water. It is the same approach, and the article also explains the problem nicely.
http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
This is also pretty close to real time.
3) Use some batched service that runs every x seconds/minutes, runs a SQL command that loads data with a new timestamp/unique id, and puts it into Kafka.
This can be:
- a little Java producer with a JDBC driver (see the sketch below, after this list)
- You could use Storm:
http://storm.apache.org/documentation/storm-jdbc.html
- Spark Streaming might have one as well
- or simply a scheduled job doing the copying, perhaps MapReduce:
https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/lib/db/DBInputFormat.html
- pretty sure Flume can do it as well
EDIT: Added Sqoop2 for completeness' sake.
Sqoop has a Kafka connector, but it's only available in Sqoop2, not Sqoop1. Unfortunately, HDP doesn't currently support Sqoop2, so it would have to be installed manually.
http://sqoop2.readthedocs.org/en/latest/Connectors.html#kafka-connector
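For the "little Java producer with a JDBC driver" bullet, here is a minimal sketch of such a batched poller. It assumes the source table has a monotonically increasing id column; the JDBC URL, credentials, topic and table names are made up, and in practice the high-water mark would need to be persisted somewhere durable.

// Sketch of a batched JDBC-to-Kafka poller: every 30 seconds, ship all rows
// with an id greater than the last one we have already sent.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class JdbcToKafkaPoller {

    public static void main(String[] args) throws Exception {
        // Assumed connection details -- replace with your SQL Server instance and topic.
        String jdbcUrl = "jdbc:sqlserver://dbhost:1433;databaseName=sales";
        String topic   = "orders";

        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        long lastSeenId = 0L;   // high-water mark; persist this durably in real life

        try (Connection db = DriverManager.getConnection(jdbcUrl, "user", "password");
             Producer<String, String> producer = new KafkaProducer<>(props)) {

            while (true) {
                // Pull only rows we have not shipped yet, ordered so the watermark is monotonic.
                try (PreparedStatement ps = db.prepareStatement(
                        "SELECT id, payload FROM orders WHERE id > ? ORDER BY id")) {
                    ps.setLong(1, lastSeenId);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            long id = rs.getLong("id");
                            producer.send(new ProducerRecord<>(topic,
                                    Long.toString(id), rs.getString("payload")));
                            lastSeenId = id;
                        }
                    }
                }
                producer.flush();
                Thread.sleep(30_000);   // poll interval: every 30 seconds
            }
        }
    }
}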
Obviously the last option is not really real-time, and the question remains: if you use Storm or Spark, why would you need Kafka in the middle, since the DB is already a persistent store that can replay loads?
Hope that helps
Created 10-05-2016 08:51 PM
If you think real-time stream processing can be done without a streaming framework like Kafka in the middle, why are all your use cases above based on Kafka? 🙂 Kafka is not just a persistent store; it is a highly scalable messaging queue that can feed data from multiple data sources to your target framework (Spark Streaming, Storm, etc.). I wonder how you would directly feed RDBMS data in real time to Storm or Spark without a middleware messaging system like Kafka.
Created 02-08-2019 08:02 AM
Hi all, is there any update on open source SQL Server CDC?
Created 02-01-2016 11:18 AM
@Krishna Srinivas What's the source system? Oracle?
Created 02-01-2016 12:33 PM
Hi Neeraj,
The source system is MS SQL Server.
Created 02-01-2016 12:37 PM
@Neeraj Sabharwal The source system is MS SQL Server.
Created 02-01-2016 12:32 PM
Hi Benjamin,
Thank you for the inputs. We are looking at a lambda architecture in which we would pull the data from the RDBMS into Kafka; from there we would use Spark for batch processing and Storm for streaming.
Currently we are using Sqoop to import data from the RDBMS into Hive/HBase, but in the future we want to use Kafka as the data ingestion tool.
Yeah, I have been going through a lot of forums about Kafka lately, but I have never read about ingestion from a DB.
Also, can we integrate Sqoop and Kafka to work together, implementing incremental imports from the RDBMS to Kafka using Sqoop, feeding that to Spark for batch processing, and updating the Hive tables from there?
Created 02-01-2016 12:57 PM
Read what I wrote above; I think it contains links to most of the possible approaches for connecting a DB to Kafka.
Writing a Sqoop extension that writes directly into Kafka might be possible, but I suppose it's more work than just writing the MapReduce job yourself using DBInputFormat and a KafkaOutputFormat (see the sketch below).
If you want to use Spark for batch, I would read from the same Kafka topic I use for real-time (Storm, Spark Streaming); if you use Spark Streaming, you can just use different timeframes.
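To make the MapReduce idea a bit more concrete: stock Hadoop does not ship a KafkaOutputFormat, so the sketch below is a map-only job that reads rows with the old mapred DBInputFormat (the class linked earlier in this thread) and has each mapper push rows to Kafka through a plain KafkaProducer. Table, column, topic and broker names are invented; treat it as a rough outline, not a tested job.

// Map-only MapReduce job: DBInputFormat reads rows from the RDBMS,
// the mapper forwards each row to Kafka, and nothing is written to HDFS.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Properties;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NullOutputFormat;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DbToKafkaJob {

    // One row of the source table; DBInputFormat fills it via readFields(ResultSet).
    public static class OrderRow implements Writable, DBWritable {
        long id;
        String payload;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            payload = rs.getString("payload");
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setString(2, payload);
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            payload = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(payload);
        }
    }

    // Each mapper owns one Kafka producer and sends every row it receives.
    public static class KafkaMapper extends MapReduceBase
            implements Mapper<LongWritable, OrderRow, NullWritable, NullWritable> {

        private Producer<String, String> producer;

        @Override
        public void configure(JobConf job) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // assumed broker
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            producer = new KafkaProducer<>(props);
        }

        public void map(LongWritable key, OrderRow row,
                        OutputCollector<NullWritable, NullWritable> out, Reporter reporter) {
            producer.send(new ProducerRecord<>("orders", Long.toString(row.id), row.payload));
        }

        @Override
        public void close() {
            producer.close();
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(DbToKafkaJob.class);
        job.setMapperClass(KafkaMapper.class);
        job.setNumReduceTasks(0);                       // map-only
        job.setInputFormat(DBInputFormat.class);
        job.setOutputFormat(NullOutputFormat.class);    // no HDFS output

        DBConfiguration.configureDB(job,
                "com.microsoft.sqlserver.jdbc.SQLServerDriver",
                "jdbc:sqlserver://dbhost:1433;databaseName=sales", "user", "password");
        DBInputFormat.setInput(job, OrderRow.class, "orders", null, "id",
                new String[] { "id", "payload" });

        JobClient.runJob(job);
    }
}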
Created 02-01-2016 01:17 PM
I take the Sqoop part back; someone is actually working on it. But it's only available in Sqoop2, not Sqoop1. Unfortunately, HDP doesn't currently support Sqoop2, so you would have to install it manually.
http://sqoop2.readthedocs.org/en/latest/Connectors.html#kafka-connector