Created 02-01-2016 09:19 AM
Hi, currently we are implementing a POC in which we need to import data from an RDBMS.
Previously we used Sqoop for this and it worked fine. Now we need to pull the data through Kafka for real-time processing. How can we implement this?
Created 02-01-2016 10:19 AM
Kafka itself doesn't pull any data; it is a data persistence store. One question: why do you need Kafka? It is a great persistence store and a great input layer for Storm/Spark Streaming because of its replay capabilities. However, databases have similar characteristics, so you should normally be able to connect Storm/Spark directly to the RDBMS as well.
But let's think about how you could implement real-time streaming from a database:
1) Best way IMO: push data into Kafka at the same time you write it to the database. I.e., don't pull it OUT of the DB; push it into Kafka at the same moment you put it into the DB (for example, by adding a second hook to the web app that writes the data). You can then use Kafka for all analytics, use it as a source for your warehouse and realtime analytics, and you avoid the ETL that is normally needed on the transactional DB. It's also as realtime as it gets. However, this is not always possible.
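A minimal sketch of this dual-write idea in Python. The table schema, topic name "orders", and the `StubProducer` are all made up for illustration; with the real kafka-python client you would pass a `KafkaProducer` instead of the stub, and the `send()` call shape stays the same.

```python
import json
import sqlite3

def save_order(db, producer, order):
    """Write the record to the RDBMS and push the same payload to Kafka
    in the same code path, so nothing ever has to be pulled OUT of the DB."""
    db.execute(
        "INSERT INTO orders (id, amount) VALUES (?, ?)",
        (order["id"], order["amount"]),
    )
    db.commit()
    # With kafka-python this would be the real call:
    #   producer.send("orders", json.dumps(order).encode("utf-8"))
    producer.send("orders", json.dumps(order).encode("utf-8"))

class StubProducer:
    """Stand-in for a real Kafka producer so the sketch runs without a
    broker; only the send(topic, value) call shape matters here."""
    def __init__(self):
        self.messages = []
    def send(self, topic, value):
        self.messages.append((topic, value))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
producer = StubProducer()
save_order(db, producer, {"id": 1, "amount": 9.99})
print(producer.messages[0][0])  # orders
```

The point of the pattern is that the write path of the application feeds both stores; there is no polling and no ETL job in between.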
2) There are some log replication tools that can integrate with Kafka
http://www.oracle.com/us/products/middleware/data-integration/goldengate-for-big-data-ds-2415102.pdf
GoldenGate for Java seems to fit the bill.
Edit: The Kafka guys have an example for Postgres using a log replication tool called Bottled Water. This is the same approach, and the article also explains the problem nicely.
http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
Also pretty realtime.
3) Use a batched service that runs every x seconds/minutes, executes a SQL query that loads rows with a new timestamp/unique id, and puts them into Kafka.
This can be:
- a little Java producer with a JDBC driver
- Storm: http://storm.apache.org/documentation/storm-jdbc.html
- Spark Streaming might have one as well
- or simply a scheduled job doing the copying, perhaps MapReduce: https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/lib/db/DBInputFormat.html
- pretty sure Flume can do it as well
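Whichever tool runs the batch, the core of option 3 is the same incremental poll: remember the highest id (or timestamp) you have seen, and on each run select only rows beyond it. A runnable sketch using stdlib sqlite3; the `events` table and the idea of sending each row to a producer afterwards are illustrative assumptions.

```python
import sqlite3

def poll_new_rows(db, last_id):
    """Fetch only rows added since the last poll, using a monotonically
    increasing id column (a timestamp column works the same way)."""
    rows = db.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_id
    return rows, new_last

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
db.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",)])

last_id = 0
rows, last_id = poll_new_rows(db, last_id)   # first run picks up both rows
# each row would then be pushed to Kafka, e.g. producer.send("events", ...)
db.execute("INSERT INTO events (payload) VALUES (?)", ("c",))
rows2, last_id = poll_new_rows(db, last_id)  # later run sees only the new row
print(len(rows), len(rows2))  # 2 1
```

The same loop works whether the scheduler is cron, a Storm spout, or a Spark Streaming job; only the id/timestamp bookkeeping matters.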
EDIT: Added Sqoop2 for completeness' sake.
Sqoop has a Kafka connector, but it's only available in Sqoop2, not Sqoop1. Unfortunately HDP doesn't currently support Sqoop2, so it would have to be installed manually.
http://sqoop2.readthedocs.org/en/latest/Connectors.html#kafka-connector
Obviously the last option is not really realtime, and the question is: if you use Storm or Spark, why would you need Kafka in the middle, since the DB is already a persisted store that can replay loads?
Hope that helps
Created 02-01-2016 12:41 PM
I am sure you have seen this.
Sqoop to Kafka - I don't think so; I have not seen this integration.
You can build data ingestion from SQL Server to Kafka as discussed in the above link.
Created 02-01-2016 12:41 PM
Take a look at NiFi: you can Sqoop into a spooling dir and have Kafka pick up from there. Spark streaming in NiFi already exists, and Storm is going to be included soon. A rough idea for your last inquiry:
Sqoop incremental into an HDFS directory > watch the HDFS dir with NiFi > PutKafka > Storm/Spark
You can also split into two pipes in NiFi and join the two back into one.
Created 02-01-2016 12:44 PM
Look at this example Link for Spark Streaming and this example for Kafka Link @Krishna Srinivas
Created 04-01-2016 01:02 PM
You can use StreamSets Data Collector, which is open source, to read from JDBC databases and write to Kafka topics.
Created 07-03-2017 07:15 AM
Guys,
I would prefer Kafka only when data is being pushed from an external system, or when the pulled data will be used by multiple parties, so that each consumer connects to a Kafka topic.
When you have control over pulling the data, you can go for custom receivers in Spark and pull only what you can consume.
This avoids the extra overhead of maintaining a Kafka cluster for balancing the load.
Regards,
@Ram.
Created 07-25-2017 01:17 PM
Currently we are implementing a POC in which we need to import real-time data from an RDBMS to Kafka using Attunity. How can we implement this?
Created 02-08-2019 02:58 PM
Hi all, is there any update on open source CDC tool for SQL Server?