What are some approaches to ingest near real time data (defined as less than 10 min) from RDBMS system to Hadoop?

What are some approaches to ingesting near-real-time data (defined as less than 10 minutes) from an RDBMS into Hadoop? Is there any guidance on which approach suits which case? I can think of Sqoop, Flume, and NiFi. Are there any other open source alternatives?

Re: ​What are some approaches to ingest near real time data ( defined as less than 10 min ) from RDBMS system to Hadoop ?

There are different options:

- Sqoop should work if the data volumes are not too big and you don't need to be faster than 5 minutes.

- For something closer to real time, you can use change data capture (CDC) solutions. For example, LinkedIn built a Postgres connector to Kafka, and Oracle GoldenGate also has a big data connector. IBM CDC should have one as well.

https://community.hortonworks.com/questions/12787/how-to-integrate-kafka-to-pull-data-from-rdbms.htm...

- You might be able to use Storm, but you would have to write a JDBCSpout. (I'm pretty sure the same would work with Spark Streaming.)

https://community.hortonworks.com/questions/17524/is-there-any-way-to-keep-the-data-in-db2-and-using...

- Oh, and if you are willing to throw some development time at it, you could take this JDBCStorageHandler and fix it up. I once got it to run with the new Hive version and fixed problems such as predicate pushdown. However, some issues still remained, and it's not quite straightforward. But it would be pretty awesome, because you could directly join DB tables with Hive tables and select from a DB table and insert into a Hive table as needed.

https://github.com/qubole/Hive-JDBC-Storage-Handler

Re: ​What are some approaches to ingest near real time data ( defined as less than 10 min ) from RDBMS system to Hadoop ?

Thank you, @Benjamin Leonhardi. Do you think NiFi will work in this situation?

Re: ​What are some approaches to ingest near real time data ( defined as less than 10 min ) from RDBMS system to Hadoop ?

Yes, absolutely. NiFi is a great option; take a look at this example (even though it's not RDBMS): https://community.hortonworks.com/articles/8422/visualize-near-real-time-stock-price-changes-using.h...

Re: ​What are some approaches to ingest near real time data ( defined as less than 10 min ) from RDBMS system to Hadoop ?

What Jonas said. Essentially, any streaming tool could implement this, and NiFi is a great choice as well. The main problem is performance on the database side, since they all require frequent queries against the database to fetch the new data. If you want it closer to real time, or the tables are too big for frequent queries, you may have to look at one of the change data capture tools I mentioned.
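
To make the "frequent queries" pattern concrete, here is a minimal Python sketch of watermark-based polling, which is roughly the idea behind incremental imports. It is only an illustration under assumptions: the `orders` table, its columns, and the function names are hypothetical, and an in-memory SQLite database stands in for the real RDBMS so the snippet runs on its own.

```python
import sqlite3


def fetch_new_rows(conn, last_seen_id, batch_size=1000):
    """Return rows whose id is greater than last_seen_id, oldest first."""
    cur = conn.execute(
        "SELECT id, payload FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, batch_size),
    )
    return cur.fetchall()


def poll_once(conn, last_seen_id):
    """Run one polling cycle and return (rows, new_watermark)."""
    rows = fetch_new_rows(conn, last_seen_id)
    # Advance the watermark only if something new was read.
    new_watermark = rows[-1][0] if rows else last_seen_id
    return rows, new_watermark


def demo():
    # Hypothetical source table; SQLite stands in for the real database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO orders (payload) VALUES (?)", [("a",), ("b",)])

    first, watermark = poll_once(conn, 0)
    conn.execute("INSERT INTO orders (payload) VALUES ('c')")
    second, watermark = poll_once(conn, watermark)
    return first, second, watermark


if __name__ == "__main__":
    print(demo())
```

Each cycle costs the source database one query, so the polling interval and an index on the watermark column decide how much load this adds. Note that a query like this only sees rows that still exist, so deletes on the source are not picked up.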

Re: ​What are some approaches to ingest near real time data ( defined as less than 10 min ) from RDBMS system to Hadoop ?

Why not leverage Calcite for joining Hive and an RDBMS instead of JDBCStorageHandler? I've seen a working demo of a Phoenix join with MySQL thanks to Calcite.

Re: ​What are some approaches to ingest near real time data ( defined as less than 10 min ) from RDBMS system to Hadoop ?

You can use Sqoop for structured data (RDBMS) or Flume for streaming data.

Re: ​What are some approaches to ingest near real time data ( defined as less than 10 min ) from RDBMS system to Hadoop ?

True change data capture out of an RDBMS requires software that follows the redo logs and captures the SQL that matches some configuration (for example, Oracle GoldenGate). These solutions are generally proprietary. You can, of course, simply poll a table with a tool like Sqoop, but I am not sure how well this will scale or how supportable it is in production.
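
As a rough illustration of what a log-following tool hands to the downstream side, the Python sketch below replays an ordered list of change events into a replica. The event layout (`op`, `key`, `row`) is a hypothetical simplification for this example and is not the format of GoldenGate or any other product.

```python
def apply_change(replica, event):
    """Apply one change event to a dict keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]
    elif op == "delete":
        # Deletes appear in the log, unlike with simple table polling.
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown op: {op}")


def replay(events):
    """Build a replica by applying events in log order."""
    replica = {}
    for event in events:
        apply_change(replica, event)
    return replica


# Hypothetical change events, in the order they were committed.
events = [
    {"op": "insert", "key": 1, "row": {"status": "new"}},
    {"op": "insert", "key": 2, "row": {"status": "new"}},
    {"op": "update", "key": 1, "row": {"status": "shipped"}},
    {"op": "delete", "key": 2},
]

if __name__ == "__main__":
    print(replay(events))
```

Because the events come from the log rather than from repeated queries, the source tables are not scanned on every cycle, and updates and deletes are captured along with inserts.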

Re: ​What are some approaches to ingest near real time data ( defined as less than 10 min ) from RDBMS system to Hadoop ?

Sqoop, GoldenGate, Attunity Replicate, NiFi (which now has QueryDatabaseTable for simple change capture), and Kafka Connect are all options; they vary in complexity and cost.

Re: ​What are some approaches to ingest near real time data ( defined as less than 10 min ) from RDBMS system to Hadoop ?

We are currently implementing a POC in which we need to import real-time data from an RDBMS into Kafka using Attunity. How can we implement this?
