What are some approaches to ingest near-real-time data (defined as less than 10 min of latency) from an RDBMS into Hadoop? Is there any guidance on which approach suits which case? I can think of Sqoop, Flume, and NiFi. Are there any other open source alternatives?
- Sqoop should work if the data volumes are not too big and you don't need latency below about 5 min. Its incremental import mode can pick up only the rows added since the last run.
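To make the Sqoop point concrete, here is a minimal sketch of what an incremental append import does under the hood: remember a watermark (the highest value of a check column seen so far) and fetch only rows beyond it on each run. SQLite stands in for the source RDBMS, and the table and column names are made up for illustration; Sqoop itself does this via `--incremental append --check-column ... --last-value ...` over JDBC.

```python
import sqlite3

# Stand-in for a source RDBMS table; with Sqoop you would point a JDBC
# URL at the real database instead. Table/column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)",
                 [(1, 9.99), (2, 19.50), (3, 5.00)])

def incremental_import(conn, last_value):
    """Fetch only rows added since the previous run, mirroring
    Sqoop's incremental-append semantics on a check column."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id",
        (last_value,)).fetchall()
    new_last = rows[-1][0] if rows else last_value
    return rows, new_last

# First run imports everything; later runs pick up only new rows.
batch, watermark = incremental_import(conn, 0)
conn.execute("INSERT INTO orders (id, amount) VALUES (4, 42.00)")
batch2, watermark = incremental_import(conn, watermark)
```

Scheduling a run every few minutes gives you the ~5 min latency floor mentioned above; the per-run query cost is why this doesn't scale down to seconds.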
- For more real-time ingestion you can use Change Data Capture (CDC) solutions. For example, LinkedIn built a Postgres connector to Kafka, and Oracle GoldenGate also has a Big Data adapter. IBM CDC should have one as well.
- You might be able to use Storm, but you would have to write a JDBC spout. (Pretty sure the same approach would work with Spark Streaming.)
- Oh, and if you are willing to throw some development time at it, you could take this JDBCStorageHandler and fix it up. I once got it running on the new Hive version and fixed problems like predicate pushdown, but some issues still remained; it's not quite straightforward. It would be pretty awesome, though, because you could directly join DB tables with Hive tables, and select from a DB table and insert into a Hive table as needed.
What Jonas said. Essentially any streaming tool could implement this, NiFi being a great choice as well. The main problem is performance on the database side, since all of these approaches require frequent queries against the database to pick up new data. If you need it to be more real-time, or the tables are too big for the frequent query, you may have to look at one of the change data capture tools I mentioned.
Why not leverage Calcite for joining Hive and RDBMSs instead of a JDBCStorageHandler? I've seen a working demo of a Phoenix join with MySQL thanks to Calcite.
You can use Sqoop for structured data (RDBMS) or Flume for streaming data.
True change data capture out of an RDBMS requires software that follows the redo logs and captures SQL matching some configuration (e.g., Oracle GoldenGate). These solutions are generally proprietary. You can, of course, simply poll a table with a tool like Sqoop, but I am not sure how well that scales or how supportable it is in production.
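To illustrate the difference from polling: a log-based CDC tool reads every change from the database's redo/transaction log without querying the tables at all. SQLite exposes no redo log, so the sketch below approximates the idea with triggers that write each change into a changelog table a downstream consumer could tail; real tools like GoldenGate mine the log directly rather than using triggers, and all names here are made up.

```python
import sqlite3

# Trigger-based changelog as a stand-in for redo-log mining.
# Table, trigger, and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE changelog (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                        op TEXT, account_id INTEGER, balance REAL);
CREATE TRIGGER accounts_ins AFTER INSERT ON accounts BEGIN
  INSERT INTO changelog (op, account_id, balance)
  VALUES ('I', NEW.id, NEW.balance);
END;
CREATE TRIGGER accounts_upd AFTER UPDATE ON accounts BEGIN
  INSERT INTO changelog (op, account_id, balance)
  VALUES ('U', NEW.id, NEW.balance);
END;
""")

conn.execute("INSERT INTO accounts VALUES (1, 100.0)")
conn.execute("UPDATE accounts SET balance = 80.0 WHERE id = 1")

# A consumer (e.g. a Kafka producer) would read new changelog rows
# in order and forward them downstream.
events = conn.execute(
    "SELECT op, account_id, balance FROM changelog ORDER BY seq").fetchall()
```

Note that updates are captured as discrete events here, which polling a snapshot table would miss entirely if a row changed twice between polls.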
Sqoop, GoldenGate, Attunity Replicate, now NiFi with QueryDatabaseTable for simple change capture, and Kafka Connect as well; the options vary in complexity and cost.
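For the Kafka Connect option, a source connector is configured declaratively and POSTed to the Connect REST API. Below is a sketch of such a config built as a Python dict; the property keys follow Confluent's JDBC source connector, but the connection URL, table, and column names are assumptions for illustration.

```python
import json

# Sketch of a Kafka Connect JDBC source connector config that polls an
# RDBMS table into a Kafka topic. Host, database, table, and column
# names are hypothetical; keys follow the Confluent JDBC connector.
connector_config = {
    "name": "orders-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://dbhost:5432/sales",
        # timestamp+incrementing catches both new rows and updates
        "mode": "timestamp+incrementing",
        "incrementing.column.name": "id",
        "timestamp.column.name": "last_modified",
        "table.whitelist": "orders",
        "topic.prefix": "rdbms-",
        "poll.interval.ms": "60000",
    },
}

# This payload would be POSTed to the Connect REST API's /connectors
# endpoint to start the connector.
payload = json.dumps(connector_config)
```

Like the NiFi QueryDatabaseTable processor, this is still query-based change capture, so the same caveats about load on the source database apply.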
Currently we are implementing a POC in which we need to import real-time data from an RDBMS to Kafka using Attunity. How can we implement this?