I need help finding a solution for the following.
I want to read Oracle database archive logs, ingest them into Flume or some other tool, and then process the data (the table changes recorded in the logs) using Spark, or store it in Hive tables.
The latter part is yet to be finalised; right now I am looking at the ingestion part.
I know there is GoldenGate, but that's not feasible for us owing to the license cost.
So is there any open source tool for this?
NiFi is perfect for this -- this is one of its most common use cases. Your use case would take about 5-10 minutes to develop and deploy using NiFi.
NiFi is much easier to develop with than Flume and has a much greater capability set. It is UI- and configuration-oriented, so you can rapidly build, deploy and monitor flows, and there are a lot of Quality-of-Service features behind it. It is also enterprise-ready in terms of security, governance and multitenancy.
NiFi has a processor that tails a log file (sends new lines to the flow at a configured polling interval) and other processors that put to HDFS or stream to Hive. You can also place a processor in between to do transformations or smart routing (send to one target if the content has this text and send to another target if it has that text). It is usually a best practice not to transform, however, but to store all ingested data in HDP as raw so you can leverage the data for future use cases.
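To make the tail → route → store flow concrete, here is a minimal sketch in plain Python of the content-based routing decision a RouteOnContent-style processor would make. All names (`route_line`, the target labels) are illustrative, not NiFi APIs:

```python
def route_line(line: str) -> str:
    """Mimic a RouteOnContent-style decision: pick a downstream target
    based on what the tailed line contains."""
    if "ERROR" in line:
        return "hdfs_errors"   # e.g. a PutHDFS processor for error records
    if "UPDATE" in line:
        return "hive_stream"   # e.g. a PutHiveStreaming processor
    return "hdfs_raw"          # default: land everything raw in HDFS

# Simulate a few tailed log lines
for line in ["UPDATE emp SET sal = 100", "ERROR ORA-00600", "plain audit line"]:
    print(route_line(line))
```

In NiFi itself you would express the same branching with processor relationships in the UI rather than code; the point is only that the routing logic is a simple per-record content test.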
NiFi is part of HDF. NiFi and HDF are open source. HDF/NiFi is deployed on its own cluster and does not require HDP (Hadoop) but it is a very common integration.
These links can help you get started:
You can also take a look at the first part of this post -- it pretty much shows what you are attempting (but maybe you will not need the middle 3 processors and you would stream to Hive using PutHiveStreaming):
You can install NiFi on the server generating the logs and communicate with a central NiFi instance using a Remote Process Group.
Better yet, run the TailFile processor in MiNiFi on the edge and send the data to a central NiFi instance. This is the preferred way: MiNiFi is a lightweight deployment designed only for collecting data on the edge and sending it to a NiFi flow for processing.
See: http://hortonworks.com/blog/edge-intelligence-iot-apache-minifi/ (especially slideshare at bottom)
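As a rough illustration of the MiNiFi edge setup described above, a `config.yml` for tailing a file and shipping lines to a central NiFi input port looks something like the sketch below. Paths, names, the port id, and the URL are placeholders, and the exact keys vary by MiNiFi version, so check the MiNiFi documentation before using this:

```yaml
Processors:
  - name: TailArchiveLog
    class: org.apache.nifi.processors.standard.TailFile
    scheduling strategy: TIMER_DRIVEN
    scheduling period: 10 sec
    Properties:
      File to Tail: /var/log/app/app.log        # placeholder path

Remote Processing Groups:
  - name: CentralNiFi
    url: http://nifi-central:8080/nifi           # placeholder URL
    Input Ports:
      - id: 0000-0000-0000-0000                  # placeholder port id
        name: from-edge

Connections:
  - name: TailToCentral
    source name: TailArchiveLog
    destination name: from-edge
```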
@Rishit shah I believe what you are looking for is CDC (Change Data Capture). It's been done for MySQL, something like this: https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html
Here are a few other ways to do this: https://community.hortonworks.com/questions/12787/how-to-integrate-kafka-to-pull-data-from-rdbms.htm...
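The simplest of the "pull from RDBMS" alternatives mentioned in that thread is incremental polling on an incrementing key or timestamp column (the approach Kafka Connect's JDBC source uses). Here is a self-contained sketch using SQLite as a stand-in for the Oracle source; the table and column names are made up:

```python
import sqlite3

def poll_new_rows(conn, last_id):
    """Fetch rows inserted since last_id (incrementing-key polling),
    returning the rows plus the new high-water mark."""
    cur = conn.execute(
        "SELECT id, name FROM employees WHERE id > ? ORDER BY id", (last_id,))
    rows = cur.fetchall()
    new_last = rows[-1][0] if rows else last_id
    return rows, new_last

# Demo against an in-memory table standing in for the source database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(1, "ann"), (2, "bob")])

rows, last = poll_new_rows(conn, 0)       # first poll sees both rows
conn.execute("INSERT INTO employees VALUES (3, 'carl')")
rows2, last = poll_new_rows(conn, last)   # next poll sees only the new row
print(rows2)  # [(3, 'carl')]
```

Note the trade-off: unlike reading the redo logs, polling only captures inserts (and updates only if there is a reliable last-modified column), and it misses deletes entirely, which is why log-based CDC is preferred when you can get it.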
I think it is necessary to clarify that he is talking about Oracle's transaction and redo logs, which are stored in a binary format, not a log file you can easily tail (such as syslog or cron logs). If anyone has a useful solution to this, that would be awesome! CDC has been the bane of many a data engineer.
Jon, you are right in clarifying this, and it's been some time since your comment. Did you or someone else figure out a real solution for doing CDC with Oracle using the redo logs, or any other open source approach? It would be great to have that solution.
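One starting point without GoldenGate is Oracle's built-in LogMiner facility (`DBMS_LOGMNR`), which lets you read the binary redo/archive logs through ordinary SQL against `V$LOGMNR_CONTENTS`. Below is a hedged sketch of driving it from Python with the cx_Oracle driver; the connection string, log file path, and schema owner are placeholders, and it assumes the session has LogMiner privileges (none of which is verified here):

```python
# PL/SQL block to register an archive log and start a LogMiner session,
# using the online catalog as the dictionary source.
START_LOGMNR = """
BEGIN
  DBMS_LOGMNR.ADD_LOGFILE(
    LogFileName => :logfile,
    Options     => DBMS_LOGMNR.NEW);
  DBMS_LOGMNR.START_LOGMNR(
    Options => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
END;"""

# Query the mined changes: SCN, operation type, table, and the redo SQL.
CONTENTS_QUERY = (
    "SELECT scn, operation, table_name, sql_redo "
    "FROM v$logmnr_contents WHERE seg_owner = :owner"
)

def fetch_changes(conn, logfile, owner):
    """Start a LogMiner session on one archive log and yield row changes."""
    cur = conn.cursor()
    cur.execute(START_LOGMNR, logfile=logfile)
    cur.execute(CONTENTS_QUERY, owner=owner)
    for scn, op, table, redo in cur:
        yield {"scn": scn, "op": op, "table": table, "redo": redo}

# Usage (requires a live Oracle connection; placeholders throughout):
#   import cx_Oracle
#   conn = cx_Oracle.connect("miner/secret@db-host/ORCL")
#   for change in fetch_changes(conn, "/u01/arch/arch_0001.arc", "HR"):
#       print(change)   # feed each change into Kafka / NiFi / Hive
```

Each yielded change could then be handed to NiFi or Kafka as the ingestion layer discussed above. LogMiner is slower than GoldenGate and needs supplemental logging enabled to get full row images, but it is included with the database at no extra license cost.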