We are currently working on a project to identify the right toolset from the Hortonworks big data platform.
The requirement is to read 6 TB of records every night from a source Oracle DB, apply several rules (PL/SQL-like procedural code) to identify fraudulent transactions, and load the identified fraudulent transactions into a target Oracle DW for reporting purposes.
We are planning to take advantage of a Hortonworks cluster for data processing/ETL purposes, as we expect the volume of data to be processed from the source to increase with time.
Considering the above requirement, I would like to know the right toolset from the Hortonworks platform to ‘read data from source Oracle DB’, ‘apply procedural transformation rules’ and ‘write to target Oracle DB’.
At the moment the requirement is batch processing of the source data; in the future we might try real-time data processing.
I analyzed a combination of NiFi and Spark (DataFrames, Spark SQL), but would like to check whether a better technology combination exists for this requirement.
Hi @Venkat Gali, based on the process you described, I'd recommend Sqoop for general data loading from an RDBMS. You'll need to use a third-party solution like Attunity (or Oracle's GoldenGate) for more real-time data loads. Once the data is loaded, the transformation workloads can be handled fastest by Spark, Pig, Hive LLAP, or a combination of them. You may also want to look at HPL/SQL, but it's new on the scene and not fully baked into the platform.
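To make the Sqoop step a bit more concrete, here is a rough sketch of how a nightly import command could be assembled from a small Python wrapper. The connection string, user, password file path, table name, split column and HDFS directory are all made-up placeholders, not values from your environment, and the mapper count is only a starting point you would need to tune against what your Oracle source can tolerate.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table,
                       target_dir, split_by, num_mappers=8):
    """Return the argv list for a parallel Sqoop import of one table into HDFS.

    All argument values are supplied by the caller; nothing here is
    specific to a real cluster.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the source Oracle DB
        "--username", username,
        "--password-file", password_file,  # file holding the password, so it stays off the command line
        "--table", table,
        "--target-dir", target_dir,        # HDFS directory that receives the data
        "--split-by", split_by,            # column used to divide rows among mappers
        "--num-mappers", str(num_mappers), # number of parallel map tasks
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    cmd = build_sqoop_import(
        jdbc_url="jdbc:oracle:thin:@//dbhost:1521/ORCL",
        username="etl_user",
        password_file="/user/etl/.oracle_password",
        table="TRANSACTIONS",
        target_dir="/data/raw/transactions",
        split_by="TXN_ID",
        num_mappers=16,
    )
    # Print a shell-safe command line that could be placed in a scheduler job.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list rather than a single string keeps the arguments unambiguous if you later pass it to a scheduler or to `subprocess.run`. The rules themselves would then run on the imported data with Spark or Hive, as described above.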
I hope this helps get you started.
Based on your requirements, I have listed a few tools that should suit your use case.