
Right tool-set for Data processing/ ETL Requirement

New Contributor

Hello All,

We are currently working on a project to identify the right toolset from the Hortonworks big data platform.

The requirement is to read 6 TB of records every night from a source Oracle DB, apply several rules (PL/SQL-like procedural code) to identify fraudulent transactions, and load the fraudulent transactions identified into a target Oracle DW for reporting purposes.

We are planning to take advantage of a Hortonworks cluster for data processing/ETL purposes, as we expect the volume of source data to be processed to increase over time.

Considering the above requirement, I would like to know the right toolset from the Hortonworks platform to read data from the source Oracle DB, apply procedural transformation rules, and write to the target Oracle DB.

At this moment the requirement is batch processing of source data; in the future we might try real-time data processing.

I have analyzed a NiFi + Spark (DataFrames, Spark SQL) combination, but would like to check whether a better technology combination exists for this requirement.



Re: Right tool-set for Data processing/ ETL Requirement

Hi @Venkat Gali, based on the process you described I'd recommend Sqoop for general data loading from an RDBMS. You'll need a third-party solution like Attunity (or Oracle's GoldenGate) for more real-time data loads. Once the data is loaded, your transformation workloads can be handled by Spark, Pig, Hive LLAP, or a combination of them. You may also want to look at HPL/SQL for your PL/SQL-like procedural rules, but it's new on the scene and not yet fully baked into the platform.
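As a rough sketch of the Sqoop side of that flow (the connection strings, usernames, table names, and HDFS paths below are all hypothetical placeholders, and the transformation step in the middle would be your Spark/Hive job):

```shell
# Nightly bulk import from the source Oracle DB into HDFS.
# Tune --num-mappers to match your cluster and Oracle capacity.
sqoop import \
  --connect jdbc:oracle:thin:@//source-host:1521/ORCL \
  --username etl_user -P \
  --table TRANSACTIONS \
  --target-dir /data/raw/transactions \
  --num-mappers 8

# ... run your Spark / Hive fraud-rule transformations over
#     /data/raw/transactions, writing flagged rows to
#     /data/curated/fraud_transactions ...

# Export only the flagged fraudulent transactions to the target Oracle DW.
sqoop export \
  --connect jdbc:oracle:thin:@//dw-host:1521/DWH \
  --username etl_user -P \
  --table FRAUD_TRANSACTIONS \
  --export-dir /data/curated/fraud_transactions
```

Since this is a nightly batch, you could later swap the Sqoop import for an incremental import (`--incremental lastmodified`) to avoid re-reading the full 6 TB each run.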

I hope this helps get you started.