Right tool set for data processing/ETL requirement

Hello All,

We are currently working on a project to identify the right tool set from the Hortonworks big data platform.

The requirement is to read 6 TB of records every night from a source Oracle DB, apply several rules (PL/SQL-like procedural code) to identify fraudulent transactions, and load the identified fraudulent transactions into a target Oracle DW for reporting purposes.

We are planning to use a Hortonworks cluster for data processing/ETL, as we expect the volume of source data to increase over time.

Considering the above requirement, I would like to know the right tool set from the Hortonworks platform to ‘read data from the source Oracle DB’, ‘apply procedural transformation rules’ and ‘write to the target Oracle DB’.

At the moment the requirement is batch processing of the source data; in the future we might try real-time data processing.

I have analyzed a combination of NiFi and Spark (DataFrames, Spark SQL), but would like to check whether a better technology combination exists for this requirement.



Hi @Venkat Gali, based on the process you described, I'd recommend Sqoop for general data loading from an RDBMS. You'll need a third-party solution like Attunity (or Oracle's GoldenGate) for more real-time data loads. Once the data is loaded, the fastest way to handle the transformation workloads is Spark, Pig, Hive LLAP, or a combination of them. You may also want to look at HPL/SQL, but it's new on the scene and not fully baked into the platform.
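To make the Sqoop part a bit more concrete, here is a minimal Python sketch that only builds the command lines for a nightly import from Oracle into HDFS and an export of the flagged rows back to Oracle. It does not run Sqoop itself. The host, service name, table names, column name, HDFS paths and mapper counts are all hypothetical placeholders, and the export assumes the transformation job wrote comma-delimited text files; adjust everything to your environment.

```python
import shlex


def sqoop_import_args(jdbc_url, username, password_file, table,
                      split_by, target_dir, num_mappers=16):
    """Build the argument list for a parallel `sqoop import`."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,   # keeps the password off the command line
        "--table", table,
        "--split-by", split_by,             # column used to split work across mappers
        "--num-mappers", str(num_mappers),
        "--target-dir", target_dir,
    ]


def sqoop_export_args(jdbc_url, username, password_file, table,
                      export_dir, num_mappers=4):
    """Build the argument list for a `sqoop export` of delimited text files."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--export-dir", export_dir,
        "--input-fields-terminated-by", ",",  # assumes comma-delimited output
        "--num-mappers", str(num_mappers),
    ]


def to_shell(args):
    """Render an argument list as a safely quoted shell command string."""
    return " ".join(shlex.quote(a) for a in args)


# Hypothetical values, for illustration only.
SOURCE_URL = "jdbc:oracle:thin:@//oracle-src.example.com:1521/ORCL"
TARGET_URL = "jdbc:oracle:thin:@//oracle-dw.example.com:1521/DW"

import_cmd = sqoop_import_args(
    SOURCE_URL, "etl_user", "/user/etl/.oracle_pw",
    table="TRANSACTIONS", split_by="TXN_ID",
    target_dir="/data/raw/transactions",
)
export_cmd = sqoop_export_args(
    TARGET_URL, "etl_user", "/user/etl/.oracle_pw",
    table="FRAUD_TXN", export_dir="/data/out/fraud_txn",
)

print(to_shell(import_cmd))
print(to_shell(export_cmd))
```

The argument lists can be passed to `subprocess.run` from a scheduler, or the printed strings can be pasted into a shell script. The Spark, Pig or Hive step that applies the fraud rules would sit between the two commands, reading from the import directory and writing to the export directory.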

I hope this helps get you started.


Hi @venkatgalikm,


Based on your requirements, I have listed a few tools that suit your business needs.


  1. Stitch: Stitch is a self-service ETL data pipeline solution built for developers. The Stitch API can replicate data from any source, and handles bulk and incremental data updates. Stitch also provides a replication engine that relies on multiple strategies to deliver data to users. Its REST API supports JSON or Transit, which enables automatic detection and normalization of nested document structures into relational schemas. Stitch can connect to Amazon Redshift, Google BigQuery, and Postgres architectures, and integrates with BI tools. Stitch is typically used to collect, transform and load Google Analytics data into its own system, to automatically provide business insights on raw data.
  2. Alooma: Alooma offers an enterprise-scale data integration platform with strong ETL tools built in. The company puts a strong focus on rapid pipeline construction, data quality monitoring and error handling to ensure that customers don't lose or corrupt data in a potentially error-prone ETL process, but it also offers the flexibility to intervene and write your own scripts to monitor, clean and move your data as needed. As mentioned, Alooma is designed for enterprise-scale operations, so if you're a small startup with a small operating budget, Alooma probably isn't for you.
  3. Etleap: Built on AWS architecture, Etleap makes it easy to collect data from a wide range of sources and load it into your Redshift or Snowflake data warehouse. Its point-and-click, no-code interface makes it a good fit for data teams that want a lot of control over their ETL processes but don't necessarily want high IT overhead. Because it's integrated with AWS, Etleap also makes it easy to scale your data warehouse up and down with the same easy-to-use interface, while managing your ETL flows on the fly. Once data has been collected using one or many of its 50+ data integrations, users can also take advantage of Etleap's graphical data wrangling interface or fire up the SQL editor for data modeling and transformation. Orchestration and scheduling features make managing all your ETL pipelines and processes as easy as the click of a button. In addition to its SaaS offering, Etleap also provides a version that can be hosted on your own VPC.
  4. Blendo: Blendo offers a cloud-based ETL tool focused on letting users get their data into warehouses as quickly as possible using its suite of proprietary data connectors. Blendo's ETL-as-a-service product makes it easy to pull data in from a variety of data sources, including S3 buckets, CSVs, and a large array of third-party data sources like Google Analytics, Mailchimp, Salesforce and many others. Once you've set up the incoming end of the data pipeline, you can load it into a number of different storage destinations, including Redshift, BigQuery, MS SQL Server, Panoply and Snowflake.
  5. Sprinkle Data: Sprinkle is a SaaS platform providing an ETL tool for organisations. Its easy-to-use UX and code-free mode of operation make it easy for technical and non-technical users to ingest data from multiple data sources and drive real-time insights on the data. Its free trial lets users try the platform first and then pay if it fulfils their requirements.