traditional ETL vs open source

New Contributor

I am looking for a recommendation on using traditional ETL vs open source with respect to Hortonworks.


Re: traditional ETL vs open source

Expert Contributor

@sushil nagur

Some options:

- Hive can be used and is a common pattern. Land the data in HDFS, then use HiveQL to cleanse and transform it into a Hive table (e.g. ORC format). HBase can also be a target (or indeed Solr).

- SparkSQL is often used to ingest data. Again, land in HDFS, and use SparkSQL to process and add to Hive/HBase tables.

- HDF (NiFi) is more of a stealth ETL tool, or simple event processing, but it can perform a number of transforms. It also includes an expression builder/language and many out-of-the-box processors for different sources/targets.

- Pig can be used to build data pipelines. Sqoop can be used to extract data, but only performs basic transforms.

- Hortonworks has an ecosystem of partners with ETL solutions (e.g. Syncsort, etc.).

- Storm and Spark Streaming are options for streaming operations, and can use Kafka as a buffer.
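
As a rough illustration of the "land first, then transform with SQL" pattern from the Hive bullet above, here is a minimal sketch. It is not Hive: Python's built-in sqlite3 stands in for the SQL engine so the example runs anywhere, and the table names (raw_orders, orders), columns, and sample rows are all made up for illustration. In HDP the landing step would be files in HDFS and the transform would be HiveQL writing to a Hive table.

```python
import sqlite3


def land_and_transform(raw_rows):
    """Land raw rows untouched, then cleanse/transform them with SQL into a target table."""
    conn = sqlite3.connect(":memory:")

    # Landing step: store the data as-is (everything as text), like files dropped in HDFS.
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

    # Transform step: trim, normalise case, cast types, and drop rows with no amount.
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               UPPER(TRIM(customer))     AS customer,
               CAST(amount AS REAL)      AS amount
        FROM raw_orders
        WHERE amount IS NOT NULL AND TRIM(amount) <> ''
        """
    )
    return conn.execute(
        "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
    ).fetchall()


# Hypothetical sample data: the second row has an empty amount and is filtered out.
raw = [("1", " alice ", "10.50"), ("2", "Bob", ""), ("3", "carol", "7")]
curated = land_and_transform(raw)
print(curated)
```

The point of the sketch is only the ordering: the raw copy is kept unchanged, and all cleansing happens afterwards in SQL against the landed data.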

In terms of commercial ETL vs open source, it comes down to many points: requirements, budget, time, skills, strategy, etc. The commercial ETL tools are mature, and some have sophisticated functionality, transformations, and connectivity. Hortonworks partners with commercial ETL vendors when the scenario fits. In other scenarios, native HDP tooling (as listed above) is sufficient.

HTH, Graham

Re: traditional ETL vs open source

New Contributor

thanks Graham

Re: traditional ETL vs open source

The right recommendation comes down to your comfort level and the type of ETL you’re trying to do. The biggest difference is that you have fewer GUIs (but some good ones!) to work with for ETL in the Hortonworks stack. If you’re comfortable with some SQL, scripting, and programming, our stack is great for doing ETL at scale. Here’s a breakdown of the tools and where you can use them in our stack:

ETL Options in Hortonworks

Extraction / Load - Apache Sqoop, Apache NiFi, Syncsort

Transformations - Apache Hive, Apache Spark, Apache Pig, Apache NiFi

Other items to consider for ETL Work

Orchestration - Ambari Workflow Manager (Oozie UI), Apache NiFi

Data Discovery - Apache Zeppelin, Apache Solr

Additionally, ETL takes several forms in Hadoop.

  1. ELT is the more common pattern. In a traditional Informatica ETL pattern, you would extract from source systems, transform in PowerCenter, and land in the target. In Hadoop, you’ll typically extract from the source, land in Hadoop, transform, and then land in the target (e.g. Hive). For this pattern, we would typically recommend Sqoop for EL and Hive, Spark, or Pig for T.
  2. EtL (little t) is another pattern, used with streaming ingest pipelines. You’ll extract or capture the source, do light transformation (e.g. preparation, conversions, enrichment, etc.), and then land into Hadoop. These light transformations are not typically batch oriented. For this pattern, we would typically recommend Apache NiFi.
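
To make the "little t" in EtL a bit more concrete, here is a minimal sketch of a light, record-at-a-time transformation. It is plain Python rather than NiFi (NiFi would do this with processors and its expression language), and the field names (user, temp_f, temp_c, source) and the sample events are invented for illustration only.

```python
def light_transform(records, source_name):
    """Yield each record after light preparation, conversion, and enrichment.

    Records are handled one at a time as they arrive, not as a batch.
    """
    for rec in records:
        out = dict(rec)  # copy so the captured source record is left untouched
        # Preparation: tidy up whitespace and case.
        out["user"] = out["user"].strip().lower()
        # Conversion: Fahrenheit (text) to Celsius (number).
        out["temp_c"] = round((float(out.pop("temp_f")) - 32) * 5 / 9, 1)
        # Enrichment: tag the record with where it came from.
        out["source"] = source_name
        yield out


# Hypothetical events captured from a source system.
events = [{"user": " Ann ", "temp_f": "212"}, {"user": "BOB", "temp_f": "32"}]
landed = list(light_transform(events, "sensor-feed"))
print(landed)
```

Anything heavier than this, such as joins or aggregations across records, belongs to the batch T of the ELT pattern in item 1 instead.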

Things that are not in the platform and that you have to account for:

  • Master Data Repository
  • Cleansing Rules
  • Enrichment Modules (e.g. address cleansing)
  • Change Data Capture
  • Reusable Templates (except with NiFi)

In some cases you can use external services for the items above. Or, because the beauty of open source is that it’s highly extensible, you can build or leverage integrations with other tools that may assist with cleansing, enrichment, etc. As in the days before commercial ETL tools existed, you can build all of the items mentioned above as part of your overall data management environment.
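
As one example of building such a piece yourself, change data capture can be approximated by comparing two full extracts of a table. The sketch below is only an assumption about how you might do it, not something the platform ships: the function name diff_snapshots, the keyed-dictionary layout, and the sample rows are all hypothetical, and log-based CDC tools work quite differently.

```python
def diff_snapshots(previous, current):
    """Compare two {primary_key: row} snapshots and return (inserts, updates, deletes)."""
    # Keys that only exist in the new snapshot are inserts.
    inserts = {k: v for k, v in current.items() if k not in previous}
    # Keys that only exist in the old snapshot are deletes.
    deletes = {k: v for k, v in previous.items() if k not in current}
    # Keys in both snapshots whose row content differs are updates.
    updates = {k: v for k, v in current.items() if k in previous and previous[k] != v}
    return inserts, updates, deletes


# Hypothetical extracts of the same source table on two consecutive days.
yesterday = {1: {"name": "Ann", "city": "Oslo"}, 2: {"name": "Bob", "city": "Rome"}}
today = {1: {"name": "Ann", "city": "Bergen"}, 3: {"name": "Cy", "city": "Lima"}}

inserts, updates, deletes = diff_snapshots(yesterday, today)
print(inserts, updates, deletes)
```

This holds both snapshots in memory, so it only suits small tables as written; at scale the same comparison would be expressed as a join in Hive or Spark.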

Re: traditional ETL vs open source

Super Guru
@sushil nagur

I agree with both @Graham Martin and @ccasano. Instead of talking about tools, which you already know from the answers above, I'll talk about why CIOs prefer Hortonworks for offloading their existing ETL jobs.

As Graham mentions, we have partners such as Informatica, Talend, Pentaho, and Syncsort whose tools you can use to write your ETL jobs in Hadoop. What this gives you is faster time to market, which is the same story as with previous ETL tools. They save you the time of writing ETL code by hand and help prevent the bugs you might introduce if you wrote your own code. Under the hood, they use similar technologies such as Spark and MapReduce, and even the same fast connectors that Sqoop uses. So why use Hortonworks?

The answer lies in the storage engine where all the processing happens. Without Hortonworks, in legacy/existing systems, CIOs pay a significantly higher cost per TB for doing ETL. Some companies even do ELT, which means they first load data into their data warehouse and then use the processing power of that system to perform the transformations. This takes very expensive resources away from the reporting and ad hoc business queries that the EDW was purchased for to begin with. When you offload those jobs onto Hadoop, you free up that capacity and processing power for reporting and business use.

Your per-TB cost of doing ETL in Hadoop is a fraction of what it is in traditional ETL systems. This is the main motivation for offloading ETL to Hadoop. You perform ETL in Hadoop and then push your final result into your EDW.
