- Hive can be used and is a common pattern. Land the data in HDFS, and use HiveQL to cleanse, transform, into a Hive Table (e.g. ORC format). HBase can also be a target (or indeed Solr).
- SparkSQL is often used to ingest data. Again, land in HDFS, and use SparkSQL to process and add to Hive/HBase tables.
- HDF (Ni-Fi) is more of a stealth ETL Tool or simple event processing, but can perform a number of transforms (also includes an expression builder/language, and many out of the box processors for different sources/targets.
- Pig can be used to build data pipelines. Sqoop can be used to extract data, but only performs basic transforms.
- Hortonworks has an eco-system of Partners with ETL solutions (e.g. Syncsort, etc.).
- Storm and Spark Streaming are options for streaming operations, can be use Kafka as a buffer.
In terms of commercial ETL vs Open Source, it comes down to many points - requirements, budget, time, skills, strategy, etc. The commercial ETL tools are mature, and some have sophisticated functionality. transformations, and connectivity. Hortonworks partners with commercial ETL vendors when the scenario fits. In other scenarios, native HDP tooling (as listed above) is sufficient.
It comes down to your comfort level and the type of ETL you’re are trying to do to give your a proper recommendation. The biggest difference is that you have less GUI’s (but some good ones!) to work with for ETL in the Hortonworks stack. If your comfortable with some SQL, scripting and programming our stack is great for doing ETL at scale. Here’s a break down of the tools and where you can use them in our stack
ELT is more of a common pattern. In a traditional Informatica ETL pattern, you would extract from source systems, transform in PowerCenter and land in target. In Hadoop, you’ll typically extract from source, land in Hadoop, transform, land in target (i.e. Hive). For this pattern, we would typically recommend Sqoop for EL and Hive, Spark or Pig for T.
EtL (little t) is another pattern with streaming ingest pipelines. You’ll extract or capture the source, do light transformation (i.e. preparation, conversions, enrichment, etc) and then land into Hadoop. For these light transformations, they are not typically batch oriented. For this pattern, we would typically recommend Apache NiFi.
Things that are not in the platform that you have to account for.
Master Data Repository
Enrichment Modules (i.e. address cleansing)
Change Data Capture
Reuseable Templates (except with NiFi)
In some cases you can use external services for the items above. Or because the beauty of Open Source is that it’s highly extensible, build or leverage integrations into other tools that may assist with cleansing, enrichment, etc. If you go back to the days before commercial ETL tools existed, you can build all of the items mentioned above as part of your overall data management environment.
I agree with both @Graham Martin and @ccasano. Instead of talking about tools which you already know from above answers, I'll talk about why CIOs prefer Hortonworks for offloading their existing ETL jobs.
As Graham mentions, we have partners like Informatica, Talend, Pentaho, Syncsort that you can use to write your ETL jobs in Hadoop. What this gives you is faster time to market which is the same story as previous ETL tools. They save time from writing code and your ETLs manually. Prevents bugs that you may have if you were to write your own code. Under the hood, they use similar technologies like Spark, Map/Reduce and even same fast connectors that Sqoop uses. So why use Hortonworks?
Because where is the storage engine where all the processing is happening? Without Hortonworks, in the legacy/existing systems, CIOs are paying significantly higher cost per TB of doing the ETL. Some companies are even doing ELT which means they first load data into their data warehouse and then use the processing power of that system to perform transformation. This takes away very expensive resources from reporting/adhoc queries from business which is what the EDW was purchased for to begin with. When you offload those jobs onto Hadoop, you free up all that capacity from these systems and free up the processing power for reporting and business use.
Your per TB cost of doing ETL in Hadoop is fraction of what it is in traditional ETL systems. This is the main motivation of offloading ETL in Hadoop. You perform ETL in Hadoop and then push your final result into your EDW.
Here we have listed a few ETL tools both, traditional and Open source you can have a look at them and see for yourself which one suits your use case.
1. Panoply: Panoply is the main cloud ETL supplier and data warehouse blend. With 100+ data connectors, ETL and data ingestion is quick and simple, with only a couple of snaps and a login among you and your recently coordinated data. In the engine, Panoply is really utilizing an ELT approach (instead of conventional ETL), which makes data ingestion a lot quicker and progressively powerful, since you don't need to trust that change will finish before stacking your data. What's more, since Panoply fabricates oversaw cloud data warehouses for each client, you won't have to set up a different goal to store all the data you pull in utilizing Panoply's ELT procedure. On the off chance that you'd preferably utilize Panoply's rich arrangement of data gatherers to set up ETL pipelines into a current data warehouse, Panoply can likewise oversee ETL forms for your Azure SQL Data Warehouse.
2. Stitch: Stitch is a self-administration ETL data pipeline. The Stitch API can reproduce data from any source, and handle mass and gradual data refreshes. Stitch additionally gives a replication motor that depends on various techniques to convey data to clients. Its REST API underpins JSON or travel, which empowers programmed recognition and standardization of settled report structures into social constructions. Stitch can associate with Amazon Redshift engineering, Google BigQuery design, and Postgres design - and incorporates with BI apparatuses. Stitch is normally intended to gather, change and burden Google examination data into its own framework, to naturally give business bits of knowledge on crude data.
3. Sprinkle: Sprinkle is a SaaS platform providing ETL tool for organisations.Their easy to use UX and code free mode of operations makes it easy for technical and non technical users to ingest data from multiple data sources and drive real time insights on the data. Their Free Trial enables users to first try the platform and then pay if it fulfils the requirement.
Some of the open source tools include
1. Heka: Heka is an open source programming framework for elite data gathering, investigation, observing and detailing. Its principle part is a daemon program known as 'hekad' that empowers the usefulness of social occasion, changing over, assessing, preparing and conveying data. Heka is written in the 'Go' programming language, and has worked in modules for contributing, disentangling, separating, encoding and yielding data. These modules have various functionalities and can be utilized together to assemble a total pipeline. Heka utilizes Advanced Message Queuing Protocol (AMQP) or TCP to transport data starting with one area then onto the next. It tends to be utilized to stack and parse log records from a document framework, or to perform constant investigation, charting and inconsistency recognition on a data stream.
2. Logstash: Logstash is an open source data handling pipeline that ingests data from numerous sources at the same time, changing the source data and store occasions into ElasticSearch as a matter of course. Logstash is a piece of an ELK stack. The E represents Elasticsearch, a JSON-based hunt and investigation motor, and the K represents Kibana, which empowers data perception. Logstash is written in Ruby and gives a JSON-like structure which has a reasonable division between inner items. It has a pluggable structure highlighting more than 200 modules, empowering the capacity to blend, coordinate and arrange offices over various information, channels and yield. This instrument can be utilized for BI, or in data warehouses with bring, change and putting away occasion capacities.
3. Singer: Singer's open source, order line ETL instrument permits clients to assemble measured ETL pipelines utilizing its "tap" and "target" modules. Rather than building a solitary, static ETL pipeline, Singer gives a spine that permits clients to interface data sources to capacity goals. With a huge assortment of pre-constructed taps, the contents that gather datapoints from their unique sources, and a broad choice of pre-fabricated focuses on, the contents that change and burden data into pre-determined goals, Singer permits clients to compose succinct, single-line ETL forms that can be adjusted on the fly by trading taps and focuses in and out.