- Hive can be used and is a common pattern: land the data in HDFS, then use HiveQL to cleanse and transform it into a Hive table (e.g. in ORC format). HBase can also be a target (or indeed Solr).
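As a minimal sketch of that pattern (paths, table names, and the JDBC URL below are all hypothetical), you might land raw CSV files in HDFS, expose them through an external Hive table, and then rewrite them into an ORC-backed table with HiveQL:

```shell
# Land the raw files in HDFS (path is illustrative)
hdfs dfs -mkdir -p /data/raw/events
hdfs dfs -put events_2017.csv /data/raw/events/

# Cleanse/transform into an ORC-backed Hive table via beeline
beeline -u jdbc:hive2://hiveserver:10000 -e "
CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (
  event_ts STRING, user_id STRING, amount STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/events';

CREATE TABLE IF NOT EXISTS events_orc (
  event_ts TIMESTAMP, user_id STRING, amount DECIMAL(10,2))
STORED AS ORC;

-- Basic cleansing: cast types and drop rows with no user id
INSERT INTO TABLE events_orc
SELECT CAST(event_ts AS TIMESTAMP), user_id, CAST(amount AS DECIMAL(10,2))
FROM raw_events
WHERE user_id IS NOT NULL AND user_id != '';
"
```

The same INSERT...SELECT shape applies when the target is HBase or Solr, via the appropriate Hive storage handler instead of ORC.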
- SparkSQL is often used to ingest data. Again, land the data in HDFS, and use SparkSQL to process it and add it to Hive/HBase tables.
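A comparable SparkSQL sketch, run here through the `spark-sql` CLI against the Hive metastore (the table names are illustrative, and `raw_clicks` is assumed to already exist as a landed table):

```shell
# SparkSQL can run the same HiveQL-style transforms as Hive;
# here a create-table-as-select writes cleansed data as ORC.
# Table names are placeholders.
spark-sql -e "
CREATE TABLE IF NOT EXISTS clicks_clean STORED AS ORC AS
SELECT ip, url, CAST(ts AS TIMESTAMP) AS ts
FROM raw_clicks
WHERE url IS NOT NULL;
"
```

The same statement could equally be issued from a PySpark or Scala job with `spark.sql(...)` if you prefer programmatic pipelines.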
- HDF (NiFi) is more of a lightweight ETL tool or simple event processor, but it can perform a number of transforms (it also includes an expression builder/language and many out-of-the-box processors for different sources/targets).
- Pig can be used to build data pipelines. Sqoop can be used to extract data, but it only performs basic transforms.
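For example, a Sqoop import can pull a relational table into Hive, with only basic transforms such as column selection and a WHERE filter (the JDBC URL, credentials, and table names below are placeholders):

```shell
# Pull selected columns of a relational table into a Hive staging table.
# Connection details and names are hypothetical.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --columns "order_id,customer_id,amount" \
  --where "order_date >= '2017-01-01'" \
  --hive-import --hive-table orders_staging \
  --num-mappers 4
```

Anything beyond this kind of column/row pruning is better done downstream in Hive, Spark, or Pig.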
- Hortonworks has an ecosystem of partners with ETL solutions (e.g. Syncsort).
- Storm and Spark Streaming are options for streaming operations, and both can use Kafka as a buffer.
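As an illustration of the Kafka-buffered pattern (the broker, topic, jar, and class names are all hypothetical): producers write events into a Kafka topic, and a separately packaged Spark Streaming job consumes and processes them.

```shell
# Producers write events into a Kafka topic acting as the buffer
kafka-console-producer.sh --broker-list broker1:9092 --topic events <<'EOF'
{"user":"a","action":"click"}
EOF

# A Spark Streaming job (built and packaged separately) consumes the topic.
# The spark-streaming-kafka package version must match your Spark/Scala build.
spark-submit \
  --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.1.0 \
  --class com.example.StreamingETL \
  streaming-etl.jar broker1:9092 events
```

A Storm topology would sit in the same place as the Spark job, reading from the same Kafka topic via a Kafka spout.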
In terms of commercial ETL vs. open source, it comes down to many factors: requirements, budget, time, skills, strategy, etc. The commercial ETL tools are mature, and some have sophisticated functionality, transformations, and connectivity. Hortonworks partners with commercial ETL vendors when the scenario fits; in other scenarios, the native HDP tooling listed above is sufficient.
A proper recommendation comes down to your comfort level and the type of ETL you're trying to do. The biggest difference is that you have fewer GUIs (but some good ones!) to work with for ETL in the Hortonworks stack. If you're comfortable with some SQL, scripting, and programming, our stack is great for doing ETL at scale. Here's a breakdown of the tools and where you can use them in our stack:
ETL Options in Hortonworks
Extraction / Load - Apache Sqoop, Apache NiFi, Syncsort
Transformations - Apache Hive, Apache Spark, Apache Pig, Apache NiFi
Other items to consider for ETL Work
Orchestration - Ambari Workflow Manager (Oozie UI), Apache NiFi
Data Discovery - Apache Zeppelin, Apache Solr
Additionally, ETL takes several forms in Hadoop.
There are also things that are not in the platform that you have to account for.
In some cases you can use external services for the items above. Alternatively, because the beauty of open source is that it's highly extensible, you can build or leverage integrations with other tools that may assist with cleansing, enrichment, etc. If you go back to the days before commercial ETL tools existed, you could build all of the items mentioned above yourself as part of your overall data management environment.
I agree with both @Graham Martin and @ccasano. Instead of talking about tools, which you already know from the answers above, I'll talk about why CIOs prefer Hortonworks for offloading their existing ETL jobs.
As Graham mentions, we have partners like Informatica, Talend, Pentaho, and Syncsort that you can use to write your ETL jobs in Hadoop. What this gives you is faster time to market, which is the same story as with previous ETL tools: they save you from writing your ETL code manually and prevent bugs you might introduce if you wrote your own. Under the hood, they use similar technologies like Spark and MapReduce, and even the same fast connectors that Sqoop uses. So why use Hortonworks?
It comes down to where the storage and processing actually happen. Without Hortonworks, on legacy/existing systems, CIOs pay a significantly higher cost per TB to do ETL. Some companies are even doing ELT, which means they first load data into their data warehouse and then use the processing power of that system to perform the transformation. This takes very expensive resources away from the reporting and ad hoc queries the EDW was purchased for in the first place. When you offload those jobs onto Hadoop, you free up all that capacity and processing power for reporting and business use.
Your per-TB cost of doing ETL in Hadoop is a fraction of what it is in traditional ETL systems. This is the main motivation for offloading ETL to Hadoop: you perform the ETL in Hadoop and then push your final results into your EDW.
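That final "push into your EDW" step is often just a Sqoop export of the transformed Hive/HDFS data (the JDBC URL, table, warehouse path, and delimiter below are placeholders):

```shell
# Export transformed results from the Hive warehouse directory into an EDW table.
# Connection details, table name, and path are hypothetical.
sqoop export \
  --connect jdbc:postgresql://edw-host:5432/dw \
  --username etl_user -P \
  --table fact_orders \
  --export-dir /apps/hive/warehouse/orders_final \
  --input-fields-terminated-by '\001' \
  --num-mappers 8
```

The heavy lifting (joins, cleansing, aggregation) has already happened in Hadoop, so the EDW only receives the comparatively small, final result set.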