
All the different ETL/pipeline tools are becoming blurry as to which to use for what


So the picture is getting quite blurry between all of the pipeline/ETL tools available.


* NiFi

* StreamSets

* Kafka (?)

* Luigi

* Airflow

* Falcon

* Oozie

* A Microsoft solution?

I've got several projects where I could see a use for a pipeline/flow tool, where ETL is the point of the entire project. So what are the strengths and weaknesses of each? Where should I be using one or the other? Where does one shine where another would be difficult to manage or be overkill for the project? Which would be the most lightweight of the tools?

I have several projects, but two stick out in my mind. They are completely unrelated to each other and do NOT overlap at all.

1) The first project is a simple ETL for XML data. In simple terms, 20 or so machines write out XML log data to their local drives, which are shared on the network. A Python application connects to each machine's share and copies the data to the local system to archive the raw data. The same application then reads the XML data from the files, extracts all of the relevant content, and stores it in a Microsoft SQL Server database. Currently the application is run every 20 minutes through a Huey cron task in Python to look for new data on the shares. This is a Windows-only application/ecosystem, so using something from the MS world isn't out of the question either (hence why I included it).
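For project 1, the extract step can be prototyped independently of whichever scheduler or flow tool ends up running it. A minimal sketch, assuming hypothetical tag and field names (`<entry>`, `level`, `message`) in place of the real log schema:

```python
import xml.etree.ElementTree as ET

def extract_records(xml_text):
    """Parse one machine's XML log and pull out the fields to load.

    The <entry> element and its timestamp/level/message fields are
    hypothetical placeholders for the real log schema. The returned
    rows would then be inserted into SQL Server, e.g. via pyodbc.
    """
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.iter("entry"):
        records.append({
            "timestamp": entry.get("timestamp"),
            "level": entry.findtext("level"),
            "message": entry.findtext("message"),
        })
    return records

sample = """<log>
  <entry timestamp="2020-01-01T00:00:00">
    <level>INFO</level><message>machine started</message>
  </entry>
</log>"""

print(extract_records(sample))
```

Keeping the parse/extract logic in a plain function like this means the same code can be scheduled by Huey, wrapped in an Airflow task, or invoked from a NiFi processor without changes.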

2) The second project is more of a "pipeline". We have about 2 million files that will need to run through a process of: a) original format --> b) converted to an industry-standard format --> c) data massaged to fit our needs --> d) data converted --> e) intermediate results written out to disk --> f) data used to train a deep learning model. For inference on a file, steps a) through e) would be performed as before, step f) would be replaced with inference against the trained model, and the results would then be passed down to g) (another application). This is initially going to be done on Linux, but they (potentially) want to end up on Windows, so that could be a consideration.
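Whichever orchestrator is chosen, the lettered steps above reduce to a chain of single-purpose stage functions, which keeps the training and inference paths identical up through step e). A minimal sketch with stubbed stages (all stage bodies here are hypothetical placeholders for the real conversions):

```python
# Hypothetical stage functions for steps b) through e); each takes the
# previous stage's output. Real stages would do the actual conversions.
def to_standard_format(raw):      # b) convert to industry-standard format
    return {"standard": raw}

def massage(data):                # c) massage data to fit our needs
    data["massaged"] = True
    return data

def convert(data):                # d) data conversion
    data["converted"] = True
    return data

def write_intermediate(data):     # e) write intermediate results (stubbed)
    return data

STAGES = [to_standard_format, massage, convert, write_intermediate]

def run_pipeline(raw):
    """Run one file's data through the shared a)-e) stages.

    The returned value is what step f) would consume -- either as a
    training example or as input to model inference.
    """
    data = raw
    for stage in STAGES:
        data = stage(data)
    return data

print(run_pipeline("file-bytes"))
```

Structuring it this way also keeps the Linux-to-Windows question open: the stage functions stay portable, and only the orchestration layer (Airflow, Luigi, NiFi, etc.) would need to change.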

So for these two items, what would you end up choosing? From everything I have read and researched, NiFi would be able to handle the get and put of the data files easily, but for calling the Python code to extract the data and put it in the database, how would NiFi handle that? It also looks to me like NiFi/StreamSets are a lot more heavyweight and usually operate within the Hadoop ecosystem. I'm not working with Hadoop/HDFS on either of these two applications. Any input on the strengths/weaknesses/specific use cases for these examples would be greatly appreciated!
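On the specific question of NiFi calling Python: one common pattern is NiFi's ExecuteStreamCommand processor, which runs an external command, pipes the FlowFile content to its stdin, and captures its stdout as the new FlowFile content. That means the Python extractor only needs to behave as a stdin-to-stdout filter. A minimal sketch, again with hypothetical tag names:

```python
import sys
import xml.etree.ElementTree as ET

def flowfile_to_rows(stream):
    """Turn one XML document (read from a stream) into delimited rows.

    NiFi's ExecuteStreamCommand pipes the FlowFile content to the
    command's stdin and uses its stdout as the output content, so a
    plain filter like this is the integration point. The <entry> tag
    and its fields are hypothetical placeholders for the real schema.
    """
    root = ET.parse(stream).getroot()
    return [",".join((e.get("timestamp", ""), e.findtext("level", default="")))
            for e in root.iter("entry")]

if __name__ == "__main__":
    for row in flowfile_to_rows(sys.stdin):
        print(row)
```

The downstream database insert could then be a separate NiFi processor (e.g. a PutSQL-style step) or could stay inside the Python script, keeping NiFi purely responsible for moving the files.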



Please find below a list of a few leading ETL tools:


  1. Matillion: Matillion's ETL tool is, according to its developers, purpose-built for cloud data warehouses, so it could be a particularly strong choice for users who are especially interested in loading data into Amazon Redshift, Google BigQuery or Snowflake. With more than 70 native data source integrations, as well as an optional no-code graphical interface, Matillion makes loading your data into your warehouse of choice simple and straightforward. It also automates the data transformations you'll need in order to prepare your data for analysis with your preferred BI tool. Matillion is billed hourly for usage, so it could also be especially attractive for those whose ETL jobs run only part of the time.
  2. Talend: Talend's open source data integration products provide software to integrate, cleanse, mask and profile data. Talend has a GUI that enables managing a large number of source systems using standard connectors. It also has Master Data Management (MDM) functionality, which allows organisations to have a single, consistent and accurate view of key enterprise data. This can create better transparency across a business, and lead to better operational efficiency, marketing effectiveness and compliance.
  3. Fivetran: Fivetran is a fully managed data pipeline with a web interface that integrates data from SaaS services and databases into a single data warehouse. It provides direct integration and sends data over a secure connection using a sophisticated caching layer. This caching layer helps move data from one point to another without ever storing a copy on the application server. Fivetran doesn't impose any data limit, and can be used to centralize a company's data and integrate all sources to determine Key Performance Indicators (KPIs) across an entire organisation.
  4. Stitch: Stitch is a self-service ETL data pipeline solution built for developers. The Stitch API can replicate data from any source, and handle bulk and incremental data updates. Stitch also provides a replication engine that relies on multiple strategies to deliver data to users. Its REST API supports JSON or Transit, which enables automatic detection and normalisation of nested document structures into relational schemas. Stitch can connect to Amazon Redshift, Google BigQuery, and Postgres - and integrates with BI tools. Stitch is typically used to collect, transform and load Google Analytics data into its own system, to automatically provide business insights on raw data.
  5. Sprinkle Data: Sprinkle is a SaaS platform providing an ETL tool for organisations. Its easy-to-use UX and code-free mode of operation make it easy for technical and non-technical users to ingest data from multiple data sources and derive real-time insights from the data. Its free trial enables users to first try the platform and then pay if it fulfils their requirements.