So the picture is getting quite blurry between all of the pipeline/ETL tools available.
* Kafka (?)
* A Microsoft solution?
I've got several projects where I could see a use for a pipeline/flow tool, where ETLing is the point of the entire project. So what are the strengths and weaknesses of each? Where should I be using one or the other? Where does one shine where another would be difficult to manage or be overkill for the project? Which is the most lightweight of the tools?
I have several projects, but two stick out in my mind. They are completely unrelated and do NOT overlap at all.
1) The first project is a simple ETL for XML data. In simple terms, 20 or so machines write XML log data to their local drives, which are shared on the network. A Python application connects to each machine's share and copies the data to the local system to archive the raw files. The same application then reads the XML from the files, extracts all of the relevant content, and stores it in a Microsoft SQL Server database. The application currently runs every 20 minutes via a Huey cron task in Python to look for new data on the shares. This is a Windows-only application/ecosystem, so using something from the MS world isn't out of the question either (hence why I included it).
2) The second project is more of a "pipeline". We have about 2 million files that need to run through this process: a) original format --> b) converted to an industry-standard format --> c) data massaged to fit our needs --> d) data converted --> e) intermediate results written out to disk --> f) data used to train a deep learning model. For inference on a file, steps a) through e) would be performed the same way, but step f) would be replaced with inference against the trained model, which would then pass its results down to g) (another application). This is initially going to be done on Linux but may (potentially) end up on Windows, so that could be a consideration.
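To make project 2 concrete, the a) --> f) chain is really just a linear composition of stages, where the last stage gets swapped between training and inference. A toy sketch in Python (every stage name here is a placeholder, not real code from the project):

```python
from functools import reduce

# Placeholder stages -- the real ones would do format conversion, massaging,
# etc.; here each just tags the payload so the flow of data is visible.
def to_standard_format(x):  return x + ["b:standard"]
def massage(x):             return x + ["c:massaged"]
def convert(x):             return x + ["d:converted"]
def write_intermediate(x):  return x + ["e:written"]
def train_model(x):         return x + ["f:trained"]

TRAIN_STAGES = [to_standard_format, massage, convert,
                write_intermediate, train_model]

def run_pipeline(payload, stages=TRAIN_STAGES):
    """Run payload through the stages in order.

    For inference, pass the same list with train_model replaced by an
    inference stage whose output feeds the downstream g) application.
    """
    return reduce(lambda acc, stage: stage(acc), stages, payload)

print(run_pipeline(["a:original"]))
```

Keeping the stage list explicit like this is what makes the "swap f) for inference" requirement cheap, whichever orchestrator ends up driving it.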
For these two items, what would you end up choosing? From everything I have read and researched, NiFi would be able to handle the get and put of the data files easily, but how would NiFi handle calling the Python code to extract the data and put it in the database? It also looks to me like NiFi/StreamSets are a lot more heavyweight and usually operate within the Hadoop ecosystem. I'm not working with Hadoop/HDFS on either of these two applications. Any input on the strengths/weaknesses/specific use cases for these examples would be greatly appreciated!
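For what it's worth, the pattern I keep seeing suggested for calling existing Python from NiFi is the ExecuteStreamCommand processor, which (as I understand it) pipes the flowfile content to the script's stdin and takes whatever the script writes to stdout as the new flowfile content. That would keep the existing extraction code mostly intact. A rough sketch, assuming an XML-to-CSV extraction step (the tag names are made up, not the real log schema):

```python
import sys
import xml.etree.ElementTree as ET

def flowfile_to_csv(xml_text):
    """Extract the relevant fields from one XML flowfile as CSV lines.

    <entry>, its timestamp attribute, and <machine> are hypothetical tag
    names standing in for the real log schema.
    """
    root = ET.fromstring(xml_text)
    return "\n".join(
        f'{e.get("timestamp")},{e.findtext("machine")}'
        for e in root.iter("entry")
    )

def main():
    # When wired into NiFi via ExecuteStreamCommand, the processor streams
    # the flowfile content to stdin and replaces it with what hits stdout.
    sys.stdout.write(flowfile_to_csv(sys.stdin.read()))

# main() would be invoked by NiFi; shown inline here for illustration:
print(flowfile_to_csv(
    '<log><entry timestamp="t1"><machine>m1</machine></entry></log>'))
```

The database insert could then either stay in the Python script or be handed to a downstream NiFi processor such as a database-put step, depending on how much of the flow you want NiFi to own.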