I'm struggling with setting up data ingestion / ETL / processing pipelines and architectures.
I've had MANY tasks over the past few years that deal with ETLing data and getting it into a usable form, and I'm struggling with setting up a reasonable process that is general enough, easy enough, and reusable enough to handle the tasks I throw at it. These projects are normally completely separate and have NO overlap (project 1 does not necessarily have anything to do with project 2 and can be completely unrelated, with a completely different set of circumstances, restrictions, etc.).
In the past, for most of these projects I've written an application in Python (normally) to do everything: pulling the files from their source (XML, CSV, other databases), processing the data accordingly, and pushing it to the respective destinations. This works, and while having to write everything in Python, from pulling the data, to processing it, to pushing it to the destination, is tedious at times, it provides an EXTREMELY high level of flexibility to handle all of the "issues" that arise when you actually get into the middle of the data, the ones you can't foresee when someone says "hey, can you capture this data and make it available for us?" I have a lot of confidence that with Python and its libraries I can handle almost any exception without hacks or tons of inefficient work.
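To make this concrete, here's a minimal sketch of the shape these one-off scripts usually take. None of this is any real project's code, and the column names and cleanup rules are made up for illustration; the point is that extract, transform, and load all live in one hand-written Python program.

```python
import csv
import io

def extract(raw_csv: str) -> list[dict]:
    """Pull rows out of a CSV blob (a real project would read from disk, FTP, a DB, etc.)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    """Per-project cleanup: trim whitespace, normalize empty strings to None."""
    cleaned = []
    for row in rows:
        cleaned.append({k.strip(): (v.strip() or None) for k, v in row.items()})
    return cleaned

def load(rows: list[dict]) -> list[dict]:
    """Push to the destination; here we just return the rows for illustration."""
    return rows

def run_pipeline(raw_csv: str) -> list[dict]:
    # The whole pipeline is one function call -- flexible, but everything
    # (scheduling, retries, delivery) is on me to write by hand each time.
    return load(transform(extract(raw_csv)))
```

The flexibility comes from the transform step being plain Python, so any weird edge case can be handled inline; the cost is that every project re-implements the surrounding plumbing.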
So take this work as example.
I have a CSV that is messy. A few of the "translations" that need to take place:
Using Nifi, Shu (much appreciated) from my example thread was able to help me do #1 and #2. But #3 (which I didn't ask about) is not going to be feasible at all. Sure, I can call an external processing script from Nifi, but then I end up with a pile of "random" scripts inside of Nifi, each handling a different part of the process, which makes things difficult to test and maintain.
It almost feels like I need something of a "programmable" Nifi library/framework. I could tell the framework to "go tail these sets of files with X parameters" on this schedule; when it runs, I get the context of that information back, similar to Nifi, but available directly in Python (could be any language, Python is just an example), where I can use pandas or other libraries to manipulate that data as I see fit, do my processing on it, and then call other "Nifi-like" processors directly from Python to put a file, insert into a database, pass it to a message queue, or take some other action.
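To show what I mean, here's a hypothetical sketch of the API I'm imagining. None of these classes or functions exist in any framework I know of; `tail_files`, `FlowContext`, and `put_database` are made-up names, and the fake payload is just there so the sketch runs.

```python
class FlowContext:
    """Carries the fetched payload plus metadata, like a Nifi flowfile."""
    def __init__(self, payload: str, attributes: dict):
        self.payload = payload
        self.attributes = attributes

def tail_files(pattern: str):
    """Imagined 'source processor': yield a FlowContext per new chunk of data.

    In the imagined framework this would watch files matching `pattern` on a
    schedule; here we fake a single chunk so the sketch is runnable.
    """
    yield FlowContext("id,value\n1,10\n", {"source": pattern})

def put_database(ctx: FlowContext, table: str) -> dict:
    """Imagined 'sink processor': deliver the (already transformed) payload."""
    return {"table": table, "rows": ctx.payload.count("\n") - 1}

def my_pipeline():
    results = []
    for ctx in tail_files("/var/logs/*.csv"):
        # Here I'd drop into pandas or plain Python for the messy transforms...
        ctx.payload = ctx.payload.upper()  # stand-in for real processing
        # ...then hand back to a "processor" for delivery.
        results.append(put_database(ctx, table="staging_events"))
    return results
```

The key property is that the framework owns sources and sinks (the tedious, repetitive part) while the transform in the middle stays ordinary, testable Python.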
Is there something like this that makes it easier from a programmatic standpoint? What are my options? Am I missing what I should be doing with Nifi for ingesting/processing/shipping data?
The only reasonable solutions I can think of using Nifi would be:
1. Use Nifi to pull the data, do as little processing/manipulation on the source data as possible, put it into a staging storage system, and then use Python (or some other language) to pull the data from the storage system, process it as needed, and insert it into a result storage system (database, processed CSV, etc.)
2. Use Nifi to pull the data, do basically no processing on it, send it over some message queue, consume it with Python, do the more complex manipulation of the data in an environment more appropriate for actual data processing, and then write that out to a permanent data store (db, etc.)
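Option 2 can be sketched with an in-process `queue.Queue` standing in for the real broker (Kafka, RabbitMQ, whatever Nifi would publish to); the message shape and the `amount` fix-up are assumptions for illustration, not from any real system.

```python
import json
import queue

# Stand-in for the real message broker that Nifi would publish to.
broker = queue.Queue()

def nifi_side_publish(raw_record: dict):
    """Nifi's job stays minimal: serialize the raw record and hand it to the queue."""
    broker.put(json.dumps(raw_record))

def python_side_consume() -> list[dict]:
    """Python side: drain the queue and do the complex manipulation here."""
    processed = []
    while not broker.empty():
        record = json.loads(broker.get())
        record["amount"] = float(record["amount"])  # example of a messy per-project fix
        processed.append(record)
    # ...then write `processed` out to the permanent store (db, csv, ...)
    return processed
```

The split keeps all the hard-to-foresee transform logic in one testable Python consumer, with Nifi reduced to transport.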
I'm struggling to find a robust, maintainable, readable, reusable (within reason), testable, programmable, organized process for doing pipeline/ETL/processing/data-shipping architecture. Any suggestions would be appreciated.