
Realistic data ingestion, processing (ETLing) architecture


I'm struggling with setting up data ingestion/ETL/processing pipelines and architectures.

I've had MANY tasks over the past few years that deal with ETLing data and getting it into a usable form, and I'm struggling to set up a reasonable process that is general enough, easy enough, and reusable enough to handle the tasks I throw at it. These projects are normally completely separate and have NO overlap (project 1 does not necessarily have anything to do with project 2 and can be completely unrelated, with a completely different set of circumstances, restrictions, etc.).

In the past, for most of these projects, I've written an application (normally in Python) to do everything: pull the files from their source (XML, CSV, other databases), process the data accordingly, and push it to the respective destinations. This works, and while writing everything in Python, from pulling the data, to processing it, to pushing it to the destination, is tedious at times, it provides an EXTREMELY high level of flexibility to handle all of the "issues" that arise once you're actually in the middle of the data, the ones you can't foresee when someone says "hey, can you capture this data and make it available for us?". I have a lot of confidence that with Python and its libraries I can handle almost any exception without hacks and tons of inefficient work.

So take this work as an example.

I have a CSV that is messy. A few of the "translations" that need to take place:

  1. You have to extract dates from the filename and prepend that date to all of the rows of the CSV.
  2. Do some massaging of each record to "fix" each record's "string" issue (each record has a string field with commas inside the string itself). The only constant is that I know there are 25 "columns" in the CSV and the 26th is the string, which can contain multiple commas.
  3. Parse out the content of the string itself. Not all records are the same: each record may have some type of "key" in one of the fields, so I have to look for that key in every record, and if it's present, extract/parse information out of that field and store it in the database. This isn't conceptually complex, but the extraction example I have to work from is about 300 lines of MATLAB code that I'm basically going to be reimplementing to get what they want, and it has a decent amount of logic. (A rough Python sketch of all three steps is right after this list.)
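
To make #1-#3 concrete, here is a rough, untested Python sketch of the kind of logic I mean. The filename date pattern, the "KEY=" marker, and the key extraction are placeholders standing in for my real rules (the actual extraction is the ~300 lines of MATLAB mentioned above):

import re

NUM_FIXED_COLS = 25  # real columns before the free-text string field

def extract_date(filename):
    # placeholder: assume the date is embedded in the filename as YYYY-MM-DD
    m = re.search(r"\d{4}-\d{2}-\d{2}", filename)
    return m.group(0) if m else ""

def parse_string_field(text):
    # stand-in for the ~300 lines of MATLAB logic: look for a "key"
    # in the free-text field and pull structured values out of it
    if "KEY=" in text:
        tail = text.split("KEY=", 1)[1].split()
        if tail:
            return {"key_value": tail[0]}
    return {}

def process_file(path):
    file_date = extract_date(path)  # step 1: date from the filename
    with open(path, newline="") as fh:
        for raw in fh:
            parts = raw.rstrip("\n").split(",")
            # step 2: the first 25 fields are real columns; everything after
            # that is the string field, so re-join the extra commas
            fixed = parts[:NUM_FIXED_COLS]
            string_field = ",".join(parts[NUM_FIXED_COLS:])
            # step 3: pull structured info out of the string field
            extras = parse_string_field(string_field)
            yield [file_date] + fixed + [string_field], extras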

Using NiFi, Shu (much appreciated) from my example thread was able to help me do #1 and #2. But #3 (which I didn't ask about) is not going to be feasible at all. Sure, I can call an external processing script from NiFi, but then I end up with all of these "random" scripts inside of NiFi, each handling a different part of the process, which makes things difficult to test and maintain.

It almost feels like I need something of a "programmable" NiFi library/framework. I could tell the framework to simply "go tail these sets of files with X parameters" on this schedule; when it runs, I get the context of that information back, similar to NiFi, but available directly in Python (it could be any language, Python is just an example), where I can use pandas or other libraries to manipulate the data as I see fit, do my processing on it, and then call other "NiFi-like" processors directly from Python to put a file, insert into a database, pass it to a message queue, or perform some other action.
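
Something like this completely made-up API is what I'm picturing; none of these modules or classes exist, it's only to illustrate the shape of the thing:

# imaginary module and classes: nothing here exists, this is only to
# illustrate the kind of "programmable NiFi" I'm imagining
from imaginary_pipeline import Pipeline, TailFile, PutDatabaseRecord

import pandas as pd

pipe = Pipeline()

@pipe.source(TailFile(path="/data/incoming/*.csv"), schedule="*/5 * * * *")
def handle_batch(flowfile):
    # I get the "flowfile" content and attributes directly in Python...
    df = pd.read_csv(flowfile.content)
    df["ingest_date"] = flowfile.attributes["file.date"]
    # ...manipulate it with pandas or anything else I want...
    # ...and then hand it off to NiFi-like sinks from code
    pipe.send(PutDatabaseRecord(table="results"), df)

pipe.run()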

Is there something like this that makes it easier in a programmatic sense? What are my options? Am I missing how I should be using NiFi for ingesting/processing/shipping data?

The only reasonable solutions I can think of using NiFi would be:

1. Use NiFi to pull the data, do as little processing/manipulation on the source data as possible, put it into a staging storage system, and then use Python (or some other language) to pull the data from the staging system, process it as needed, and insert it back into a result storage system (database, processed CSV, etc.). (I've sketched the Python side of this below, after option 2.)

2. Use NiFi to pull the data, do basically no processing on it, send it over some message queue, consume it with Python, do the more complex manipulation of the data in an environment better suited to actual data processing, and then write the result out to a permanent data store (DB, etc.). (Rough sketch of the consumer side below as well.)
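
For option 1, the Python half might look something like this (assuming NiFi lands the raw CSVs in a staging directory and the result store is a SQL database reachable with SQLAlchemy; the paths, DSN, table name, and transform are placeholders):

import glob

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://etl:etl@dbhost/results")  # placeholder DSN

def transform(df):
    # the real processing (date prepending, string fixing, key extraction) goes here
    df["processed"] = True
    return df

for path in glob.glob("/staging/incoming/*.csv"):  # placeholder staging directory
    df = pd.read_csv(path)
    transform(df).to_sql("processed_records", engine, if_exists="append", index=False)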
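
For option 2, the consumer side might look something like this (assuming the queue is Kafka and I consume it with the kafka-python package; the topic, brokers, and actual processing are placeholders):

import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "raw-records",                        # placeholder topic NiFi publishes to
    bootstrap_servers=["broker1:9092"],   # placeholder brokers
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # the more complex manipulation happens here, in normal Python,
    # before the result goes to the permanent data store
    processed = dict(record, processed=True)
    # write_to_database(processed)  # e.g. a SQLAlchemy insert, as in option 1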

I'm struggling to find a robust, maintainable, readable, reusable (within reason), testable, programmable, organized process for pipeline/ETL/processing/data-shipping architecture. Any suggestions would be appreciated.