Data processing pipeline tools and architecture for unstructured non-standard file types

I am looking for suggestions on how to avoid a script-soup mess in setting up a pipeline that sends files through a series of processing steps. The unstructured files in my case are binary files, specifically proprietary CAD files.

A 30,000-foot view of the current situation:

The CAD files are tracked in a PostgreSQL database: tables hold all the metadata plus a file pointer to the actual file on the file system. What this pipeline needs to do is:

1. Read each record and do a conversion to a standard (non-proprietary) format (we already have a tool that does the actual file conversion).

2. Insert a record into another table with references to the converted part on the file system along with all of its metadata (filename, hash, conversion log, etc.).

3. Take each part record and run CAD extraction on the data (again, the tools to extract the data we want are already built; they just need to be called through the pipeline).

4. Insert the extracted data into the database (keeping all references intact through the process).

5. Lots of other extractions/conversions/calculations that I won't go into. Once I can get an architecture/pattern working, I feel adding on to this will be fairly trivial. (A rough sketch of steps 1 and 2 as they look today is below.)
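To make the shape of this concrete, here is roughly what steps 1 and 2 look like as plain Python today. This is just a sketch: psycopg2 for the database access, and the converter executable, table names, and columns below are placeholders, not our real schema.

import hashlib
import subprocess
import psycopg2

def convert_pending_files(conn):
    # Step 1: read each source record that still needs converting.
    with conn.cursor() as cur:
        cur.execute("SELECT id, file_path FROM cad_files WHERE converted IS NOT TRUE")
        rows = cur.fetchall()

    for file_id, file_path in rows:
        out_path = file_path + ".step"  # converted copy written next to the original
        # Call our existing conversion tool (executable name here is a placeholder).
        result = subprocess.run(
            ["convert_cad.exe", file_path, out_path],
            capture_output=True, text=True, check=True,
        )
        with open(out_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        # Step 2: record the converted part plus its metadata, keeping the
        # reference back to the source row.
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO converted_parts (source_id, path, sha256, conversion_log) "
                "VALUES (%s, %s, %s, %s)",
                (file_id, out_path, digest, result.stdout),
            )
            cur.execute("UPDATE cad_files SET converted = TRUE WHERE id = %s", (file_id,))
        conn.commit()

conn = psycopg2.connect("dbname=parts user=pipeline")  # connection string is a placeholder
convert_pending_files(conn)

It works, but every additional step becomes another loop like this plus its own bookkeeping about what has and hasn't been processed, which is exactly the soup I'm trying to avoid.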

We are doing all of this in Python, and unfortunately it has to run on Windows.

I built a pipeline with Luigi to do the conversion before we had the database, and it worked, but after about 5-10k tasks it started to choke under its own weight. Once we started putting the metadata and related records into the database, I found that it's a LOT harder to write Luigi tasks when the state lives in a database.
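For reference, a single conversion task in that Luigi pipeline looked roughly like this (simplified; the converter executable and paths are placeholders):

import subprocess
import luigi

class ConvertCadFile(luigi.Task):
    # One task instance per CAD file.
    source_path = luigi.Parameter()
    target_path = luigi.Parameter()

    def output(self):
        # Luigi decides whether a task is complete by checking this target.
        # With files on disk that is easy; once "done" means a row exists in
        # PostgreSQL, you end up writing custom Targets, which is where it
        # started to feel awkward for us.
        return luigi.LocalTarget(self.target_path)

    def run(self):
        # Placeholder call to our existing converter.
        subprocess.run(["convert_cad.exe", self.source_path, self.target_path], check=True)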

I don't think NiFi gives me any benefit. Would something like Kafka provide the structure for the kind of pipeline I'm after? After reading about it, though, it seems like I could still end up with a bit of script soup.

For example, for step 1 I was thinking of having a producer send each record to a Kafka "step 1" topic. On the other end, I could have a Python script as the consumer (is there something that could be used to manage these scripts?), reading the topic and processing each record as it hits the queue. From there it's just sending messages, with producers and consumers processing items on their respective topics. Something like the sketch below.
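Here is a rough sketch of what I mean for step 1, assuming the kafka-python client (the topic name, broker address, and payload fields are made up):

import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "cad-convert-step-1"

# Producer side: walk the source table and emit one message per record.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)
producer.send(TOPIC, {"file_id": 42, "file_path": r"C:\cad\part42.prt"})
producer.flush()

# Consumer side: a long-running Python worker that converts each file as
# its message arrives, then would publish a result to a "step 2" topic.
# This is the script I'd like something to manage/supervise for me.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    group_id="cad-converters",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    record = message.value
    print("would convert", record["file_path"])  # call the existing conversion tool here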

Would there be a better way of doing this? Managing this?
