Dear all, I've been trying to find a way to process CSV files in our staging folder in Hadoop on a regular basis (ideally with Oozie). I currently have a shell script that just processes files one by one, but there has to be a smarter and faster way to process multiple files at the same time, right? Something with Oozie (and Falcon)?
We work with the Hue UI for Oozie (Oozie Editor), but in there I don't see a way to split the processing of different files in one folder (recursively), or at least I have no clue how to set the decision parameters.
The process flow, at a high level, looks like this: CSV files arrive in our staging folder (no split per hour/day/month or the like, but per data provider). A file should be processed only if it follows a certain file name convention. A script performs some basic checks, and after those checks pass I would like to move the file into the external (partitioned) staging table folder, from where we will create an ORC partition. There we perform some other checks before we transfer the partition to the target schema/table.
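To make the question concrete, here is a minimal sketch of the check-and-move step. The naming convention (`<provider>_<YYYYMMDD>.csv`), the `/staging` and `/staging_ext` paths, and the `provider=` partition layout are all assumptions for illustration, not our actual setup; the HDFS loop is commented out because it needs a live cluster.

```shell
#!/bin/sh
# Hypothetical naming convention: <provider>_<YYYYMMDD>.csv
matches_convention() {
  case "$1" in
    [a-z]*_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].csv) return 0 ;;
    *) return 1 ;;
  esac
}

# Sketch of the check-and-move loop (commented out: needs a live HDFS):
# hdfs dfs -ls -C /staging | while read -r path; do
#   name=$(basename "$path")
#   if matches_convention "$name"; then
#     provider=${name%%_*}                       # provider prefix from the file name
#     hdfs dfs -mkdir -p "/staging_ext/provider=$provider"
#     hdfs dfs -mv "$path" "/staging_ext/provider=$provider/"
#   fi
# done
```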
Can someone point me in the right direction?
You will not like to hear this, but no, there is no automatic better way. Since your files each have a different format and need to end up in different data streams, you need some kind of script that does this split. Pretty much all Hadoop processes work at the folder level, so multiple files in one folder belong to a single data stream as far as Oozie/Falcon/Pig/MapReduce/Hive/... are concerned.
What you can do is write an Oozie job that runs a shell script to move every file into a per-provider folder, and then have one Oozie coordinator for each of them. That would be the standard approach.
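For reference, a coordinator in that standard approach would look roughly like the fragment below. The app name, frequency, dates, and workflow path are placeholders, not values from this thread; you would create one such coordinator per provider folder.

```xml
<coordinator-app name="provider-x-feed"
                 frequency="${coord:minutes(15)}"
                 start="2016-01-01T00:00Z" end="2099-01-01T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- Workflow that processes the files for this one provider -->
      <app-path>${nameNode}/apps/provider-x-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
```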
However, if you want this to be more dynamic (i.e., the jobs should pick up the file name and deduce the Hive table name etc. from it), then there is no built-in way to do this; you need some kind of script, in Python or similar.
There might be some way to use NiFi to read files from a local system and use the filename metadata attribute of the FlowFile to influence the processing later, but not with Oozie/Falcon.
Hi Benjamin, thanks for the info. Unfortunately it's not possible to set a coordinator on those data folders, because we will be adding more data providers (and thus more folders) later in the project, so the solution should be able to handle these new providers automatically without a new setup on our side.
We are now leaning towards the idea of setting up one coordinator that starts a shell script to poll the staging folder. All files that pass our tests will be moved to the partition folder, and then the shell script invokes a new workflow to process those files. I think that way we can have several workflows running in parallel. I'm going to try some things out in the coming week. Once we've decided on our path forward I will share it here.
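The "one polling script submits one workflow per file" idea could be sketched as below. The `make_props` helper, the `/staging` path, the property names, and the Oozie URL are all hypothetical; the submission loop is commented out because it needs a live cluster.

```shell
#!/bin/sh
# Hypothetical: build per-file job.properties contents for a workflow run,
# deriving the provider from the file name prefix (<provider>_...).
make_props() {
  file="$1"
  provider=$(basename "$file")
  provider=${provider%%_*}
  printf 'inputFile=%s\nprovider=%s\n' "$file" "$provider"
}

# Polling loop (commented out: needs HDFS and an Oozie server):
# hdfs dfs -ls -C /staging | while read -r f; do
#   make_props "$f" > /tmp/job.properties
#   # Each submission starts an independent workflow, so files run in parallel.
#   oozie job -oozie http://oozie-host:11000/oozie -config /tmp/job.properties -run
# done
```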
We have looked at NiFi, but it's currently not in our production platform yet and we can't wait for it to arrive. I also thought NiFi was more for getting files into Hadoop, not for moving files around or processing them within Hadoop itself?