Hello. I am new to NiFi and Spark, specifically as they relate to this request, so be kind :) .
I have dozens of files every week with 1-2 million rows each. The rows in the files are sometimes new and sometimes updates of previous records (with no unique key, but a compound key can be created from multiple columns). I have this situation for 3 groups of files, which, at the end of consumption, are joined together into 1 large table. For the sake of this post, let's call those groups of files SOLD, PLACED, and COMPLETE, each group having multiple files with millions of rows. I do not have any control over the build/structure (column names/order) of the inbound files. They are from a 3rd party.
Within each group, the file columns are frequently not consistently named, or not in the same order, but they always contain the same fields contextually (meaning the column may be named wrong, but the data in the column is correct).
Using NiFi I can move the files from FTP to S3 easily. Planning to "upsert" the rows of each group into their own table and then join the 3 tables at the end, I started with a Redshift COPY command to ingest the files into Redshift. This works until the columns of a file are out of order or incomplete. I've tried branching on "retry" in NiFi to adapt to each file, but the variations are too many to manage that way.
I need a way to populate a complete "object" prior to the "upsert", in which I can match the varying inbound column names against a fixed schema using regex.
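To make the idea concrete, here is a minimal sketch of that regex-to-fixed-schema mapping in plain Python. The canonical column names and the name variants in the patterns are hypothetical, just stand-ins for whatever the 3rd-party files actually send:

```python
import re

# Hypothetical canonical schema: each target column maps to a regex that
# matches the variant header names seen in the inbound files.
CANONICAL = {
    "sale_date":   re.compile(r"^(sale[_ ]?date|date[_ ]?sold|sold[_ ]?dt)$", re.I),
    "customer_id": re.compile(r"^(cust(omer)?[_ ]?id|client[_ ]?id)$", re.I),
    "amount":      re.compile(r"^(amount|amt|sale[_ ]?amount)$", re.I),
}

def normalize_header(columns):
    """Map each inbound column name to its canonical name (None if unknown)."""
    mapping = {}
    for col in columns:
        for canonical, pattern in CANONICAL.items():
            if pattern.match(col.strip()):
                mapping[col] = canonical
                break
        else:
            mapping[col] = None  # unrecognized column; route it for review
    return mapping
```

The same mapping could then drive a rename step before loading (e.g. `withColumnRenamed` on a Spark DataFrame, or rewriting the CSV header before the Redshift COPY), so every file arrives at the upsert with one fixed schema regardless of how its columns were named or ordered.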
I was also thinking that perhaps I could eliminate Redshift entirely: "upsert" the files with Spark, join the 3 groups with Spark, and save the output to S3 as 1 large file.
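The upsert logic itself is simple once the columns are normalized: build the compound key from the chosen columns and let the newest row win. A minimal plain-Python sketch (rows as dicts; the key columns here are hypothetical, and in Spark the equivalent would be a window/dedupe over the key or a Delta Lake `MERGE`):

```python
def upsert(existing, incoming, key_cols):
    """Merge incoming rows into existing rows, last-write-wins on the compound key."""
    table = {tuple(row[c] for c in key_cols): row for row in existing}
    for row in incoming:
        # Same compound key replaces the old row; a new key is appended.
        table[tuple(row[c] for c in key_cols)] = row
    return list(table.values())
```

For example, with `key_cols=["customer_id", "sale_date"]`, an incoming row carrying an existing (customer_id, sale_date) pair overwrites the earlier version, while rows with unseen key pairs are added as new records.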
Any suggestions on how you would tackle this problem as efficiently as possible?