Hi, I have a basic etl flow (screenshot below) where i'm trying to start loading a full table from db to another, and then incrementally load any changes to the table based on an updated date. It runs, but part of it is slow. I think the converting json to sql is a bottle neck, as the table has 2M+ records, and they're queued up, but it can only create insert statements so fast, and then inserts to redshift (the target) are not the fastest.
If you look closely at screenshot below right before the convertjsontosql, can see the queue of 3gb (which is about the size of the table), waiting to be converted I believe. Read some of the articles, and set the translate field names to false. also set ignore unmatched columns and fields.
So the initial loading of data is a bit slow. Any suggestions, workarounds, ideas are appeciated to have a process that i know i can start from scratch, load a table fully and then keep loading? or maybe I do have to do one initial large load somehow, and then start the incremental process? I know copy commands in redshift are way better, perhaps i have to do that first, and then start incremental?
You can replace everything from SplitAvro -> PutSQL with PutDatabaseRecord, that should give you a pretty good speedup as it takes the Avro directly in, generates the PreparedStatement once, and then does batch inserts for the entire incoming file.
As of NiFi 1.6.0 (via NIFI-4836), if you don't care about the maxvalue.* and fragment.count attributes, you can also set the Output Batch Size property, which will send out X flow files immediately when they are ready, versus keeping them all in the session until all rows are processed and then sending all downstream. This allows you to start processing rows downstream while QueryDatabaseTable is still processing rows from the result set, which comes in handy on the initial load of 2M+ records.
Hey @Matt Burgess, i made a new process group with this, but forgot to use the output batch size property. so i stopped it, changed, and am trying to clear queue and restart, but its stuck in the queue. I've had this happen a few times. any suggestions here? i can't stop, play, terminate or anything.
When i do hit empty, it says zero out of 3 flowfiles were removed, like it can't remove them. The processor after the queue has no option to start it. Thanks
Hi, nevermind on that, i just created a new process group and tried again. its running now, but still seems slow. the putdbrecord has 3 pending actions, but no other stats yet (eg, in/out). 1.79gb pending in the queue.
I set my output batch size of querydbtable to 5,000. i also set fetch size limit and max rows per flowfile to 5k too. does that make sense? thanks
Yes that's correct, otherwise all nodes could grab the same data. If you want to distribute the fetching among the nodes in your cluster, use GenerateTableFetch (on the primary node only still) -> RPG -> Input Port (on your same cluster) -> ExecuteSQL. GTF will not execute the SQL like QueryDatabaseTable does, it just generates the statements to be executed. The RPG -> Input Port distributes the flow files (containing SQL statements) among the nodes in the cluster, then each ExecuteSQL takes the ones it gets and actually does the fetch. Note that you can't merge the results back together on a single node once they've been distributed, but that's usually a good thing, as you can often put the results to a target (such as another DB) in a distributed fashion.
I actually think the pull is not my problem. its inserts to redshift, which are slow. think i need to put this data in files, and do copy commands to redshift. guess i need to move to saving this data in files, and then copy/load.
Or i guess i can manually load tables, and then set an initial max value column, to have the nifi job start from there, and continue updating. These large tables are challenging.