I'm looking for any suggestions for optimizing the speed of a bulk insert into an Oracle table.
In the flow, we are querying a Hive table then converting the results to JSON, then splitting that flowfile line by line using SplitText. The result is a flowfile for each row in the hive result set (1 flowfile ---> many flowfiles). Then, each flowfile is turned into a sql insert statement that inserts a single row into an Oracle table.
What I'm trying to figure out is a more efficient way of inserting into the Oracle table. I've tried manipulating the "batch size" and "concurrent tasks" properties on the PutSQL processor but this has had negligible effects on the performance.
Can anyone suggest an alternative approach or point out something that I may have overlooked?
Any help would be greatly appreciated!
I would probably use a MergeContent processor to create FlowFiles that have more than one line; convert these to some delimited format (like CSV). Then I would create a little wrapper shell script that takes its input and pipes it into Oracle SQL*Loader, and then use ExecuteStreamCommand to invoke that script. This is, of course, an Oracle specific approach.
Unless the SQL statement in the content of each of your FlowFiles is identical, a new transaction will be created for each FlowFile rather then multiple FlowFiles in a single transaction. This is because of how the putSQL is designed to do batch inserts.
The SQL command may use the ? to escape parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention sql.args.N.type and sql.args.N.value, where N is a positive integer. The sql.args.N.type is expected to be a number indicating the JDBC Type. The content of the FlowFile is expected to be in UTF-8 format.
So consider the case where you are inserting to a common DB table. In order to have every insert statement be exactly identical you will need to create attributes sql.args.N.type and sql.args.N.value for each column.
sql.args.1.type ---> 12
sql.args.1.value ---> bob
sql.args.2.type ---> 12
sql.args.2.value ---> smith
sql.args.3.type --> 12
sql.args.3.value --> SME
And so on.... You can use UpdateAttribute and maybe ExtractText processors to set these attributes on your FlowFiles.
The you can use a ReplaceText processor to replace the content of your FlowFile with a common INSERT statement like below:
INSERT INTO mydb("column1","column2","column3") VALUES(?,?,?)
Now every FlowFile will have identical content and batch inserts will work in a single transaction. The "?" are replaced by the values from the attributes sql.args.1.value. sql.args.2.value, sql.args.3.value, and so on....