
How to split the output flow file from ExecuteSQL (query performs joins)?

Explorer

Hello.
I'm using an ExecuteSQL processor to extract data from an Oracle DB. The query has multiple joins and returns a large number of fields.

The problem is that the ExecuteSQL processor returns a single flow file (Avro format) that is huge.

I want to split the flow file (based on a number of rows, for example) and then merge the pieces at the proper moment.

I have read about QueryDatabaseTable and GenerateTableFetch, which are processors that can split their output into multiple flow files, but it looks like these processors aren't able to perform joins.


Does anyone know a workaround?

1 ACCEPTED SOLUTION

Explorer

Maybe I have found a solution.

I'm going to use ExecuteSQL to run an "INSERT ... SELECT" query. The query will perform the joins and load the data into a table. Then QueryDatabaseTable will read from the new table. That way I'll be able to use the "Max Rows Per Flow File" property.
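
Something along these lines (a rough sketch only; staging_table, orders, customers and the join keys are invented placeholders, not my real schema):

-- Run by ExecuteSQL: materialize the joined result into a plain table.
INSERT INTO staging_table (order_id, customer_name, order_date, amount)
SELECT o.order_id,
       c.customer_name,
       o.order_date,
       o.amount
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id;

-- QueryDatabaseTable then reads staging_table and can emit many small
-- flow files via its "Max Rows Per Flow File" property.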


9 REPLIES

Expert Contributor

The ExecuteSQL processor has a property named "Max Rows Per Flow File".

There you can set how many rows each flow file should contain, and later you can merge them like you wanted, because the flow files get fragment attributes.

Explorer

There is no such option on ExecuteSQL.

[screenshot: CaioFalco_0-1599164411538.png]


Perhaps you're using a newer version of NiFi; mine is 1.5.0.3.1.2.0-7.

Contributor

You could use ListDatabaseTables and GenerateTableFetch to perform an incremental load. If you are joining tables, you can put a ReplaceText after GenerateTableFetch to add the join to the query and then feed the flow file to ExecuteSQL (sketched below). You can control how much data goes into each generated query in GenerateTableFetch.

OR 

You can use SplitRecord / SplitContent to split the single Avro file into multiple smaller files and then use MergeContent to merge them back if required.
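
For the first option, the rewrite would look roughly like this (a sketch only: the exact paging syntax GenerateTableFetch emits depends on the Database Type you configure, this assumes Oracle 12+ style OFFSET/FETCH, and the table and column names are invented):

-- The shape of what GenerateTableFetch typically emits for one page:
SELECT * FROM orders
WHERE  order_id <= 50000
ORDER BY order_id
OFFSET 10000 ROWS FETCH NEXT 10000 ROWS ONLY;

-- What ReplaceText would turn it into before ExecuteSQL runs it:
-- the same paging clause, but with the join added.
SELECT o.*, c.customer_name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
WHERE  o.order_id <= 50000
ORDER BY o.order_id
OFFSET 10000 ROWS FETCH NEXT 10000 ROWS ONLY;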

Hope this helps. If this comment helps you find a solution or move forward, please accept it as a solution for other community members.

Explorer

I don't know if these solutions work for me.

What I really want to do is make ExecuteSQL work like (for example) SelectHiveQL, which means ExecuteSQL takes only one incoming flow file and sends forward multiple flow files that can be merged.


My real problem is that the ExecuteSQL query sometimes needs to produce a flow file whose size is too large for the edge machine to process, which ends in an error. So I need to split the flow file to decrease the pressure on the edge machine.
I've been through the same situation with Hive queries, but there I solved it using the "Max Rows Per Flow File" property and then merging the flow files.

Contributor

Why not use the second option I mentioned above? Use SplitContent or SplitRecord and then merge later whenever you want.

Explorer

My point is that the error occurs while ExecuteSQL is running. The cause is that ExecuteSQL needs to create a huge flow file that the edge machine does not have enough processing power to create. Your solutions look good, but they would split the flow file after it was produced by ExecuteSQL; they take action at a moment when the error has already occurred.

Contributor

Then I'm afraid it's difficult to do so. I don't understand how you are feeding the queries to ExecuteSQL; maybe it's best to feed ExecuteSQL manageable queries. If you use GenerateTableFetch, it allows you to break a big query into smaller queries, like you want, and feed them to ExecuteSQL. Hope this helps. Please do post back on how you managed to move forward.

Explorer

The process is triggered by a GetSFTP that retrieves a date file. Then the query uses that date for filtering.
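
For context, the query I feed ExecuteSQL looks roughly like this (a sketch: the tables are placeholders, and ${extract.date} is a made-up attribute I would populate from the SFTP file's content, e.g. with ExtractText, before ExecuteSQL evaluates it):

-- ExecuteSQL's "SQL select query" property supports Expression Language,
-- so the date read from the retrieved file can be injected as an attribute.
SELECT o.*, c.customer_name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
WHERE  o.order_date = TO_DATE('${extract.date}', 'YYYY-MM-DD');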

Thanks for your contribution. As soon as I make progress I'll update you guys.
