ExecuteSQL query is very slow

Hello guys,

I have the below NiFi flow, based on the thread

https://community.hortonworks.com/questions/70562/executesql-dynamic-query.html#comment-71349

The ExecuteSQL processor is executing a very simple SELECT query, which returns results very quickly when run from the command line.

I don't know why a huge queue keeps building up in front of ExecuteSQL. I tried increasing the number of concurrent threads, but the output is still slow.

Any suggestions on how to improve this? Bear in mind that I have another process group that executes an INSERT query, and it runs smoothly and fast.

10300-view.png

Note that the queue before ExecuteSQL is full. I also got heap size errors when I increased the ExecuteSQL threads to 16.

I attached the flow as well

Thanks forum

Accepted Solution

Re: ExecuteSQL query is very slow

Super Guru

The approach in the other thread is very inefficient for this use case. You're basically trying to do a join between rows in a file and rows in a DB table. An alternative is to populate a DistributedMapCacheServer from the DB table, then look up those values in a separate flow.

To populate the map, you could do something like this:

10276-populatelookuptable.png

Here I am using QueryDatabaseTable with a Max Value Column of "id" such that the map will only be populated once. But if you are adding entries to the lookup table (as it appears you might be from your description) or if new entries will not have strictly greater values for "id", then you can remove the Max Value Column property and schedule the QueryDatabaseTable processor to run as often as you'd like to refresh the values.
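
To make the Max Value Column behaviour concrete, here is a minimal Python sketch of the idea: remember the largest "id" seen so far and only fetch rows above it on the next run. This is not how NiFi implements QueryDatabaseTable; the `customers` table, its `id` and `name` columns, and the `fetch_new_rows` helper are hypothetical, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3


def fetch_new_rows(conn, last_max_id):
    """Return rows with id greater than last_max_id, plus the updated maximum.

    Passing last_max_id=None fetches the whole table (the first run).
    """
    if last_max_id is None:
        cursor = conn.execute("SELECT id, name FROM customers ORDER BY id")
    else:
        cursor = conn.execute(
            "SELECT id, name FROM customers WHERE id > ? ORDER BY id",
            (last_max_id,),
        )
    rows = cursor.fetchall()
    # Keep the old maximum when nothing new came back.
    new_max = rows[-1][0] if rows else last_max_id
    return rows, new_max


# Hypothetical lookup table with two initial entries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)

first_run, max_id = fetch_new_rows(conn, None)     # whole table
second_run, max_id = fetch_new_rows(conn, max_id)  # nothing new
conn.execute("INSERT INTO customers VALUES (3, 'Initech')")
third_run, max_id = fetch_new_rows(conn, max_id)   # only the new row
```

The sketch also shows why the property has to go if new entries can arrive with an "id" that is not strictly greater than the stored maximum: such rows would never satisfy the `id > ?` filter.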

Once this flow is running, you can start a different flow that is similar to the one in the other thread, but instead of querying the DB for each row in the file, it will fetch from the DistributedMapCacheServer, which is hopefully faster:

10277-performlookup.png

You can see that the first part is the same as the flow in the other thread. But instead of using ReplaceText to generate SQL to execute, the value is simply looked up from the map and put into an attribute. The final ReplaceText is then like the one in the other thread, specifying "${column.1},${column.2},${column.3},${column.4}, ${customer.name}" or whatever the appropriate attributes are. I have attached a template (databaselookupexample.xml) showing these two flows.
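
Outside of NiFi, the same pattern (load the lookup table once, then resolve each file row from memory instead of issuing one query per row) can be sketched roughly as follows. A plain Python dict stands in for the DistributedMapCacheServer, and the `customers` table, the sample lines, and the choice of the third column as the lookup key are all assumptions made for illustration.

```python
import sqlite3


def build_lookup(conn):
    """Read the whole lookup table once into a dict keyed by id (as text)."""
    query = "SELECT id, name FROM customers"
    return {str(cid): name for cid, name in conn.execute(query)}


def enrich(lines, lookup, key_index=2):
    """Append the looked-up name to each comma-separated line.

    key_index=2 means the third column holds the key (an assumption).
    Lines whose key is missing from the lookup get an empty value.
    """
    enriched = []
    for line in lines:
        columns = line.split(",")
        name = lookup.get(columns[key_index], "")
        enriched.append(",".join(columns + [name]))
    return enriched


# Hypothetical lookup table held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)

lookup = build_lookup(conn)                  # one query in total
lines = ["a,b,1,d", "e,f,2,h", "i,j,9,l"]    # stand-in for rows in the file
result = enrich(lines, lookup)               # no further DB access
```

The database is queried once regardless of how many lines the file has, which is the point of replacing the per-row ExecuteSQL call with a cache fetch.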

Re: ExecuteSQL query is very slow

FetchDistributedMapCache is still not fast, even though the table that was fetched has only 500 rows.

Any suggestions?

Also:

How can I reset the DistributedMapCacheServer content?

I'm using FetchDistributedMapCache to compare ${column.3} with the DistributedMapCacheServer content. Is it possible to use the same processor to compare ${column.4} as well?

Thank you @Matt Burgess

Re: ExecuteSQL query is very slow

New Contributor
The prefetch (fetch size) option of JDBC can improve your performance a lot. You can add this option as a dynamic property on the connection pool. My performance improved by a factor of 20 when I raised it to 2000. Some JDBC drivers (Oracle's, for example) default to fetching 10 rows at a time when you don't specify a value.
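
As a rough illustration of why the fetch size matters, the number of round trips to the database is about the row count divided by the fetch size. The Python sketch below only does that arithmetic; the 100,000-row result set is a made-up figure, and fewer round trips do not translate one-to-one into wall-clock speedup.

```python
import math


def round_trips(total_rows, fetch_size):
    """Approximate number of fetch round trips needed to read total_rows."""
    return math.ceil(total_rows / fetch_size)


rows = 100_000  # hypothetical result-set size

default_trips = round_trips(rows, 10)    # fetch size of 10
tuned_trips = round_trips(rows, 2000)    # fetch size raised to 2000
```

With these assumed numbers, the fetch count drops from 10,000 to 50, so a slow network link between NiFi and the database stops dominating the query time.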