Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Impala ODBC/JDBC bad performance - rows fetch is very slow from a remote server compared with NN

avatar
Master Collaborator

Hi,

In NameNode when I run the query via odbc script (php/perl or python), I can fetchAll results (9.2M) in a variable in about 30 seconds, but when I tried with the same script/query on another remote 2 servers, the execution time was in first server 28 min and in second 17 min.

To exclude the assuming that it's a network speed issue, I fetch the result on a file and then I copied it with scp command to the first remote server, and it finished in ~40 seconds.

What I observe in Query info in CM that there is a big different between the *Threads: Network Send Wait Time* values in the 3 queries :

in NN query: 9.40s
in 1st remote server: 16.7m
in 2nd remote server: 26.8m

And also I try an java script with Impala JDBC, but the results in NN are already not stimulate to continue.

But so far I cann't find where is the problem and how can I resolve it.

NB: I'm working on CDH 5.12.0/Impala 2.9.0, I installed Impala ODBC 2.5.37.1014.

 


Hope you can respode to me ASAP, because this issue is a realy an obstacle of using a cluster that we made a several months to make it.
Thanks in advance.

1 ACCEPTED SOLUTION

avatar
Master Collaborator

Hi all,

Finaly and after almost 6 months I have found the solution!

It was always about my 1024 limitition remark, the row batch limitation was from BATCH_SIZE max value (1024), in the last versions (CDH 5.14/Impala 2.11) we have a new effective range is 1-65536.
1-1024: https://www.cloudera.com/documentation/enterprise/5-12-x/topics/impala_batch_size.html
1-65536: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_batch_size.html

So when I increase it throgh a odbc.ini with SSP_BATCH_SIZE I can benifit from increasing the other odbc parameters (RowsFetchedPerBlock / TSaslTransportBufSize) and the rows can be fetched in a seconds (~45 secs) instead of tens of minutes.

Remark: I have been recreated the cluster in 3 different server providers and tested the connections from almost 5 others with different ODBC/JDBC releases etc.. and always I have the same slowness until this update came.
I can not understand why I'm the only on declared this big issue and why no one can answer me, kowning that it's realy depressed to have a good quering engine but a veeeery slow fetch rows!

Any way, thanks all for your replies.

View solution in original post

17 REPLIES 17

avatar

In Impala 2.11 we actually capped the max batch_size setting. Before that you could set it to an arbitrarily high value, which could have strange consequences. It's still a bit of a use-at-your-own-risk setting since it can have consequences for memory consumption and performance.

 

The real fix for this would be https://issues.apache.org/jira/browse/IMPALA-1618. Setting batch_size is just a workaround that may or may not work for you.

avatar
Master Collaborator

Hi @Tim Armstrong

While IMPALA-1618 steel open and unresolved, I confirmed that this "workaround" is safe and efficient (I'm using it on a large scope and during more than 9 months) so that this is the only solution I find to solve or -get around- this big problem.

Hope that the main problem will be fixed ASAP.
Thanks for the remark.

avatar
Explorer

Can you tell me the way to set the BATCH_SIZE for impala jdbc connection? I tried but it is not working for me.

avatar
Master Collaborator

Hi @Bishnup

ConfiguringServer-SideProperties
When connecting to a server that is running Impala 2.0 or later, you can use the driver to apply configuration properties to the server by setting the properties in the connection URL.
https://www.cloudera.com/documentation/other/connectors/impala-jdbc/latest/Cloudera-JDBC-Driver-for-...

Good luck.

avatar
Explorer

Hi @AcharkiMed

 

I tried setting the Batch size in the connection URL but I didn't get any performance boost in the query fetching time. I have posted my usecase in the cloudera forum. Kindly answer my questions :

 

 

avatar
Master Collaborator

Hi,

Please try to change all these 3 params:

TSaslTransportBufSize=4000;
RowsFetchedPerBlock=60536;
SSP_BATCH_SIZE=60536;

avatar
Explorer

Hi @AcharkiMed

 

As you suggested me to set 

TSaslTransportBufSize=4000;
RowsFetchedPerBlock=60536;
SSP_BATCH_SIZE=60536;

in the connection URL. I did the changes but i am getting these errors

java.sql.SQLException: [Simba][ImpalaJDBCDriver](500051) ERROR processing query/ statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, sqlState:H Y000, errorMessage:Invalid query option: SSP_BATCH_SIZE
), Query: SET SSP_BATCH_SIZE=60536.
        at com.cloudera.hivecommon.api.HS2Client.executeStatementInternal(Unknow n Source) ~[Impala-JDBC-41-1.0.0.jar!/:na]

 and 

java.sql.SQLException: [Simba][ImpalaJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, sqlState:HY000, errorMessage:Invalid query option: TSaslTransportBufSize
), Query: SET TSaslTransportBufSize=4000.

Help me set up the property.

 

Thank You,

Bishnu

avatar
Master Collaborator

Hi @Bishnup

If you still have the same problem please try to share with us your URL string.