Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Impala ODBC/JDBC bad performance - rows fetch is very slow from a remote server compared with NN

avatar
Master Collaborator

Hi,

In NameNode when I run the query via odbc script (php/perl or python), I can fetchAll results (9.2M) in a variable in about 30 seconds, but when I tried with the same script/query on another remote 2 servers, the execution time was in first server 28 min and in second 17 min.

To exclude the assuming that it's a network speed issue, I fetch the result on a file and then I copied it with scp command to the first remote server, and it finished in ~40 seconds.

What I observe in Query info in CM that there is a big different between the *Threads: Network Send Wait Time* values in the 3 queries :

in NN query: 9.40s
in 1st remote server: 16.7m
in 2nd remote server: 26.8m

And also I try an java script with Impala JDBC, but the results in NN are already not stimulate to continue.

But so far I cann't find where is the problem and how can I resolve it.

NB: I'm working on CDH 5.12.0/Impala 2.9.0, I installed Impala ODBC 2.5.37.1014.

 


Hope you can respode to me ASAP, because this issue is a realy an obstacle of using a cluster that we made a several months to make it.
Thanks in advance.

1 ACCEPTED SOLUTION

avatar
Master Collaborator

Hi all,

Finaly and after almost 6 months I have found the solution!

It was always about my 1024 limitition remark, the row batch limitation was from BATCH_SIZE max value (1024), in the last versions (CDH 5.14/Impala 2.11) we have a new effective range is 1-65536.
1-1024: https://www.cloudera.com/documentation/enterprise/5-12-x/topics/impala_batch_size.html
1-65536: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_batch_size.html

So when I increase it throgh a odbc.ini with SSP_BATCH_SIZE I can benifit from increasing the other odbc parameters (RowsFetchedPerBlock / TSaslTransportBufSize) and the rows can be fetched in a seconds (~45 secs) instead of tens of minutes.

Remark: I have been recreated the cluster in 3 different server providers and tested the connections from almost 5 others with different ODBC/JDBC releases etc.. and always I have the same slowness until this update came.
I can not understand why I'm the only on declared this big issue and why no one can answer me, kowning that it's realy depressed to have a good quering engine but a veeeery slow fetch rows!

Any way, thanks all for your replies.

View solution in original post

17 REPLIES 17

avatar

In Impala 2.11 we actually capped the max batch_size setting. Before that you could set it to an arbitrarily high value, which could have strange consequences. It's still a bit of a use-at-your-own-risk setting since it can have consequences for memory consumption and performance.

 

The real fix for this would be https://issues.apache.org/jira/browse/IMPALA-1618. Setting batch_size is just a workaround that may or may not work for you.

avatar
Master Collaborator

Hi @Tim Armstrong

While IMPALA-1618 steel open and unresolved, I confirmed that this "workaround" is safe and efficient (I'm using it on a large scope and during more than 9 months) so that this is the only solution I find to solve or -get around- this big problem.

Hope that the main problem will be fixed ASAP.
Thanks for the remark.

avatar
Explorer

Can you tell me the way to set the BATCH_SIZE for impala jdbc connection? I tried but it is not working for me.

avatar
Master Collaborator

Hi @Bishnup

ConfiguringServer-SideProperties
When connecting to a server that is running Impala 2.0 or later, you can use the driver to apply configuration properties to the server by setting the properties in the connection URL.
https://www.cloudera.com/documentation/other/connectors/impala-jdbc/latest/Cloudera-JDBC-Driver-for-...

Good luck.

avatar
Explorer

Hi @AcharkiMed

 

I tried setting the Batch size in the connection URL but I didn't get any performance boost in the query fetching time. I have posted my usecase in the cloudera forum. Kindly answer my questions :

 

 

avatar
Master Collaborator

Hi,

Please try to change all these 3 params:

TSaslTransportBufSize=4000;
RowsFetchedPerBlock=60536;
SSP_BATCH_SIZE=60536;

avatar
Explorer

Hi @AcharkiMed

 

As you suggested me to set 

TSaslTransportBufSize=4000;
RowsFetchedPerBlock=60536;
SSP_BATCH_SIZE=60536;

in the connection URL. I did the changes but i am getting these errors

java.sql.SQLException: [Simba][ImpalaJDBCDriver](500051) ERROR processing query/ statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, sqlState:H Y000, errorMessage:Invalid query option: SSP_BATCH_SIZE
), Query: SET SSP_BATCH_SIZE=60536.
        at com.cloudera.hivecommon.api.HS2Client.executeStatementInternal(Unknow n Source) ~[Impala-JDBC-41-1.0.0.jar!/:na]

 and 

java.sql.SQLException: [Simba][ImpalaJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, sqlState:HY000, errorMessage:Invalid query option: TSaslTransportBufSize
), Query: SET TSaslTransportBufSize=4000.

Help me set up the property.

 

Thank You,

Bishnu

avatar
Master Collaborator

Hi @Bishnup

If you still have the same problem please try to share with us your URL string.