New Contributor
Posts: 1
Registered: ‎04-23-2019

Error message from ODBC connection

Hi, 

I keep getting the following error when querying Impala through the ODBC connector (using the pyodbc package for Python):

 

pyodbc.OperationalError: ('08S01', '[08S01] [Cloudera][ImpalaODBC] (120) Error while retrieving data from in Impala: [08S01] : ImpalaThriftAPICallFailed (120) (SQLFetch)')

 

I have a smaller table (table1) and a bigger table (table2).

With the smaller table, I tried 1 and 10 processes and everything worked fine.

When I started 50 parallel Python processes, each with its own pyodbc connection to Impala, I got the error message above after a few seconds (when calling cursor.fetchmany(1000)).

With the bigger table, I got the error even with 1 process.

 

Client Setup:

Windows 10 + official Impala ODBC driver

The python program creates a process, connects with pyodbc to Impala and executes queries for 3 minutes. Then closes the cursor and the connection.

 

Cluster Setup:

1 master + 4 tablet servers

Impala 3.1.0, Kudu 1.8.0 (CDH 6.1 with default parameters)

Data stored in Kudu table1: ~0.7 × 10^9 rows (7 columns)

Data stored in Kudu table2: ~18 × 10^9 rows (7 columns)

 

Additional notes:

I also got this error on CentOS 7 with the official Impala ODBC driver.

With LIMIT 100 on the queries I still got this error, although it took longer to appear than it did without the limit.

With the JDBC connector, everything worked fine for 1, 10 and 50 processes.

 

 

Posts: 1,885
Kudos: 422
Solutions: 298
Registered: ‎07-31-2013

Re: Error message from ODBC connection

Are all of your processes connecting to the same Impala Daemon, or are you using a load balancer / varying connection options?

Each Impala Daemon can only accept a finite total number of active client connections, which is likely what you are running into.

Typically for concurrent access to a DB, it is better to use a connection pooling pattern with finite connections shared between threads of a single application. This avoids overloading a target server.

While I haven't used it myself, pyodbc may support connection pooling and reuse, which you could use from threads in Python instead of creating separate processes.
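To illustrate the pooling pattern described above, here is a minimal sketch of a fixed-size pool shared between threads. It does not rely on any pooling feature of pyodbc itself: ConnectionPool and run_query are hypothetical names, and the connect argument is assumed to be any zero-argument callable that returns a DB-API style connection (for pyodbc, something like a lambda wrapping pyodbc.connect with your own connection string).

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: exactly `size` connections are opened, never more."""

    def __init__(self, connect, size):
        # `connect` is a zero-argument callable returning a DB-API connection,
        # e.g. lambda: pyodbc.connect("DSN=impala")  (the DSN name is an assumption).
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self):
        conn = self._idle.get()       # blocks until another thread returns one
        try:
            yield conn
        finally:
            self._idle.put(conn)      # hand the connection back to the pool

    def close(self):
        # Close every idle connection; call this once all workers are done.
        while not self._idle.empty():
            self._idle.get_nowait().close()


def run_query(pool, sql, batch_size=1000):
    """Run one query on a pooled connection and return the number of rows read."""
    with pool.connection() as conn:
        cursor = conn.cursor()
        try:
            cursor.execute(sql)
            total = 0
            while True:
                rows = cursor.fetchmany(batch_size)
                if not rows:          # an empty batch means the result set is exhausted
                    break
                total += len(rows)
            return total
        finally:
            cursor.close()
```

With this shape, 50 worker threads (for example from concurrent.futures.ThreadPoolExecutor) can share a pool of, say, 8 connections, so the Impala Daemon never sees more than 8 client connections from the application regardless of how many workers are running.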

Alternatively, spread the connections around, either by introducing a load balancer, or by varying the target options for each spawned process. See https://www.cloudera.com/documentation/enterprise/latest/topics/impala_dedicated_coordinator.html and http://www.cloudera.com/documentation/other/reference-architecture/PDF/Impala-HA-with-F5-BIG-IP.pdf for further guidance and examples on this.
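If you go the route of varying the target per spawned process rather than using a load balancer, a rough sketch could look like the following. The host names are placeholders, assign_hosts and connection_string are hypothetical helpers, and the connection string keywords (Driver, Host, Port) are an assumption based on typical DSN-less strings for the Impala ODBC driver, so check them against your driver's documentation. Port 21050 is Impala's default HiveServer2 port.

```python
from itertools import cycle, islice


def assign_hosts(hosts, n_workers):
    """Assign each worker an Impala Daemon host in round-robin order."""
    return list(islice(cycle(hosts), n_workers))


def connection_string(host, port=21050):
    # Keyword names and driver name are assumptions; adjust to your installed driver.
    return f"Driver=Cloudera ODBC Driver for Impala;Host={host};Port={port}"


# Placeholder host names for four Impala Daemons.
IMPALADS = ["impalad1.example.com", "impalad2.example.com",
            "impalad3.example.com", "impalad4.example.com"]

# One connection string per spawned process; worker i would pass
# targets[i] to pyodbc.connect(...) instead of a single shared target.
targets = [connection_string(h) for h in assign_hosts(IMPALADS, 50)]
```

With 50 processes and 4 daemons, each daemon ends up with 12 or 13 connections instead of one daemon receiving all 50. A load balancer gives the same spreading without any client-side changes and also handles a daemon going down.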