Explorer
Posts: 9
Registered: ‎09-04-2018
Accepted Solution

Impyla bad performance - rows fetch is very slow

Hi,

I have a program that uses Impyla to retrieve data from the local Impala daemon:

cursor.execute("select * from table;")
rows = cursor.fetchall()

The table has 5 million rows and 9 columns; exported as CSV, it is about 200 MB.

There are four data nodes. Memory is 32 GB.

Even with this modest amount of data, fetchall() takes over 200 seconds.

The query execution itself finishes in 0.2 seconds.

Why is it so slow?

Do you have any ideas for speeding this up?

Thanks!

Cloudera Employee
Posts: 395
Registered: ‎07-29-2015

Re: Impyla bad performance - rows fetch is very slow

Impala is a streaming SQL engine, so query execution can actually happen at the same time as rows are returned to the client. In your case, we don't scan the whole table, put the rows somewhere, then return the rows to the client. Rather, Impala just returns rows to the client at the same time as it's scanning the table.

The bottleneck is likely in the client or network. Impyla is not particularly fast at parsing incoming rows and converting them into Python objects. The Impala server is much, much faster. There's also a known issue that means that latency between the client and the server can affect the time taken to return rows: https://issues.apache.org/jira/browse/IMPALA-1618
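
One way to see where the time goes is to time the execute and fetch phases separately and to pull rows in fixed-size batches with the DB-API fetchmany() call instead of a single fetchall(). The sketch below is only an illustration: it uses the standard-library sqlite3 module as a stand-in for an Impyla connection (both follow the Python DB-API), and the helper names iter_batches and timed_fetch, the demo table, and the batch size are made up for the example. It will not by itself make Impyla faster; it just makes the per-phase cost visible and avoids holding every row in memory at once.

```python
import sqlite3
import time


def iter_batches(cursor, batch_size=10000):
    """Yield lists of rows from an already-executed DB-API cursor."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


def timed_fetch(cursor, sql, batch_size=10000):
    """Run sql and return (row_count, execute_seconds, fetch_seconds)."""
    start = time.perf_counter()
    cursor.execute(sql)
    executed = time.perf_counter()
    row_count = 0
    for batch in iter_batches(cursor, batch_size):
        row_count += len(batch)  # process each batch here instead of keeping it
    fetched = time.perf_counter()
    return row_count, executed - start, fetched - executed


# Stand-in database (assumption for the example). With Impyla the cursor
# would come from impala.dbapi.connect(...).cursor() instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo (id INTEGER, name TEXT)")
cur.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [(i, "row%d" % i) for i in range(25000)],
)

count, execute_s, fetch_s = timed_fetch(cur, "SELECT * FROM demo")
print(count, execute_s, fetch_s)
```

If the fetch phase dominates even though the server is idle, that points at client-side row conversion or round trips rather than the scan.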

Explorer
Posts: 9
Registered: ‎09-04-2018

Re: Impyla bad performance - rows fetch is very slow

Thank you for answering.

So that means the time spent in "cursor.fetchall()" includes the HDFS scan time.

Even so, the bottleneck is not the HDFS scan but the client or the network.

I checked the issue below, but my reading is that the problem occurs when a fetch size smaller than the default batch size is specified.
https://issues.apache.org/jira/browse/IMPALA-1618

I am not sure whether it can also occur when using "cursor.fetchall()".

I have found an issue that reports the same thing.
https://github.com/cloudera/impyla/issues/239

Wes McKinney says it is a problem of hs2client.

So my understanding is that there is currently no solution.

Thanks!

Cloudera Employee
Posts: 395
Registered: ‎07-29-2015

Re: Impyla bad performance - rows fetch is very slow

Yeah, we need to make some changes in Impala to better optimise this case (large SELECT result sets). Some of that work is already under way in Impala.

If you're doing large extracts of data, it's often better to do a "CREATE TABLE AS SELECT" into a text table and download those files directly from the filesystem, if that's possible.
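
As a rough sketch of that approach, the statement could be built along these lines. The table names, the delimiter, and the helper name build_export_ctas are hypothetical; the statement would be run through the same cursor.execute() call, and the resulting files then copied out of the new table's storage directory with whatever filesystem tooling is available.

```python
def build_export_ctas(source_table, export_table, delimiter=","):
    """Return an Impala CTAS statement that rewrites source_table as delimited text.

    The names are inserted as-is, so only pass trusted identifiers.
    """
    return (
        "CREATE TABLE {export} "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delim}' "
        "STORED AS TEXTFILE "
        "AS SELECT * FROM {source}"
    ).format(export=export_table, delim=delimiter, source=source_table)


# Hypothetical table names, for illustration only.
sql = build_export_ctas("my_table", "my_table_export")
print(sql)
# cursor.execute(sql)  # run against Impala, then download the table's files
```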
