Impyla bad performance - rows fetch is very slow

uma66 — Fri, 16 Sep 2022 14:12:35 GMT

Hi,

There is a program that uses Impyla to retrieve data from the local Impala daemon.

cursor.execute("select * from table;")
rows = cursor.fetchall()

The table has 5 million rows and 9 columns. Exported as CSV, it is about 200 MB.

There are four data nodes. Memory is 32 GB.

Even with this modest amount of data, fetchall() takes over 200 seconds.

Query execution finishes in 0.2 seconds.

Why is it so slow?

Do you have any ideas for speeding this up?

Thanks!

Re: Impyla bad performance - rows fetch is very slow

Tim Armstrong — Tue, 05 Mar 2019 17:54:10 GMT

Impala is a streaming SQL engine, so query execution can actually happen at the same time as rows are returned to the client. In your case, we don't scan the whole table, put the rows somewhere, then return the rows to the client. Rather, Impala returns rows to the client at the same time as it's scanning the table.

The bottleneck is likely in the client or network. Impyla is not particularly fast at parsing incoming rows and converting them into Python objects. The Impala server is much, much faster. There's also a known issue where latency between the client and the server can affect the time taken to return rows: https://issues.apache.org/jira/browse/IMPALA-1618.
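
One way to see where the time goes is to time execute() and the row fetch separately, and to pull rows in batches instead of building one huge list with fetchall(). The following is only a minimal sketch, assuming a standard DB-API 2.0 cursor such as the one Impyla provides. The helper names iter_rows and timed_fetch and the batch size of 10,000 are made up for illustration, and batching by itself is not guaranteed to make the fetch faster.

```python
import time
from typing import Any, Iterator, Sequence


def iter_rows(cursor: Any, batch_size: int = 10_000) -> Iterator[Sequence[Any]]:
    """Yield rows from an already-executed DB-API cursor, one batch at a time."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            # An empty batch means the result set is exhausted.
            return
        yield from batch


def timed_fetch(cursor: Any, sql: str, batch_size: int = 10_000) -> dict:
    """Run sql and report how long execute() and the row fetch each took."""
    t0 = time.perf_counter()
    cursor.execute(sql)
    t1 = time.perf_counter()
    # Count rows without keeping them all in memory.
    row_count = sum(1 for _ in iter_rows(cursor, batch_size))
    t2 = time.perf_counter()
    return {"rows": row_count, "execute_s": t1 - t0, "fetch_s": t2 - t1}


# Hypothetical usage against Impala (host and port are assumptions):
#   from impala.dbapi import connect
#   cursor = connect(host="localhost", port=21050).cursor()
#   print(timed_fetch(cursor, "select * from table"))
```

If fetch_s dominates while execute_s stays small, that is consistent with the time being spent streaming and converting rows on the client side rather than in the scan itself.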

Re: Impyla bad performance - rows fetch is very slow

uma66 — Wed, 06 Mar 2019 02:24:10 GMT

Thank you for answering.

So that means the time taken by "cursor.fetchall()" includes the HDFS scan time.

Even so, the bottleneck is not the HDFS scan but the client or the network.

I looked at the issue below, but I understood it to occur only when a fetch size smaller than the default batch size is specified.
https://issues.apache.org/jira/browse/IMPALA-1618

I am not sure whether it can also occur when using "cursor.fetchall()".

I have found an issue that shows the same thing.
https://github.com/cloudera/impyla/issues/239

Wes McKinney says it is a problem in hs2client.

So I take it there is no solution for now...

Thanks!

Re: Impyla bad performance - rows fetch is very slow

Tim Armstrong — Thu, 07 Mar 2019 17:14:04 GMT

Yeah, we need to make some changes in Impala to better optimise this case (large SELECT result sets). We have some of that work in Impala.

If you're doing large extracts of data, it's often better to do a "CREATE TABLE AS SELECT" into a text table and download those files directly from the filesystem, if that's possible.
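
As a rough illustration of the "CREATE TABLE AS SELECT" approach, the sketch below only builds the statement text for a comma-delimited text table. The function name build_ctas_export and the table name my_export are hypothetical, and table names are inserted without any escaping, so this is only suitable for trusted identifiers.

```python
def build_ctas_export(source_table: str, export_table: str, delimiter: str = ",") -> str:
    """Build an Impala CTAS statement that writes source_table as delimited text."""
    return (
        f"CREATE TABLE {export_table} "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        f"STORED AS TEXTFILE "
        f"AS SELECT * FROM {source_table}"
    )


# Hypothetical usage with an existing DB-API cursor:
#   cursor.execute(build_ctas_export("table", "my_export"))
# The resulting text files can then be copied out of the table's storage
# location, for example with `hdfs dfs -get`, instead of being fetched row
# by row through the client.
```

This keeps the heavy lifting inside Impala and turns the client's job into a plain file download, which avoids the per-row conversion into Python objects.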