Support Questions

uma66 · ‎03-05-2019

Hi,

There is a program that uses Impyla to retrieve data from the local Impala daemon.

cursor.execute("select * from table;")
rows = cursor.fetchall()

The table has 5 million rows and 9 columns; exported as CSV, it is about 200 MB.

There are four data nodes. Memory is 32 GB.

Even though the data is only that large, fetchall() takes over 200 seconds.

Query execution itself ends in 0.2 seconds.

Why is it so slow?

Do you have any ideas for speeding this up?

Thanks!

Tim Armstrong · ‎03-05-2019

Impala is a streaming SQL engine so query execution can actually happen at the same time as rows are returned to the client. In your case, we don't scan the whole table, put the rows somewhere, then return the rows to the client. Rather Impala just returns rows to the client at the same time as it's scanning the table.
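Because rows arrive while the scan is still running, the client does not have to wait for fetchall() to build one 5-million-row list. A minimal sketch, assuming a DB-API cursor such as the one Impyla provides; the function name, the output file and the batch size of 10000 are made up for illustration. It will not make Impyla's row conversion faster, it only avoids holding every row in memory at once.

```python
import csv


def stream_to_csv(cursor, out_file, batch_size=10000):
    """Write rows from an already-executed DB-API cursor to out_file in batches.

    Returns the number of rows written.
    """
    writer = csv.writer(out_file)
    total = 0
    while True:
        # fetchmany() is part of the Python DB-API; it returns an empty
        # sequence once the result set is exhausted.
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
    return total
```

With Impyla this would be called after cursor.execute(...), passing an open file such as open("out.csv", "w", newline="").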

The bottleneck is likely in the client or the network. Impyla is not particularly fast at parsing incoming rows and converting them into Python objects. The Impala server is much, much, much faster. There's also a known issue that means latency between the client and the server can affect the time taken to return rows: https://issues.apache.org/jira/browse/IMPALA-1618.
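One way to see where the time goes is to time execute() and the fetch loop separately. This is only a rough sketch that works with any DB-API cursor; the function name and the batch size are assumptions. Keep in mind that with Impala the fetch time still includes the scan, since rows are streamed while the query runs.

```python
import time


def time_execute_and_fetch(cursor, query, batch_size=10000):
    """Run query on a DB-API cursor and time execute() and fetching separately."""
    t0 = time.perf_counter()
    cursor.execute(query)
    t1 = time.perf_counter()

    row_count = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        row_count += len(rows)
    t2 = time.perf_counter()

    return {
        "execute_s": t1 - t0,  # time until execute() returned
        "fetch_s": t2 - t1,    # time spent pulling and converting rows
        "rows": row_count,
    }
```

If fetch_s dominates while the server-side query profile shows little work, that points at the client or the network rather than the scan.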

uma66 · ‎03-05-2019

Thank you for answering.

So that means the time spent in "cursor.fetchall()" includes the HDFS scan time.

Even so, the bottleneck is not the HDFS scan but the client or the network.

I looked at the issue below, but I understood it to occur only when a fetch size smaller than the default batch size is specified.
https://issues.apache.org/jira/browse/IMPALA-1618

I am not sure whether it can also occur when using "cursor.fetchall()".

I have found an issue that shows the same thing.
https://github.com/cloudera/impyla/issues/239

Wes McKinney says it is a problem of hs2client.

So, as far as I can tell, there is no solution...

Thanks!

Tim Armstrong · ‎03-07-2019

Yeah, we need to make some changes in Impala to better optimise this case (large SELECT result sets). We have some of that work in Impala.

If you're doing large extracts of data, it's often better to do a "CREATE TABLE AS SELECT" into a text table and download those files directly from the filesystem, if that's possible.
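A rough sketch of what that could look like from Python. The helper name, the table names export_tmp and my_table, and the comma delimiter are made up for illustration; the statement uses Impala's CREATE TABLE ... ROW FORMAT DELIMITED ... STORED AS TEXTFILE AS SELECT form.

```python
def build_ctas_export(target_table, select_sql, delimiter=","):
    """Build a CREATE TABLE AS SELECT statement that writes delimited text files."""
    return (
        f"CREATE TABLE {target_table} "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        f"STORED AS TEXTFILE "
        f"AS {select_sql}"
    )


# Hypothetical table names, for illustration only.
sql = build_ctas_export("export_tmp", "SELECT * FROM my_table")
print(sql)

# With an Impyla cursor this would be run as:
#   cursor.execute(sql)
# The resulting files could then be copied from the table's HDFS location
# (shown by DESCRIBE FORMATTED export_tmp), for example with:
#   hdfs dfs -get <table location> ./export
```

This way the rows never pass through the Python client at all; Impala writes the files and the download is a plain file copy.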

Cloudera Community

Impyla bad performance - rows fetch is very slow