Impyla bad performance - rows fetch is very slow

Hi,

 

There is a program that uses Impyla to retrieve data from the local Impala daemon.

 

cursor.execute("select * from table;")
rows = cursor.fetchall()

 

The table has 5 million rows and 9 columns; exported to CSV, it is about 200 MB.

There are four data nodes. Memory is 32 GB.

Even with so little data, fetchall() takes over 200 seconds.

The query execution itself finishes in 0.2 seconds.

Why is it so slow?

Do you have any ideas for speeding this up?

 

 

Thanks!

 

1 ACCEPTED SOLUTION

Yeah we need to make some changes in Impala to optimise this case (large SELECT result sets) better. We have some of that work in Impala.

 

If you're doing large extracts of data, it's often better to do a "CREATE TABLE AS SELECT" into a text table and download those files directly from the filesystem, if that's possible.
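
A minimal sketch of the "CREATE TABLE AS SELECT" approach, assuming a DB-API cursor such as the one Impyla provides. The helper names (`build_ctas`, `export_via_ctas`), the table names, and the host and port in the usage comment are placeholders for illustration, not part of the original post; the table location to download from depends on your cluster.

```python
def build_ctas(source_table, target_table, delimiter=","):
    """Return a CTAS statement that copies source_table into a delimited text table."""
    return (
        f"CREATE TABLE {target_table} "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        "STORED AS TEXTFILE "
        f"AS SELECT * FROM {source_table}"
    )


def export_via_ctas(cursor, source_table, target_table):
    """Run the CTAS on an already-open DB-API cursor and return the SQL that was used."""
    sql = build_ctas(source_table, target_table)
    cursor.execute(sql)
    return sql


# Usage sketch (not executed here; host, port and table names are assumptions):
#   from impala.dbapi import connect
#   cursor = connect(host="impalad-host", port=21050).cursor()
#   export_via_ctas(cursor, "my_table", "my_table_export")
#   # Then copy the new table's data files out of the filesystem,
#   # for example with `hdfs dfs -get <table location> .`
```

The rows are then written by the Impala daemons in parallel, and the client only has to copy plain text files instead of converting every row into Python objects.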

3 REPLIES 3

Impala is a streaming SQL engine so query execution can actually happen at the same time as rows are returned to the client. In your case, we don't scan the whole table, put the rows somewhere, then return the rows to the client. Rather Impala just returns rows to the client at the same time as it's scanning the table.

 

The bottleneck is likely in the client or the network. Impyla is not particularly fast at parsing incoming rows and converting them into Python objects. The Impala server is much, much faster. There's also a known issue that means latency between the client and the server can affect the time taken to return rows: https://issues.apache.org/jira/browse/IMPALA-1618.
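
To see where the time goes, one option is to time `execute()` and the row fetch separately and to pull rows in batches with `fetchmany()` rather than one `fetchall()`. The sketch below uses the standard-library `sqlite3` module purely as a stand-in so it runs anywhere; the assumption is that an Impyla cursor, which follows the same Python DB-API, accepts the same calls. The helper names and the batch size are illustrative. Batching does not remove the per-row conversion cost described above, but it lets you process rows as they stream in and measure throughput.

```python
import sqlite3
import time


def iter_batches(cursor, batch_size=10000):
    """Yield lists of rows from an executed DB-API cursor, batch_size rows at a time."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


def timed_fetch(cursor, sql, batch_size=10000):
    """Return (execute_seconds, fetch_seconds, row_count) for one query."""
    t0 = time.perf_counter()
    cursor.execute(sql)
    t1 = time.perf_counter()
    row_count = sum(len(batch) for batch in iter_batches(cursor, batch_size))
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, row_count


# Stand-in demo: an in-memory SQLite table instead of an Impala table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO t VALUES (?, ?)", [(i, f"row{i}") for i in range(25000)])

exec_s, fetch_s, n = timed_fetch(cur, "SELECT * FROM t", batch_size=10000)
print(f"execute: {exec_s:.3f}s, fetch: {fetch_s:.3f}s, rows: {n}")
```

Against Impala, a short execute time next to a long fetch time would match the pattern described in the question, since rows are still being produced and transferred while the client fetches them.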


Thank you for answering.

 

So the time spent in "cursor.fetchall()" includes the HDFS scan time.

Even so, the bottleneck is not the HDFS scan but the client or the network.

I looked at the issue below, but I understood it to occur only when a size smaller than the default batch size is specified.
https://issues.apache.org/jira/browse/IMPALA-1618

I am not sure whether it can also occur when using "cursor.fetchall()".

I found an issue that describes the same problem.
https://github.com/cloudera/impyla/issues/239

Wes McKinney says it is a problem with hs2client.

So, as I understand it, there is no solution....

 

Thanks!

 
