About Tim Armstrong

Tim Armstrong · ‎06-23-2016

I don't think it's on the immediate roadmap. Our focus recently has been on various other things (performance, Amazon EC2 support, etc.).

Tim Armstrong · ‎06-23-2016

That feature has not made it in, unfortunately. The documentation at http://www.cloudera.com/documentation.html is the source of truth about which features are or are not present.

Tim Armstrong · ‎06-17-2016

OK, that's interesting. The nonsense-looking symbol at the top is probably jitted code from your query, most likely an expression or something like that:

36.50% perf-18476.map [.] 0x00007f3c1d634b82

The other symbols like GetDoubleVal() may be what is calling this expensive function. It looks like ProbeTime in the profile may be the culprit. Can you share the SQL for your query at all? I'm guessing that there's some expression in your query that's expensive to evaluate, e.g. joining on some complex expression or doing some kind of expensive computation.

Tim Armstrong · ‎06-17-2016

Maybe run 'perf top' to see where it's spending the time? I'd expect the scan to run on one core and the join and insert to run on a different core.

Tim Armstrong · ‎06-17-2016

There's something strange going on here: the profile reports that the scan took around 12 seconds of CPU time but 17 minutes of wall-clock time. So for whatever reason the scan is spending most of its time swapped out and unable to execute.

- MaterializeTupleTime(*): 17m20s
- ScannerThreadsSysTime: 74.049ms
- ScannerThreadsUserTime: 12s312ms

Is the system under heavy load or is it swapping to disk?
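To make the gap between CPU time and wall-clock time concrete, here is a minimal Python sketch that converts the three counter values quoted in this post to seconds and compares them. The helper name `parse_duration` is hypothetical, and it only handles the unit suffixes that appear in those values (`m`, `s`, `ms`); other units would need to be added. Treating MaterializeTupleTime as the wall-clock figure follows the reading given in the post.

```python
import re

# Seconds per unit, limited to the suffixes seen in the counters quoted above.
_UNIT_SECONDS = {"m": 60.0, "s": 1.0, "ms": 1e-3}

# Try "ms" before "m" and "s" so that "312ms" is not misread as minutes.
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|m|s)")


def parse_duration(text):
    """Convert a duration such as '17m20s' or '12s312ms' to seconds."""
    parts = _PART.findall(text)
    # Reject input that the pattern did not consume completely.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


wall = parse_duration("17m20s")                                  # MaterializeTupleTime(*)
cpu = parse_duration("12s312ms") + parse_duration("74.049ms")    # user + sys time
print(f"wall-clock: {wall:.0f}s, CPU: {cpu:.3f}s, CPU share: {cpu / wall:.1%}")
```

With these inputs the scanner threads were on a CPU for only about 1% of the reported time, which is why heavy load or swapping is the first thing to check.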

Tim Armstrong · ‎05-23-2016

Yes, that's right. It's enabled for some cases by default (broadcast joins) in Impala 2.5. To enable it for a wider category of joins you can set the query option runtime_filter_mode=global. This setting will become the default in Impala 2.7 because of the performance benefits.

Tim Armstrong · ‎05-02-2016

We often use TPC-H and TPC-DS; they're pretty standard for analytical databases. There's a TPC-DS kit for Impala here: https://github.com/cloudera/impala-tpcds-kit

Tim Armstrong · ‎05-02-2016

There's no direct way to find out from the profile, unfortunately. If you have a live system you can look at the /threadz page on the Impala debug web page (port 25000 on each Impala daemon by default) to see how many hdfs-scan-node threads are running.

Tim Armstrong · ‎04-29-2016

Impala limits the number of threads executing the query plan by design. Impala dynamically increases the number of scanner threads provided there are CPU and memory resources available; in this case it seems like there weren't CPU resources available. If the machine is already busy, adding more threads can actually decrease query throughput.

Tim Armstrong · ‎04-28-2016

The algorithm is described in the Impala source code here, if you (or anyone else reading) are interested: https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/partitioned-hash-join-node.h

Member Since	‎07-29-2015 04:07 PM
Last Visited	‎02-11-2021 06:07 PM
Posts	535
Kudos received	140

Cloudera Community

Re: Impala Queries which were previously working a...

Re: Impala queries are not distributing to all the...

Re: impala - `recover partitions` points to old da...

Re: impala catalog server JVM

Re: Impala - On-demand metadata

Re: How to create impala derived tables

Re: How to create impala derived tables

Re: Improve impala execution rate?

Re: Improve impala execution rate?

Re: Improve impala execution rate?

Re: Partition Pruning without constant in query

Re: impala memory limit exceed

Re: NumScannerThreadsStarted of a specified node i...

Re: NumScannerThreadsStarted of a specified node i...

Re: impala memory limit exceed