About Tim Armstrong

Tim Armstrong · ‎05-23-2016

Yes, that's right. It's enabled for some cases by default (broadcast joins) in Impala 2.5. To enable it for a wider category of joins you can set the query option runtime_filter_mode=global. This setting will become the default in Impala 2.7 because of the performance benefits.

Tim Armstrong · ‎05-02-2016

Unfortunately, there's no direct way to find this out from the profile. If you have a live system, you can look at the /threadz page on the Impala debug web page (port 25000 on each Impala daemon by default) to see how many hdfs-scan-node threads are running.

jarourbtb · ‎04-25-2016

Thank you very much, Tim and Ivan. I have tested on the customer's system, and with "set disable_codegen=1;" the query does NOT crash the whole Impala cluster anymore. I will inform the customer and provide him with a link to this discussion.

ZL · ‎04-22-2016

Actually, it is the DataNode doing it. I guess I'll ask more about it as an HDFS topic. Thanks!

Tim Armstrong · ‎02-23-2016

Hi, there are many possible variables, including the exact version of Impala, the operating system it was built on, the build flags and environment variables, and what version/build of dependencies you're using. I think the specific thing you're probably seeing with file sizes is that in the CDH distribution the debug symbols are stripped from the binaries and included in separate impalad.debug files. Are you running into some error when trying to run your custom build of Impala? It probably makes more sense to debug that problem rather than trying to exactly reproduce Cloudera's build.

Tim Armstrong · ‎01-27-2016

You are most likely running into this bug with the aggregation: IMPALA-2352 (https://issues.cloudera.org/browse/IMPALA-2352). We fixed it in CDH5.5/Impala 2.3 but the change wasn't backported because it was deemed too risky for a maintenance release.

Tim Armstrong · ‎01-20-2016

You can create a 1-row dummy table like this:

    select 1 id, 'a' d from (select 1) dual where 1 = 1

You also have to rewrite the query to avoid an uncorrelated not exists. You can do something like:

    select 1 id, 'a' d from (select 1) dual where (select count(*) from employee where empid > 20000) = 0

Computing the count might be expensive, so you could add a limit like:

    select 1 id, 'a' d from (select 1) dual where (select count(*) from (select id from employee where empid > 20000 limit 1) emp) = 0
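
To see how the rewritten query behaves, here is a minimal sketch that runs the same pattern against an in-memory SQLite database from Python. SQLite is only a stand-in here, not Impala, and the employee table, its rows, and the helper name dummy_row_if_no_match are made up for illustration. The inner subquery selects empid because that is the only column in this toy table.

```python
import sqlite3


def dummy_row_if_no_match(conn, threshold):
    """Return the 1-row dummy result only when no empid exceeds threshold.

    Hypothetical helper: it wraps the "count of a limit-1 subquery = 0"
    pattern that replaces the uncorrelated NOT EXISTS.
    """
    query = """
        select 1 id, 'a' d
        from (select 1) dual
        where (select count(*)
               from (select empid from employee
                     where empid > ? limit 1) emp) = 0
    """
    return conn.execute(query, (threshold,)).fetchall()


# Toy data (assumption): two employees, neither with empid above 20000.
conn = sqlite3.connect(":memory:")
conn.execute("create table employee (empid integer)")
conn.executemany("insert into employee values (?)", [(100,), (15000,)])

rows_no_match = dummy_row_if_no_match(conn, 20000)  # no empid above 20000
rows_match = dummy_row_if_no_match(conn, 10000)     # 15000 is above 10000
```

With this toy data, the dummy row comes back only for the first call. The limit 1 means the inner subquery returns at most one row, so the count is always 0 or 1.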

Tim Armstrong · ‎01-18-2016

If you can switch to Parquet, that's probably the best solution: it's generally the most performant file format for reading and produces the smallest file sizes. If for some reason you need to stick with text, the uncompressed data size needs to be < 1GB per file.
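
If you stay with text, one way to check files against that limit is sketched below. It assumes the text files are gzip-compressed (an assumption on my part, the post does not name the codec) and reads "1GB" as 2**30 bytes. The function names are hypothetical.

```python
import gzip

# Assumption: "1GB" is interpreted as 2**30 bytes.
ONE_GB = 1024 ** 3


def uncompressed_size(path, chunk_size=1024 * 1024):
    """Stream a gzip file and return the number of uncompressed bytes."""
    total = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def exceeds_limit(path, limit=ONE_GB):
    """True when the uncompressed size is not strictly below the limit."""
    return uncompressed_size(path) >= limit
```

Streaming in chunks keeps memory use flat even for large files, at the cost of decompressing the whole file once.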

Tim Armstrong · ‎12-27-2015

The advice in this thread is out of date: memory usage for joins and aggregations has been improved a lot in CDH5.5. Your issue is something different since the query doesn't have a join or group by (aggregation) in it. The first step to understand this better is to look at the impalad logs: there is usually some information in there about why the memory limit was exceeded and what operators were consuming memory.

Tim Armstrong · ‎12-18-2015

I misread your question and didn't realise you wanted the per-host peak. PerHostPeakMemUsage gives you exactly what you want.

Member Since	‎07-29-2015 04:07 PM
Last Visited	‎02-11-2021 06:07 PM
Posts	535
Kudos received	140

Cloudera Community

Re: Impala Queries which were previously working a...

Re: Impala queries are not distributing to all the...

Re: impala - `recover partitions` points to old da...

Re: impala catalog server JVM

Re: Impala - On-demand metadata

Re: Partition Pruning without constant in query

Re: NumScannerThreadsStarted of a specified node i...

Re: query causes Impala to crash

Re: What triggers du in Impala?

Re: problem about building impala

Re: Unexpected Spill to Disk Activity

Re: impala alternative to oracle dual table

Re: impala-shell returns impalad: TSocket read 0 b...

Re: Backend 6:Memory Limit Exceeded" in impala 2 (...

Re: Overall peak memory usage of a query?