Member since: 07-29-2015
Posts: 535
Kudos Received: 141
Solutions: 103
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 7741 | 12-18-2020 01:46 PM |
|  | 5050 | 12-16-2020 12:11 PM |
|  | 3851 | 12-07-2020 01:47 PM |
|  | 2504 | 12-07-2020 09:21 AM |
|  | 1633 | 10-14-2020 11:15 AM |
01-21-2020
01:45 PM
The estimated stats size is calculated as 400 bytes * # columns * # partitions. The option prevents you from computing incremental stats on tables with too many columns and partitions (it guards against the scenario where memory usage from incremental stats creeps up as tables grow, eventually causing an outage). So you probably want to set it based on the expected size of the largest table you will be using incremental stats on; that helps prevent someone from accidentally computing incremental stats on an even larger table.

A few other comments. Non-incremental stats are generally more robust, but we understand it's sometimes challenging or impractical to do a full compute stats on all tables. So if the calculation above spits out a huge number, you might want to reconsider using incremental stats on that table.

You also need to be careful with bumping *only* the catalogd heap size. On versions prior to CDH5.16, all coordinator Impala daemons need a heap as large as the catalogd's, since the catalog cache is replicated to them. That was addressed for incremental stats specifically in CDH5.16 by *not* replicating the incremental stats (all other state is still replicated). CDH5.16 also reduced the memory consumption substantially (incremental stats use roughly 5x less memory), so the estimated stats size drops to 200 bytes * # columns * # partitions.
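As a rough worked example of the formula above (purely hypothetical numbers, just to illustrate the arithmetic), a table with 150 columns and 20,000 partitions would be estimated at about 1.2 GB of incremental stats, which you can sanity-check directly from impala-shell:

-- hypothetical table shape: 150 columns, 20,000 partitions
-- estimate = bytes-per-entry * columns * partitions
SELECT 400 * 150 * 20000 AS est_stats_bytes,          -- 1,200,000,000 bytes (~1.1 GiB) pre-CDH5.16
       200 * 150 * 20000 AS est_stats_bytes_cdh516;   -- ~600 MB with the CDH5.16 improvements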
11-27-2019
11:18 AM
I assume you're seeing something like what I attached here. I ran that query from impala-shell and it's going to sit in the FINISHED state for several minutes while the shell fetches all of the ~6 million rows. The reason is simply that it takes a while for the client to fetch the results, even if it's a client like impala-shell or JDBC that actively fetches them. You'll see this for queries with large result sets, particularly if the connection from client to server is slow or has high latency. I'd expect that the client will eventually get to the end of the result set and close the query on its own. Hue is a bit different because it only fetches results as needed, so it can hold queries open even if they have small result sets. There will be some significant perf and resource management improvements for use cases like this in the versions of Impala that come with CDP - e.g. https://issues.apache.org/jira/browse/IMPALA-8656 helps with this in various ways.
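If the goal is just to materialize a large result set rather than stream it back through the client, one option (a sketch with hypothetical table and column names) is to write the results out on the Impala side so the client never has to fetch millions of rows:

-- writes the results server-side instead of fetching ~6M rows through impala-shell/JDBC
CREATE TABLE results_snapshot STORED AS PARQUET AS
SELECT * FROM big_source_table WHERE event_date >= '2019-01-01';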
11-06-2019
10:09 AM
Also, if you have a support contract with Cloudera, this is something they can help you with in more detail through that channel; we've successfully resolved this for customers before.
11-06-2019
10:08 AM
1 Kudo
I updated the JIRA to include workarounds, just FYI.
11-06-2019
10:03 AM
This generally happens when files are overwritten in place while Impala is still trying to read a cached version of the file - e.g. INSERT OVERWRITE in Hive. So you can often avoid the problem by avoiding doing that. Otherwise, doing a REFRESH of the table should resolve it.
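For reference, the statements look like this (hypothetical database, table and partition names):

-- re-sync Impala's cached file metadata after files were rewritten outside of Impala
REFRESH mydb.mytable;

-- or, on recent versions, refresh just the partition that was overwritten (cheaper on big tables)
REFRESH mydb.mytable PARTITION (dt='2019-11-06');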
11-04-2019
12:03 PM
1 Kudo
Increasing the Impala memory limit isn't likely to have any negative consequences for Impala itself. The only potential downside, based on what you posted, is that reducing the memory available to Linux for I/O caching could slow down scans.

> I'd like to allow some queries that tend to overreach on Impala RAM additional capacity to do what they need to do. These queries read some big tables, sometimes with thousands of partitions, and they have a tendency to run out of RAM.

If you haven't already, it's probably worth trying to tune the queries, since they're still going to be relatively slow even if they stay fully in memory. There's no reason queries over big tables are necessarily memory-intensive; it's usually only certain operations or suboptimal plans. There's some high-level advice here if you haven't already seen it: https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_scalability.html#spill_to_disk . The most common problem I see leading to unnecessary spilling is large right-side inputs to hash joins. If you can tweak the query to reduce the amount of data on that side of the join, or if the join order/strategy in the plan is suboptimal, you can get dramatic improvements. You probably already checked this, but if you don't have stats on all tables in the query, that's the first thing to address (see the sketch below).

> Currently, I don't have any admission control settings enabled. Any query can use all the available resources. I'd like to increase the available RAM for all of Impala while limiting the RAM for individual queries.

You didn't say what version you're using, but CDH6.1+ has admission control settings that automatically allocate variable amounts of memory to a query, within a min/max memory limit range that is configurable by the cluster admin. That is probably useful for you.
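To make the join-tuning advice a bit more concrete, here's a sketch of where I'd start (hypothetical table and column names):

-- inspect join order, join strategy (BROADCAST vs PARTITIONED) and memory estimates;
-- missing or unavailable cardinality estimates in the plan are a hint that stats are absent
EXPLAIN
SELECT d.region, count(*)
FROM big_fact f JOIN small_dim d ON f.dim_id = d.id
GROUP BY d.region;

-- collecting stats on every table in the query is the first thing to fix if they're missing
COMPUTE STATS big_fact;
COMPUTE STATS small_dim;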
10-23-2019
02:51 PM
I'm less familiar with Hive, but I think you have to do something like:

select date_format(UNIX_TIMESTAMP('2019-Oct-14 20:00:01.027898', 'yyyy-MMM-dd HH:mm:ss.SSSSSS'), 'yyyy-MM-dd HH:mm:ss.SSSSSS');
10-23-2019
02:27 PM
1 Kudo
It looks like there was plenty of memory available in the system; that query just hit its individual memory limit. There were a lot of improvements to avoid out-of-memory between 5.15 and 6.1, particularly for queries with a lot of scans that use a significant amount of memory. It looks like one of the scans was using a large chunk of the query memory:

HDFS_SCAN_NODE (id=0): Total=393.15 MB Peak=393.15 MB

There's one specific regression that I'm aware of that affected Avro scans: https://issues.apache.org/jira/browse/IMPALA-7078. The fix is in 5.15.1 and 5.15.2. I don't know the file format, but thought I'd flag that. The IMPALA-7078 fix actually had a few tweaks that would benefit all file formats too. So I'd suggest:

- Give the queries a bit more memory. In practice we've seen 2GB work a lot better across a wider variety of queries on CDH5.x; 1GB is a bit tight for a query with 49 operators.
- Pick up the 5.15.2 or 5.16.2 maintenance releases to get the fix for IMPALA-7078; that may be enough to solve the problem.
- Look at CDH6.1, which addresses a bunch of issues in this area more systematically by moving the scan operators to a much more robust memory throttling/reservation system (I spent a bunch of time last year working on problems in this general area). 1GB might just not be enough to run a query with that many operators on the version of Impala you're running.
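For the first suggestion, the per-query memory limit can be raised as a query option before running the statement; the 2GB value here is just the rule of thumb from above, not a measured requirement:

-- raise the per-query memory limit for this session to 2GB
-- (value in bytes; suffixes like '2g' or '2gb' are also accepted)
SET MEM_LIMIT=2147483648;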
10-23-2019
02:10 PM
[localhost:21000] default> select to_timestamp('2019-Oct-14 20:00:01.027898', 'yyyy-MMM-dd HH:mm:ss.SSSSSS');
Query: select to_timestamp('2019-Oct-14 20:00:01.027898', 'yyyy-MMM-dd HH:mm:ss.SSSSSS')
Query submitted at: 2019-10-23 14:08:19 (Coordinator: http://tarmstrong-box:25000)
Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=0d4bd87f063c53a2:c8c5759b00000000
+----------------------------------------------------------------------------+
| to_timestamp('2019-oct-14 20:00:01.027898', 'yyyy-mmm-dd hh:mm:ss.ssssss') |
+----------------------------------------------------------------------------+
| 2019-10-14 20:00:01.027898000 |
+----------------------------------------------------------------------------+
Fetched 1 row(s) in 0.11s
[localhost:21000] default>

The default timestamp format accepted by Impala is ISO 8601 - https://en.wikipedia.org/wiki/ISO_8601. to_timestamp() lets you specify a format string if you want more flexibility about input timestamp formats - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_datetime_functions.html#datetime_functions__to_timestamp. You can see above how it might work.
09-20-2019
10:11 AM
> So it looks like column specific is only on a table without partitions (non-incremental)

@hores that's incorrect; non-incremental COMPUTE STATS works on partitioned tables and is generally the preferred method for collecting stats on partitioned tables. We've generally tried to steer people away from incremental stats because of the size issues on large tables. Column-specific incremental stats would also be error-prone to use correctly and complex to implement: what happens if you compute incremental stats with different subsets of the columns? You could end up with different column subsets on different partitions and then have to somehow reconcile it all each time.
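To make the distinction concrete, here's a hedged sketch against a hypothetical partitioned table (the column-list form is only accepted by non-incremental COMPUTE STATS, which matches the behavior discussed above):

-- full (non-incremental) stats: works fine on partitioned tables, keeps metadata compact
COMPUTE STATS sales_partitioned;

-- the non-incremental form also accepts a column list
COMPUTE STATS sales_partitioned (customer_id, amount);

-- incremental stats: per-partition stats with a much larger metadata footprint
-- on wide or heavily partitioned tables; no column list support
COMPUTE INCREMENTAL STATS sales_partitioned;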