About Tim Armstrong

Tim Armstrong · ‎08-21-2017

Impala doesn't support this Hive SerDe. In general Impala uses its own optimised parsing code instead of using Hive's SerDe infrastructure. If you're ingesting data from CSV and using the SerDe to do the conversion, I'd recommend using Hive to do the ETL to convert to a more efficient storage format, e.g. Parquet.

Tim Armstrong · ‎08-14-2017

Do you have a profile for any of the slow queries? I see some HDFS bugs that suggest that there may be problems along these lines when running mismatched versions of the HDFS client and HDFS datanode: https://issues.apache.org/jira/browse/HDFS-8070.

Tim Armstrong · ‎08-11-2017

It might be from_utc_timestamp(), which is being evaluated as part of the aggregation. It has some known perf issues (IMPALA-1577). It might be faster if you could restructure the query so that it evaluates from_utc_timestamp(max(...)) instead of max(from_utc_timestamp(...)), or otherwise find another way to reduce the number of rows feeding into the aggregation.
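The effect of that rewrite can be sketched outside Impala. The Python below uses a fixed UTC offset as a stand-in for from_utc_timestamp() (an assumption: the real function takes a time zone name) and counts how many conversions each form performs. The two forms agree here because a fixed offset preserves ordering; with a zone that observes DST, local wall-clock order can differ around a fall-back transition, so it is worth checking that the rewrite is acceptable for your data.

```python
from datetime import datetime, timedelta

calls = 0  # counts how often the conversion runs


def from_utc_fixed(ts, offset_hours=-8):
    """Hypothetical stand-in for from_utc_timestamp(): shift a naive UTC
    timestamp by a fixed offset."""
    global calls
    calls += 1
    return ts + timedelta(hours=offset_hours)


# 1000 made-up UTC timestamps, one minute apart.
rows = [datetime(2017, 8, 11, 0, 0) + timedelta(minutes=i) for i in range(1000)]

# max(from_utc_timestamp(...)): one conversion per input row.
calls = 0
per_row = max(from_utc_fixed(ts) for ts in rows)
per_row_calls = calls

# from_utc_timestamp(max(...)): one conversion on the aggregated value.
calls = 0
after_agg = from_utc_fixed(max(rows))
after_agg_calls = calls

print(per_row == after_agg, per_row_calls, after_agg_calls)
```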

Tim Armstrong · ‎08-11-2017

Llama is deprecated and shouldn't be in use - you should be safe to ignore that option.

Tim Armstrong · ‎08-11-2017

1) If you didn't set a memory limit for the query then the query may expand up to the process memory limit, i.e. the query memory limit is effectively the process memory limit. This is a pretty bad configuration for concurrent queries since the queries end up fighting it out for memory.

2) To answer your CM question directly, you can get the relevant metrics from the timeseries API. tcmalloc_physical_bytes_reserved_across_impalads is the process consumption and mem_tracker_process_limit_across_impalads is the limit. If you paste this into the Chart Builder you can see the averages of the two:

SELECT tcmalloc_physical_bytes_reserved_across_impalads, mem_tracker_process_limit_across_impalads WHERE entityName = "IMPALA-1" AND category = SERVICE

I'm wondering though if setting up admission control with resource pools and default query memory limits would solve your problem better than a custom solution: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html
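If you'd rather pull the same numbers programmatically, the tsquery from point 2 can be passed to the CM timeseries endpoint as the `query` parameter. This is only a sketch: the host name, port and API version below are placeholders, not values from this thread, and the request itself (which needs CM credentials) is left out so that only the URL construction is shown.

```python
from urllib.parse import urlencode

# Hypothetical CM address; replace with your own Cloudera Manager host and port.
CM_BASE = "http://cm-host.example.com:7180"
# Assumption: any CM API version that exposes /timeseries will do.
API_VERSION = "v15"

# The same tsquery that can be pasted into the Chart Builder.
TSQUERY = (
    "SELECT tcmalloc_physical_bytes_reserved_across_impalads, "
    "mem_tracker_process_limit_across_impalads "
    'WHERE entityName = "IMPALA-1" AND category = SERVICE'
)


def timeseries_url(base, version, tsquery):
    """Build the GET URL for the timeseries endpoint, URL-encoding the tsquery."""
    return f"{base}/api/{version}/timeseries?{urlencode({'query': tsquery})}"


url = timeseries_url(CM_BASE, API_VERSION, TSQUERY)
print(url)
```

Fetching that URL with an authenticated HTTP client should return the two series as JSON, which you can then compare in your own tooling.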

Tim Armstrong · ‎08-11-2017

Please feel free to post the profile. Compression may make it faster or slower, depending on whether the bottleneck for the query is CPU or IO.

Tim Armstrong · ‎08-11-2017

It's normal for idle Impala daemons to hold onto 1-2GB of memory plus the JVM heap memory. Any more than that may indicate something wrong. The memz debug page has diagnostics for this: http://impala-daemon:25000/memz?detailed=true . You can see there if a fragment of a query is holding onto memory. There are two common ways this can happen:

* The query wasn't cancelled and closed. In this case it should show up on the /queries page of the coordinator.
* The query was cancelled and closed, but a fragment continued running (this is a bug). In that case you should see a running fragment-execution thread in /threadz on the impala daemon web page.
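If you need to check several daemons, the three debug pages mentioned above can be generated per host. This is a small sketch: the host name is the placeholder from the URL above, and the helper `debug_url` is made up for illustration; only the port (25000) and the page names come from this post.

```python
from urllib.parse import urlencode


def debug_url(host, page, port=25000, **params):
    """Build a debug web page URL for an Impala daemon (hypothetical helper)."""
    query = f"?{urlencode(params)}" if params else ""
    return f"http://{host}:{port}/{page}{query}"


host = "impala-daemon"  # placeholder host name, replace with a real daemon
checks = [
    ("memory breakdown", debug_url(host, "memz", detailed="true")),
    ("queries not yet closed (check on the coordinator)", debug_url(host, "queries")),
    ("running threads", debug_url(host, "threadz")),
]

for label, url in checks:
    print(f"{label}: {url}")
```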

Tim Armstrong · ‎08-08-2017

Great to hear! CM's time series support is really pretty powerful.

Tim Armstrong · ‎08-08-2017

If the metrics are available in CM (e.g. if they're in the Impala Charts Library in CM, or you can query them in CM) then you should be able to get them through the CM timeseries API: https://cloudera.github.io/cm_api/apidocs/v15/path__timeseries.html

CM is capable of collecting all of Impala's metrics from the /metrics page, but it currently only collects a subset of them (the ones we think are most important for cluster monitoring). I'd be interested in knowing if there are useful metrics that are not currently collected by CM, and particularly in understanding the kind of high-level questions that you're trying to answer with metrics, so I can understand if there are metrics that could be added to Impala.

Tim Armstrong · ‎07-12-2017

@barnoba we strongly recommend *not* to use parquet-tools merge unless you really know what you're doing. It is known to cause some pretty bad performance problems in some cases. The problem is that it takes the row groups from the existing file and moves them unmodified into a new file - it does *not* merge the row groups from the different files. This can actually give you the worst of both worlds - you lose parallelism because the files are big, but you have all the performance overhead of processing many small row groups.
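To make the difference concrete, here is a toy model in Python. It is not how parquet-tools is implemented: each file is just a list of row-group sizes (in rows), and the file counts and sizes are made-up numbers. Concatenating row groups unchanged leaves one big file that still contains every small row group, whereas rewriting the data re-chunks the rows into a few large row groups.

```python
def merge_without_rewrite(files):
    """Move every row group unchanged into one output file
    (the behaviour described above)."""
    return [[rg for f in files for rg in f]]


def rewrite(files, target_rows):
    """Re-chunk all rows into row groups of up to target_rows rows,
    as rewriting the data through a query engine would."""
    total = sum(rg for f in files for rg in f)
    groups = [target_rows] * (total // target_rows)
    if total % target_rows:
        groups.append(total % target_rows)
    return [groups]


# Hypothetical input: 200 small files, each with one 10,000-row row group.
small_files = [[10_000] for _ in range(200)]

merged = merge_without_rewrite(small_files)
compacted = rewrite(small_files, target_rows=1_000_000)

print("merge:   files =", len(merged), "row groups =", len(merged[0]))
print("rewrite: files =", len(compacted), "row groups =", len(compacted[0]))
```

In the merged case there is a single file to scan, so less parallelism across files, but the reader still pays the per-row-group overhead 200 times.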

Last Visited	‎02-11-2021 06:07 PM

Member Since	‎07-29-2015 04:07 PM
Posts	535
Kudos received	141

Cloudera Community

Re: Impala Queries which were previously working a...

Re: Impala queries are not distributing to all the...

Re: impala - `recover partitions` points to old da...

Re: impala catalog server JVM

Re: Impala - On-demand metadata

Re: Cannot query Hive table created with OpenCSVSe...

Re: Impala running slow after upgrade

Re: Impala max/min slow

Re: impala java heap config

Re: impala query memory limit

Re: Impala max/min slow

Re: Should Impala release memory after use?

Re: Getting impala daemon serves via cloudera rest...

Re: Getting impala daemon serves via cloudera rest...

Re: combine small parquet files