Member since 07-29-2015 | 535 Posts | 141 Kudos Received | 103 Solutions
08-21-2017
11:42 AM
1 Kudo
Impala doesn't support this Hive SerDe. In general, Impala uses its own optimised parsing code instead of Hive's SerDe infrastructure. If you're ingesting data from CSV and using the SerDe to do the conversion, I'd recommend using Hive to do the ETL and convert the data to a more efficient storage format, e.g. Parquet.
08-14-2017
11:49 AM
Do you have a profile for any of the slow queries? I see some HDFS bugs that suggest that there may be problems along these lines when running mismatched versions of the HDFS client and HDFS datanode: https://issues.apache.org/jira/browse/HDFS-8070.
08-11-2017
02:55 PM
It might be from_utc_timestamp(), which is being evaluated as part of the aggregation. It has some known perf issues (IMPALA-1577). It might be faster if you could restructure the query so that you evaluate from_utc_timestamp(max(...)) instead of max(from_utc_timestamp(...)), or otherwise find another way to reduce the number of rows feeding into the aggregation.
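To illustrate why the rewrite helps, here is a minimal Python sketch (not Impala code). `to_local` is a hypothetical stand-in for from_utc_timestamp() that applies a fixed offset and counts its own invocations. The two forms only agree when the conversion preserves ordering, which is true for a fixed offset but may not hold around a daylight-saving "fall back" transition, so check that for your time zone and data.

```python
from datetime import datetime, timedelta

calls = 0  # number of times the conversion function has run


def to_local(ts_utc, offset_hours=-8):
    """Hypothetical stand-in for from_utc_timestamp(): fixed-offset shift."""
    global calls
    calls += 1
    return ts_utc + timedelta(hours=offset_hours)


# 1000 made-up UTC timestamps, one per minute.
rows = [datetime(2017, 8, 11) + timedelta(minutes=i) for i in range(1000)]

# max(from_utc_timestamp(...)): the conversion runs once per row.
calls = 0
slow = max(to_local(ts) for ts in rows)
slow_calls = calls

# from_utc_timestamp(max(...)): the conversion runs once in total.
calls = 0
fast = to_local(max(rows))
fast_calls = calls

print(slow == fast, slow_calls, fast_calls)
```

With a fixed offset both forms return the same value, but the second one calls the conversion once instead of once per row.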
08-11-2017
08:15 AM
Llama is deprecated and shouldn't be in use - you should be safe to ignore that option.
08-11-2017
08:13 AM
1) If you didn't set a memory limit for the query, then the query may expand up to the process memory limit, i.e. the query memory limit is effectively the process memory limit. This is a pretty bad configuration for concurrent queries, since the queries end up fighting it out for memory.

2) To answer your CM question directly, you can get the relevant metrics from the timeseries API. tcmalloc_physical_bytes_reserved_across_impalads is the process consumption and mem_tracker_process_limit_across_impalads is the limit. If you paste this into the Chart Builder you can see the averages of the two:

    SELECT tcmalloc_physical_bytes_reserved_across_impalads, mem_tracker_process_limit_across_impalads WHERE entityName = "IMPALA-1" AND category = SERVICE

I'm wondering, though, if setting up admission control with resource pools and default query memory limits would solve your problem better than a custom solution: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html
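If you want to pull the same numbers programmatically rather than through the Chart Builder, something along these lines should work. This is only a sketch: the host name is hypothetical, port 7180 and API version v15 are assumptions about your CM install, and the code only builds the request URL (you would still need to send it with your CM credentials).

```python
from urllib.parse import urlencode

# The same tsquery as in the Chart Builder example above.
TSQUERY = (
    "SELECT tcmalloc_physical_bytes_reserved_across_impalads, "
    "mem_tracker_process_limit_across_impalads "
    'WHERE entityName = "IMPALA-1" AND category = SERVICE'
)


def timeseries_url(cm_base, tsquery, api_version="v15"):
    """Build a GET URL for the CM timeseries endpoint.

    cm_base and api_version are assumptions; adjust them to your deployment.
    """
    params = urlencode({"query": tsquery})  # URL-encodes spaces, quotes, '='
    return f"{cm_base}/api/{api_version}/timeseries?{params}"


# Hypothetical CM host, for illustration only.
url = timeseries_url("http://cm-host.example.com:7180", TSQUERY)
print(url)
```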
08-11-2017
08:02 AM
Please feel free to post the profile. Compression may make it faster or slower, depending on whether the bottleneck for the query is CPU or IO.
08-11-2017
07:55 AM
1 Kudo
It's normal for idle Impala daemons to hold onto 1-2GB of memory plus the JVM heap memory. Any more than that may indicate something wrong. The memz debug page has diagnostics for this: http://impala-daemon:25000/memz?detailed=true . You can see there if a fragment of a query is holding onto memory. There are two common ways this can happen:

* The query wasn't cancelled and closed. In this case it should show up on the /queries page of the coordinator.
* The query was cancelled and closed, but a fragment continued running (this is a bug). In that case you should see a running fragment-execution thread in /threadz on the impala daemon web page.
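If you need to check several daemons, a small helper like this can print the debug pages mentioned above for each host. It is just a sketch: the function name is made up, and port 25000 is taken from the URL above, so change it if your daemons use a different web UI port.

```python
def debug_urls(host, port=25000):
    """Return the debug web pages discussed above for one Impala daemon."""
    base = f"http://{host}:{port}"
    return {
        "memory": f"{base}/memz?detailed=true",  # memory diagnostics
        "queries": f"{base}/queries",            # queries not yet closed (coordinator)
        "threads": f"{base}/threadz",            # running fragment-execution threads
    }


# "impala-daemon" is the placeholder host name from the URL above.
for name, url in debug_urls("impala-daemon").items():
    print(f"{name}: {url}")
```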
08-08-2017
01:55 PM
Great to hear! CM's time series support is really pretty powerful.
08-08-2017
09:08 AM
1 Kudo
If the metrics are available in CM (e.g. if they're in the Impala Charts Library in CM, or you can query them in CM) then you should be able to get them through the CM timeseries API: https://cloudera.github.io/cm_api/apidocs/v15/path__timeseries.html

CM is capable of collecting all of Impala's metrics from the /metrics page, but it currently only collects a subset of them (the ones we think are most important for cluster monitoring). I'd be interested in knowing if there are useful metrics that are not currently collected by CM, and particularly in understanding the kind of high-level questions that you're trying to answer with metrics, so I can understand if there are metrics that could be added to Impala.
07-12-2017
02:46 PM
3 Kudos
@barnoba we strongly recommend *not* to use parquet-tools merge unless you really know what you're doing. It is known to cause some pretty bad performance problems in some cases. The problem is that it takes the row groups from the existing file and moves them unmodified into a new file - it does *not* merge the row groups from the different files. This can actually give you the worst of both worlds - you lose parallelism because the files are big, but you have all the performance overhead of processing many small row groups.
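A toy model of the difference, in Python. This does not call parquet-tools or read any Parquet files; it just represents each file as a list of row-group sizes in MB (all numbers are made up) and compares copying row groups unmodified with actually rewriting the data into larger row groups.

```python
def concat_row_groups(files):
    """Model of the merge described above: row groups are copied as-is."""
    return [rg for f in files for rg in f]


def rewrite_row_groups(files, target_mb):
    """Model of a real rewrite: data is repacked into row groups of up to target_mb."""
    total = sum(rg for f in files for rg in f)
    full, rest = divmod(total, target_mb)
    return [target_mb] * full + ([rest] if rest else [])


# Four small files with 8 MB row groups (hypothetical sizes).
small_files = [[8, 8], [8], [8, 8, 8], [8, 8]]

merged = concat_row_groups(small_files)
rewritten = rewrite_row_groups(small_files, target_mb=256)

print("concatenated:", len(merged), "row groups of sizes", merged)
print("rewritten:   ", len(rewritten), "row groups of sizes", rewritten)
```

In the concatenated case you end up with one big file that still contains every small row group, which is the "worst of both worlds" situation described above.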