Member since: 07-29-2015
Posts: 535
Kudos Received: 141
Solutions: 103
05-26-2017
10:43 AM
1 Kudo
In the summary you can see that the average time in Node 0 is 2s973ms. The time in scan nodes is the time spent waiting for scanner threads to return a row batch, so that looks like the bottleneck. Drilling down into the profile for that node (HDFS_SCAN_NODE (id=0):(Total: 2s973ms, non-child: 2s973ms, % non-child: 100.00%)) we can see that it scanned a single Parquet file and spent almost all of that time materializing tuples (MaterializeTupleTime). So there are two things going on:
1. Tuple materialization is expensive.
2. There's no parallelism available, because there's only a single Parquet file in the partition.
Usually for large tables having large files is a good thing, because there's less overhead, but in this case you're losing out on some parallelism. We have a lot of changes in the pipeline at various stages to improve the first part, either by making it more efficient or by getting better at skipping over data. There's lots of room for improvement. Unfortunately that doesn't help you immediately.
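If you want more scan parallelism in the meantime, one workaround is to rewrite the partition as several smaller Parquet files so that multiple scanner threads can work on it. A rough sketch, with hypothetical table and column names, assuming the PARQUET_FILE_SIZE query option is available in your release:
-- write roughly 128MB files so a single writer rolls over to multiple files
SET PARQUET_FILE_SIZE=134217728;
-- rewrite the partition from a staging copy of the data
INSERT OVERWRITE sales PARTITION (day='2017-05-26')
SELECT id, amount FROM sales_staging WHERE day='2017-05-26';
After the rewrite, the scan of that partition can be split across as many scanner threads as there are files.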
05-25-2017
02:20 PM
1 Kudo
That query probably has multiple big joins and aggregations and needs more memory to complete. A very rough rule of thumb for minimum memory in releases CDH5.9-CDH5.12 is the following:
- For each hash join: the minimum of 150MB or the amount of data on the right side of the node (e.g. if you have a few thousand rows on the right side, maybe a MB or two).
- For each merge aggregation: the minimum of 300MB or the size of the grouped data in memory (e.g. if you only have a few thousand groups, maybe a MB or two).
- For each sort: about 50-60MB.
- For each analytic: about 20MB.
If you add all those up and add another 25%, you'll get a ballpark number for how much memory the query will require to execute. I'm working on reducing those numbers and on making the system give a clearer yes/no answer on whether it can run the query before it starts executing.
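To make that concrete, here is an illustrative calculation for a hypothetical query shape (not taken from any specific plan): a query with two hash joins with large right-hand inputs, one merge aggregation with many groups, and one sort would need roughly
2 x 150MB (hash joins) + 300MB (merge aggregation) + 60MB (sort) = 660MB
and adding the 25% margin gives about 825MB as a ballpark minimum for the query.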
05-25-2017
10:09 AM
Hi Krishnat, It depends on the complexity of the query - that number is per-plan-node, not global. It may need more memory if there are a lot of operators in the plan. Hard to know without seeing the plan or execution summary (or both). Historically there were various bugs that might result in this happening in certain cases but I believe all the fixes landed in 5.10. I agree that the message and behaviour could be a lot more helpful/actionable - improving spill-to-disk is actually my primary focus right now - I'm very excited about the changes we have in the pipeline.
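In the meantime, if the query really does need more memory and your admission control setup allows per-query overrides, one possible workaround (a sketch; the right value depends on your cluster and the plan) is to raise the per-query memory limit in impala-shell before running the statement:
SET MEM_LIMIT=4gb;
-- re-run the query, then check the profile to see the peak memory it actually used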
05-10-2017
08:28 AM
I would suggest looking at the execution summary or profile to understand where the time is going. The progress only measures progress of the table scans, so this is consistent with the time being spent in joins (or other operations) after the table scans. You probably just have a very large join. It could be that the join order in the query plan is not optimal, or maybe you're just running the query on too much data for the cluster size. Usually the troubleshooting steps are something like:
1. Look at the execution summary to get a high-level idea of where time is going and how many rows are flowing through the plan. It may be obvious - e.g. if a join with a lot of input rows is using a lot of time.
2. If the problem is just the number of input rows, check if anything looks fishy with the plan (e.g. missing stats, joins where the right-hand input is much larger than the left-hand input, "exploding" joins that produce many more output rows than input rows).
3. If you can't tell why things are slow from the execution summary, look at the query profile to drill down into where the time went (this can be hard to interpret sometimes).
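As a concrete sketch of those steps in impala-shell (the table and column names here are hypothetical):
-- missing statistics are a common cause of bad join orders, so compute them first
COMPUTE STATS orders;
COMPUTE STATS customers;
-- check the plan before running the query
EXPLAIN SELECT c.name, SUM(o.total) FROM orders o JOIN customers c ON o.cust_id = c.id GROUP BY c.name;
-- after running the query, inspect where the time went
SUMMARY;
PROFILE;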
04-20-2017
10:27 PM
Does the table have a lot of columns or anything unusual like that?
04-18-2017
05:38 PM
On the Impala dev team we do plenty of testing on machines with 16GB-32GB RAM (e.g. my development machine has 32GB RAM). So Impala definitely works with that amount of memory. It's just that with that amount of memory it's not too hard to run into capacity problems if you have a reasonable number of concurrent queries with larger data sizes or more complex queries. It sounds like maybe the smaller memory instances work well for your workload.
04-12-2017
08:10 AM
Oh and in case anyone else reads this and wants to know if they're hitting a similar issue, you can tell that the sort will be slow because it's using a lot of memory - 47.86GB. It takes a while to sort that much data.
04-12-2017
08:08 AM
2 Kudos
Another way to get results quicker is to add a limit to the query. "ORDER BY" without limit has this property where it doesn't return any rows until it has done most of the work of sorting the data. The reason it looked like the join was taking all the time is that the sort timer isn't updated until the first phases of the sort are complete. If the query had run to completion it would have had the correct time. I filed an issue to investigate making the sort timer update more frequently, so that it's easier to understand why a running query is slow: https://issues.apache.org/jira/browse/IMPALA-5200
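For example (hypothetical table and column names), a query like
SELECT id, amount FROM sales ORDER BY amount DESC LIMIT 100;
will start returning rows much sooner than the same query without the LIMIT, because Impala can use a top-N operator that only keeps the top rows instead of sorting the entire input.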
04-03-2017
04:44 PM
AlonEdi: the incremental stats change should be in CDH5.10. Did you have trouble using it?
04-03-2017
04:39 PM
1 Kudo
CDH5.10 has essentially all of the Impala 2.8 improvements in it, as mentioned earlier in the thread. Lars can confirm, but I don't believe that the "SORT BY" fix made it into either Impala 2.8 or CDH5.10; I think it got pushed out to the next release. I think the docs are incorrect: https://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_new_features.html#new_features_280