Created on 04-02-2017 09:54 PM - edited 09-16-2022 04:23 AM
Hi All,
I am running on a POC environment where there are only one name node and one data node running. Impala daemon is running on data node. Both of the nodes have 128GB memory each. I had set the mem_limit to 60GB.
I had two big tables in Impala. First table (sales) has around 635 million records while second table (product) is around 250000 records. I inner join this 2 tables using a common parameter. The example SQL statement is as the following:
select a.t_date, b.p_date from sales a inner join product b on a.product_id=b.product_id order by a.t_date desc
The above query ran for a long time and after few hours, it still didn't return any result. When i use EXPLAIN, it showed Estimated Per-Host Requirements: Memory=992.03MB VCores=2. I am wondering why it took so long. Is this related to mem_limit settings? How can i tune such query?
Created 04-12-2017 08:10 AM
Oh and in case anyone else reads this and wants to know if they're hitting a similar issue, you can tell that the sort will be slow because it's using a lot of memory - 47.86GB. It takes a while to sort that much data.
Created on 04-12-2017 09:00 PM - edited 04-12-2017 09:10 PM
Column a.t_date is a string field, not a timestamp field. The two tables are Parquet file format.
By adding more nodes into the cluster, more Impala Daemon are running, can we aspect the performance for such query will be improve?