Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Impala join query running slow

avatar
Explorer

Hi All, 

 

I am running on a POC environment where there are only one name node and one data node running. Impala daemon is running on data node. Both of the nodes have 128GB memory each. I had set the mem_limit to 60GB.

 

I had two big tables in Impala. First table (sales) has around 635 million records while second table (product) is around 250000 records. I inner join this 2 tables using a common parameter. The example SQL statement is as the following:

 

select a.t_date, b.p_date from sales a inner join product b on a.product_id=b.product_id order by a.t_date desc

 

The above query ran for a long time and after few hours, it still didn't return any result. When i use EXPLAIN, it showed Estimated Per-Host Requirements: Memory=992.03MB VCores=2. I am wondering why it took so long. Is this related to mem_limit settings? How can i tune such query?

11 REPLIES 11

avatar

Oh and in case anyone else reads this and wants to know if they're hitting a similar issue, you can tell that the sort will be slow because it's using a lot of memory - 47.86GB. It takes a while to sort that much data.

avatar
Explorer

Column a.t_date is a string field, not a timestamp field. The two tables are Parquet file format.

 

By adding more nodes into the cluster, more Impala Daemon are running, can we aspect the performance for such query will be improve?