03-07-2019 04:33 AM - edited 03-07-2019 04:38 AM
While running a heavy query in Hue (against a view that joins 5 tables), I checked the impalad web interface (Query Details / Backends) to verify whether all available data nodes were being used (we have 3 data nodes).
I got the following information (only one row):
Host: datanode1.com:22000
Num. Instances: 14
Num. remaining instances: 2
Done: false
Peak Memory Consumption: 1932005071
Does this mean that the query was using only one of the data nodes?
03-07-2019 09:12 AM
The query profile and/or execution summary is the best reference for this. Parallelism for Parquet files depends on the number of HDFS blocks (which is usually the same as the number of Parquet files), so if your tables only have one HDFS block each you may not get parallelism.
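As a rough illustration of the point about blocks, here is a minimal sketch of the upper bound on scan parallelism. It is a simplified model and not Impala's actual scheduler: it only assumes that each HDFS block is handed to exactly one host, so a table cannot be scanned by more hosts than it has blocks. The function name `max_scan_hosts` is hypothetical.

```python
def max_scan_hosts(num_blocks: int, num_nodes: int) -> int:
    """Upper bound on the number of hosts that can scan one table.

    Assumption (simplified model): every HDFS block is a single unit of
    scan work assigned to exactly one host, so the number of scanning
    hosts cannot exceed the number of blocks or the number of nodes.
    """
    if num_blocks < 0 or num_nodes < 0:
        raise ValueError("counts must be non-negative")
    return min(num_blocks, num_nodes)


# A table with a single block on a 3-node cluster: no scan parallelism.
print(max_scan_hosts(1, 3))   # 1
# A table with 12 blocks on the same cluster: all 3 nodes can take part.
print(max_scan_hosts(12, 3))  # 3
```

This is only an upper bound. Where the work actually runs also depends on the scheduler and on where the blocks are stored, which is why the query profile remains the reference.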
03-08-2019 07:18 AM
In the query profile and/or execution summary I can see the steps into which my query is divided, but I can't find any reference to which Impala daemons / data nodes will be used to execute them.
While the query is being executed it shows that a total of 636 blocks are being scanned.
In Query Details (Backends and Fragment Instances), I can see several instances running for that query, but all on the same Impala daemon / data node. I expected them to be assigned to different impalad / data nodes, so I suspect that not all resources are being used.