New Contributor
Posts: 2
Registered: ‎03-07-2019

How to be sure that a query job is executed in parallel in the available data nodes?

Hi.

While running a heavy query in Hue (against a view that joins 5 different tables), I checked the impalad web interface (Query Details / Backends) to verify whether we are using all the available data nodes (we have 3 data nodes available).

I got the following information (only one row):

Host                  Num. Instances   Num. remaining instances   Done    Peak Memory Consumption
datanode1.com:22000   14               2                          false   1932005071

Does this mean that the query was using only one of the data nodes?

Thanks

Nuno

Cloudera Employee
Posts: 431
Registered: ‎07-29-2015

Re: How to be sure that a query job is executed in parallel in the available data nodes?

The query profile and/or execution summary is the best reference for this. Parallelism for Parquet files depends on the number of HDFS blocks (which is usually the same as the number of Parquet files), so if your tables only have one HDFS block each you may not get parallelism.
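
As a rough illustration of reading the execution summary for this purpose, the sketch below pulls the #Hosts value for each operator out of summary text and flags scans that ran on a single host. The sample text is hypothetical (made-up operators and timings, trimmed to a few columns), the row layout of "operator label followed by #Hosts" is an assumption that may differ between Impala versions, and the function names are illustrative only.

```python
import re

# Hypothetical execution summary, trimmed to the first few columns.
# These rows are invented for illustration; they are not from the query above.
SAMPLE_SUMMARY = """\
Operator          #Hosts   Avg Time   Max Time
05:EXCHANGE            1   40.00us    40.00us
04:HASH JOIN           3    1.20s      1.50s
|--03:EXCHANGE         3   10.00us    12.00us
|  01:SCAN HDFS        1    2.00s      2.00s
00:SCAN HDFS           3    3.10s      3.40s
"""

# Assumed row shape: optional tree prefix ("|", "--", spaces), then an
# operator label such as "00:SCAN HDFS", then the #Hosts integer.
ROW = re.compile(r"^[\s|\-]*(\d+:[A-Z ]+?)\s+(\d+)\s")


def hosts_per_operator(summary_text):
    """Map each operator label to the #Hosts value on its row."""
    result = {}
    for line in summary_text.splitlines():
        match = ROW.match(line)
        if match:
            result[match.group(1)] = int(match.group(2))
    return result


def single_host_scans(summary_text):
    """Return the scan operators that report only one host."""
    hosts = hosts_per_operator(summary_text)
    return [op for op, n in hosts.items() if "SCAN" in op and n == 1]


if __name__ == "__main__":
    for op, n in hosts_per_operator(SAMPLE_SUMMARY).items():
        print(f"{op}: {n} host(s)")
    print("Scans on a single host:", single_host_scans(SAMPLE_SUMMARY))
```

If every scan in a real summary reports one host, that would be consistent with the tables having too few blocks (or too few daemons being eligible) to spread the work.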

New Contributor
Posts: 2
Registered: ‎03-07-2019

Re: How to be sure that a query job is executed in parallel in the available data nodes?

Hi.

In the query profile and execution summary I can see the steps into which my query is divided, but I can't find any reference to which Impala daemons / data nodes will be used to execute them.

While the query is being executed it shows that a total of 636 blocks are being scanned.

In Query Details (Backends and Fragment Instances), I can see that there are several instances running for that query, but all on the same Impala daemon / data node. I expected them to be assigned to the different impalad / data nodes, so I suspect that not all the resources are being used.

Thanks