
How to be sure that a query job is executed in parallel in the available data nodes?


New Contributor

Hi.

When running a heavy query (over a view that joins 5 different tables) in Hue, I checked the impalad web interface (Query Details / Backends) to verify whether we are using all the available data nodes (we have 3 data nodes available).

I got the following information (only one row):

Host                 Num. Instances  Num. remaining instances  Done   Peak Memory Consumption
datanode1.com:22000  14              2                         false  1932005071

 

Does this mean that the query was using only one of the data nodes?

Thanks

Nuno


Re: How to be sure that a query job is executed in parallel in the available data nodes?

Master Collaborator

The query profile and/or execution summary is the best reference for this. Parallelism for Parquet files depends on the number of HDFS blocks (which is usually the same as the number of Parquet files), so if your tables only have one HDFS block each you may not get parallelism.
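As a rough illustration of what to look for: the execution summary (for example, from the SUMMARY command in impala-shell after the query has run) lists a #Hosts value for each operator. Below is a minimal Python sketch that pulls that column out of a saved summary, assuming the usual layout where each row starts with an operator such as "00:SCAN HDFS" followed by the host count. The sample rows are invented for illustration and are not from the query in this thread, and the function names are hypothetical.

```python
import re

# One summary row: optional tree prefix ("|--", "|  "), then "NN:OPERATOR NAME",
# then the #Hosts integer. This layout is an assumption about the saved text.
ROW = re.compile(r"^[\s|\-]*(?P<id>\d+):(?P<name>[A-Z][A-Z ]*?)\s+(?P<hosts>\d+)\s")


def hosts_per_operator(summary_text):
    """Return {"NN:OPERATOR": hosts} for every operator row in the summary."""
    result = {}
    for line in summary_text.splitlines():
        m = ROW.match(line)
        if m:
            result[f"{m['id']}:{m['name']}"] = int(m["hosts"])
    return result


def single_host_scans(summary_text):
    """List scan operators that ran on exactly one host."""
    hosts = hosts_per_operator(summary_text)
    return [op for op, n in hosts.items() if "SCAN" in op and n == 1]


# Invented sample, only to show the shape of the input.
SAMPLE_SUMMARY = """\
Operator            #Hosts   Avg Time   Max Time   #Rows
04:EXCHANGE              1     0.10ms     0.10ms      10
02:HASH JOIN             3     5.00ms     6.00ms      10
|--03:EXCHANGE           3     0.05ms     0.06ms       5
|  01:SCAN HDFS          1     2.00ms     2.00ms       5
00:SCAN HDFS             3     9.00ms    11.00ms     100
"""

if __name__ == "__main__":
    print(hosts_per_operator(SAMPLE_SUMMARY))
    print(single_host_scans(SAMPLE_SUMMARY))
```

A scan operator that reports a single host is the kind of row that would point to a table with too few blocks to spread across the cluster.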

Re: How to be sure that a query job is executed in parallel in the available data nodes?

New Contributor

Hi.

In the query profile and/or execution summary I can see the several steps into which my query is divided, but I don't find any reference to which Impala daemons / data nodes will be used to execute them.

While the query is being executed it shows that a total of 636 blocks are being scanned.

In Query Details (Backends and Fragment Instances), I can see that there are several instances running for that query, but all on the same Impala daemon / data node. I was expecting to see them assigned to the different impalad / data nodes, so I suspect that not all the resources are being used.
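One way to make this check repeatable is to compare the hosts listed in the Backends table against the hosts that are expected to take part. This is a minimal Python sketch, assuming the table has been copied as whitespace-separated text with the host in the first column and the instance count in the second, as in the row from the original post. The second and third host names are hypothetical placeholders, as are the function names.

```python
def backend_hosts(table_text):
    """Map host:port -> number of fragment instances from copied Backends rows."""
    hosts = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows start with host:port followed by an integer instance count;
        # header lines and blank lines do not match and are skipped.
        if len(fields) >= 2 and ":" in fields[0] and fields[1].isdigit():
            hosts[fields[0]] = int(fields[1])
    return hosts


def missing_hosts(table_text, expected):
    """Return expected hosts that do not appear in the Backends rows."""
    seen = backend_hosts(table_text)
    return sorted(set(expected) - set(seen))


# Row taken from the original post.
ROWS = "datanode1.com:22000  14  2  false  1932005071"

# Only datanode1 appears in the thread; the other two names are hypothetical.
EXPECTED = [
    "datanode1.com:22000",
    "datanode2.com:22000",
    "datanode3.com:22000",
]

if __name__ == "__main__":
    print(backend_hosts(ROWS))
    print(missing_hosts(ROWS, EXPECTED))
```

With the single row above, the two placeholder hosts are reported as missing, which matches the observation that every instance was placed on one daemon.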

Thanks