I have a Hive insert statement which by default will use all available resources in YARN, as it is reading a large volume of data.
I am happy for the query to take longer and use fewer resources so that other users can also have access to compute resources.
I don't want to set up YARN queues, as this is an unusual query and I don't want to permanently restrict the cluster.
If I were using Spark, I could do this quite easily by setting the number of executors. Is there a Hive config that allows me to do this at the query level?
I have looked at various other posts such as those below, but nothing seems to allow this.
I have also seen this: https://community.cloudera.com/t5/Support-Questions/How-are-number-of-mappers-determined-for-a-query... but I am not sure whether changing split sizes is a good idea. Would this then impact the structure of the data stored by my query?
Grateful for any suggestions.
Created 11-23-2021 12:53 PM
Hi @Andyjmoss
As you already pointed out, see https://community.cloudera.com/t5/Support-Questions/How-are-number-of-mappers-determined-for-a-query...
There is no per-query limit; you can only adjust the max and min grouping sizes to influence the number of mapper tasks.
Would this then impact the structure of data stored by my data?
No, this only affects how much data each map task will get.
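As a rough illustration of the grouping-size approach, here is a minimal sketch that works out a grouping size for a target number of map tasks and prints the matching SET statements. It assumes the query runs on Tez and that the relevant properties are `tez.grouping.min-size` and `tez.grouping.max-size`; the helper name `grouping_settings` and the example input size and task count are hypothetical. Tez considers other factors when grouping splits, so the real task count may differ, and this does not cap how many containers run at once.

```python
import math


def grouping_settings(total_input_bytes: int, target_tasks: int) -> list[str]:
    """Return Hive SET statements aiming for roughly `target_tasks` map tasks.

    Hypothetical helper: assumes the Tez engine and that pinning min and max
    grouping size to the same value yields about total / size map tasks.
    """
    if total_input_bytes <= 0 or target_tasks <= 0:
        raise ValueError("total_input_bytes and target_tasks must be positive")
    # Bytes each map task should receive to land near the target task count.
    group_bytes = math.ceil(total_input_bytes / target_tasks)
    return [
        f"SET tez.grouping.min-size={group_bytes};",
        f"SET tez.grouping.max-size={group_bytes};",
    ]


if __name__ == "__main__":
    # Example values only: 1 TiB of input, aiming for about 200 map tasks.
    for stmt in grouping_settings(1024**4, 200):
        print(stmt)
```

The printed statements would be run in the same Hive session immediately before the insert, so they apply only to that session and not to the cluster as a whole.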
Created 11-26-2021 01:25 AM
Thanks @rpathak. Having discussed this further within our team, we think we are going to try setting up elastic YARN queues to help with this situation.