Created 08-22-2023 02:39 AM
Hi,
I face the following problem.
I have a service user, say "service_dwh", used by our data warehouse, which heavily queries our data reservoir using Hive queries.
I have had cases where, because of the query itself and/or missing statistics, a single Hive query took 100% of the resources available to the "service_dwh" user.
I couldn't find a way, using the Capacity Scheduler, queues and the user limit factor, to prevent a single application from taking all the resources for a very long time.
Traditional DBMSs have mechanisms that throttle a job's resources based on the job's duration.
That way a long (big) job can't monopolize resources for too long at the expense of new and potentially shorter jobs.
Created 09-05-2023 03:46 AM
Please refer to the below articles and see if this is what you are looking for:
[1] https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/yarn-allocate-resources/topics/yarn-configure...
[2] https://blog.cloudera.com/yarn-capacity-scheduler/#:~:text=User%20Limit%20Factor%20is%20a,minimum%20....
Let me know if this helps.
Cheers!
Created 09-07-2023 12:32 AM
Hi tj,
I already knew of these options, but sadly, as mentioned in my post, all the queries are run by the same service user.
Hence I have no way to use the user limit factor.
I would like to have a query limit factor in Hive, or some way to prevent one query from using too much capacity even if it is available.
Gael
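To illustrate the limitation described above, here is a minimal Python sketch of the commonly documented upper bound for a single user in a Capacity Scheduler queue. It is a simplification (the real scheduler also takes the minimum user limit percent and the number of active users into account), and the function name and the percentages are hypothetical, not values from this cluster.

```python
def user_ceiling_pct(queue_capacity_pct, user_limit_factor, queue_max_capacity_pct=100.0):
    """Simplified upper bound (in % of cluster) for ONE user in a queue.

    Assumption: ceiling = queue capacity * user limit factor,
    capped by the queue's maximum capacity.
    """
    return min(queue_capacity_pct * user_limit_factor, queue_max_capacity_pct)


# Hypothetical queue: 40% guaranteed, 60% maximum, user limit factor of 2.
ceiling = user_ceiling_pct(40.0, 2.0, 60.0)
print(ceiling)
```

The bound is computed per user, not per application. When every query runs as the same service user, nothing in this calculation stops a single query from consuming the whole ceiling.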
Created 10-19-2023 04:39 PM
Why not use a resource pool and sub-pools? If the problem comes from a specific query, create a dedicated resource pool or sub-pool for it and pass that pool when you run the query.
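The suggestion above could be sketched as follows, assuming Hive runs on Tez, where the session property tez.queue.name selects the YARN queue. The helper name and the queue name "dwh_heavy" are hypothetical; the queue would have to exist in the scheduler configuration with a capped maximum capacity, and whether a per-session queue change is honored may depend on how HiveServer2 is configured.

```python
def route_to_queue(query: str, queue: str) -> str:
    """Prefix a HiveQL statement with a session-level queue assignment.

    Assumption: Hive on Tez, where `tez.queue.name` picks the YARN queue.
    """
    statement = query.strip().rstrip(";")
    return f"SET tez.queue.name={queue};\n{statement};"


# Hypothetical heavy query sent to a hypothetical capped queue.
script = route_to_queue("SELECT COUNT(*) FROM big_table", "dwh_heavy")
print(script)
```

The resulting script can be submitted through whatever client the data warehouse already uses, so only the known heavy queries are confined to the capped queue while the rest keep their usual queue.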