YARN allocates the same CPU and memory resources to every task of a job.
We have a use case where a small task needs 500 MB of memory while a bigger task needs 12 GB. Is there any way to request different resources for different tasks of the same job?
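For context, stock MapReduce sizes its containers per job, not per task, through job-level configuration keys such as `mapreduce.map.memory.mb` (these property names are the standard MapReduce ones; the `make_jobconf` helper below is hypothetical, just to illustrate the shape of the workaround of submitting differently sized work as separate jobs):

```python
def make_jobconf(map_memory_mb, reduce_memory_mb):
    """Build the job-level properties that control MapReduce container sizes.

    Hypothetical helper for illustration; the keys themselves are the
    standard MapReduce configuration properties.
    """
    return {
        "mapreduce.map.memory.mb": map_memory_mb,
        "mapreduce.reduce.memory.mb": reduce_memory_mb,
        # The JVM heap is conventionally set somewhat below the container size.
        "mapreduce.map.java.opts": "-Xmx%dm" % int(map_memory_mb * 0.8),
    }

small_job = make_jobconf(512, 512)      # for the ~500 MB tasks
big_job = make_jobconf(12288, 12288)    # for the ~12 GB tasks
```

Because these are per-job settings, `small_job` and `big_job` would have to be two separate submissions, which is exactly the limitation the question is about.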
Thanks @Harsh J
Yes, it is not possible to parallelize the larger tasks further; that is a business constraint.
Right now the tasks of a job need anywhere from 500 MB to 12 GB, so we are allocating 12 GB per task and cannot utilize the cluster effectively. We are using DRF, and memory is the dominant resource. All three Fair Scheduler policies are memory based; is there any custom CPU-only policy?
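To make the "memory is dominant" point concrete, here is a sketch of the dominant-share computation DRF performs, next to the ordering a hypothetical CPU-only policy would use instead (the cluster and app numbers are made up for illustration):

```python
def dominant_share(used_mem_mb, used_vcores, cluster_mem_mb, cluster_vcores):
    """DRF ranks applications by their largest per-resource share."""
    mem_share = used_mem_mb / cluster_mem_mb
    cpu_share = used_vcores / cluster_vcores
    return max(mem_share, cpu_share)

def cpu_only_share(used_vcores, cluster_vcores):
    """What a hypothetical CPU-only fairness policy would rank by."""
    return used_vcores / cluster_vcores

# Illustrative cluster: 100 GiB of memory, 40 vcores.
# An app holding two 12 GiB containers with 1 vcore each:
drf = dominant_share(2 * 12288, 2, 102400, 40)  # max(0.24, 0.05) = 0.24
cpu = cpu_only_share(2, 40)                     # 0.05
# Under DRF the app already looks 24% "full" because of its memory share;
# a CPU-only policy would see it at 5% and keep scheduling it more work.
```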
Is it possible to hack the YARN code, rewrite the scheduler, etc.? Is that feasible? We have knowledge (it can be a database table) of how much memory each task will take.
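One client-side alternative to patching the scheduler, given that per-task memory is known ahead of time (e.g. from that database table), is to bucket the tasks into a few memory tiers and submit one job per tier. A rough sketch; the tier boundaries and task names are invented:

```python
# Hypothetical per-task memory knowledge, e.g. loaded from a database table.
task_memory_mb = {"t1": 500, "t2": 480, "t3": 11800, "t4": 900, "t5": 12000}

TIERS_MB = [1024, 2048, 12288]  # container sizes we are willing to run

def tier_for(mem_mb):
    """Smallest tier that fits the task's known memory need."""
    for tier in TIERS_MB:
        if mem_mb <= tier:
            return tier
    raise ValueError("task needs more memory than the largest tier")

def bucket(tasks):
    """Group task names by the container tier they should run in."""
    jobs = {}
    for name, mem in tasks.items():
        jobs.setdefault(tier_for(mem), []).append(name)
    return jobs

jobs = bucket(task_memory_mb)
# -> {1024: ['t1', 't2', 't4'], 12288: ['t3', 't5']}
# One job per tier, each submitted with mapreduce.map.memory.mb set to
# its tier, so the small tasks no longer reserve 12 GB containers.
```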
>>> Are you seeing a measurable impact of the higher memory requests in terms of concurrency? Since the smaller data sizes (500 MiB in your example) require lower memory, they should be completing quicker too - perhaps that helps compensate the higher requests?
Yes, we thought about this, and it is what is happening on the cluster right now: we are overcommitting the memory so that all the vCores are used.
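The vCore-utilization effect of those 12 GB requests is easy to quantify. Assuming a node with 16 vcores and 48 GiB advertised to YARN (illustrative numbers, not from the thread), a sketch:

```python
def containers_per_node(node_mem_mb, node_vcores,
                        container_mem_mb, container_vcores=1):
    """Containers that fit on one node: whichever resource runs out first."""
    return min(node_mem_mb // container_mem_mb,
               node_vcores // container_vcores)

NODE_VCORES = 16
REAL_MEM_MB = 48 * 1024       # physical memory on the node
CONTAINER_MB = 12 * 1024      # worst-case task size requested for every task

honest = containers_per_node(REAL_MEM_MB, NODE_VCORES, CONTAINER_MB)
# honest == 4: only 4 of 16 vcores are busy when every container asks 12 GiB.

# Overcommitting: advertise 4x the physical memory to YARN so that memory
# stops being the bottleneck and every vcore gets a container.
overcommitted = containers_per_node(4 * REAL_MEM_MB, NODE_VCORES, CONTAINER_MB)
# overcommitted == 16, relying on most tasks actually using far less memory.
```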
I will take a look at the MapReduce client-side code you pointed to.
Thanks for the help.