Created 08-05-2022 03:14 PM
Basically, I'm attempting to create a standalone Yarn application that runs code on hdfs files on a block level for validation purposes. First I will need to get where the different blocks are distributed (hdfs file metadata), and will run validation code on every block and will combine the results to determine if the file is valid or not. I want to minimize/eliminate network overhead by making the application validation code run on the same node on which the block resides (data locality).
So, for instance, a 1 GB file could be divided into 10 blocks. In order to run the same validation code on all blocks in parallel, I would want to run 10 different instances.
My thinking is, I would launch a single ApplicationMaster with 10 containers.
Does my approach make sense, and if not, what do you suggest I change in my approach? And if it does make sense, how do I launch 10 different instances while (possibly) determining 10 different hosts for every container.
Created 08-07-2024 03:58 AM
Hi @husseljo ,
Application master will be a single container that will run as AM.
Capacity Scheduler leverages Delay Scheduling to honor task locality constraints. There are three levels of locality constraint: node-local, rack-local, and off-switch. The scheduler counts the number of missed opportunities when the locality cannot be satisfied and waits for this count to reach a threshold before relaxing the locality constraint to the next level. You can configure this threshold using the Node Locality Delay (yarn.scheduler.capacity.node-locality-delay) and Rack Locality Additional Delay (yarn.scheduler.capacity.rack-locality-additional-delay) fields
Created 08-16-2024 09:06 AM
Hi @husseljo ,
Please mark this "Accept as solution" if you find my answer helped you.