I'm attempting to build a standalone YARN application that validates HDFS files at the block level. First I need to find out where the file's blocks are located (from the HDFS file metadata), then run validation code on every block and combine the per-block results to decide whether the file as a whole is valid. To minimize or eliminate network overhead, I want each container's validation code to run on the node where its block resides (data locality).
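The combine step I have in mind is a simple conjunction: the file is valid only if every block passes. A minimal sketch of that reduction (the `BlockResult` type and the expected-count check are my own assumptions, not Hadoop API):

```java
import java.util.List;

public class FileValidator {
    // Hypothetical per-block result reported back by each container.
    record BlockResult(int blockIndex, boolean passed) {}

    // The file is valid only if a result arrived for every block
    // and every one of them passed.
    static boolean isFileValid(List<BlockResult> results, int expectedBlocks) {
        return results.size() == expectedBlocks
                && results.stream().allMatch(BlockResult::passed);
    }
}
```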
So, for instance, a 1 GB file might be split into 10 blocks (the exact count depends on the configured HDFS block size). To run the same validation code on all blocks in parallel, I would want to run 10 separate instances of it.
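In a real run the per-block offsets, lengths, and replica hosts come straight from `FileSystem.getFileBlockLocations(status, 0, status.getLen())`; the sketch below only illustrates how a file length maps onto block-sized work units (the `Range` type is mine, not Hadoop's):

```java
import java.util.ArrayList;
import java.util.List;

public class BlockRanges {
    // One unit of work: a byte range of the file, one per container.
    record Range(long offset, long length) {}

    // Split a file of fileLen bytes into block-aligned ranges;
    // the final range may be shorter than blockSize.
    static List<Range> split(long fileLen, long blockSize) {
        List<Range> ranges = new ArrayList<>();
        for (long off = 0; off < fileLen; off += blockSize) {
            ranges.add(new Range(off, Math.min(blockSize, fileLen - off)));
        }
        return ranges;
    }
}
```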
My thinking is that I would launch a single ApplicationMaster that requests 10 containers, one per block.
Does my approach make sense? If not, what would you suggest I change? And if it does make sense, how do I launch the 10 instances while (ideally) pinning each container to the host that holds its block?
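For context on what I've pieced together so far: each block has several replica hosts (from `BlockLocation.getHosts()`), and my understanding is that the AM can pass a preferred-host array to `AMRMClient.ContainerRequest` with `relaxLocality = false` to make the request host-specific. Below is a sketch of the host-selection step only, greedily spreading containers across replicas to avoid piling them on one node; the input lists and the greedy policy are my own assumptions:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalityPlanner {
    // blockHosts: for each block (in order), the replica hosts as
    // BlockLocation.getHosts() would report them. Returns one chosen
    // host per block. In the AM, each chosen host would then back a
    //   new AMRMClient.ContainerRequest(capability,
    //       new String[]{host}, /*racks=*/null, priority,
    //       /*relaxLocality=*/false)   // false = hard host preference
    static List<String> assignHosts(List<String[]> blockHosts) {
        Map<String, Integer> load = new HashMap<>();
        List<String> chosen = new ArrayList<>();
        for (String[] replicas : blockHosts) {
            // Pick the replica host with the fewest containers so far.
            String best = replicas[0];
            for (String h : replicas) {
                if (load.getOrDefault(h, 0) < load.getOrDefault(best, 0)) {
                    best = h;
                }
            }
            chosen.add(best);
            load.merge(best, 1, Integer::sum);
        }
        return chosen;
    }
}
```

Is that roughly the right way to drive per-host container requests, or is there a more idiomatic mechanism for this?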