We run MapReduce jobs that have more than 100K mappers, each of which takes less than 10 seconds to run. In MR1 we could take advantage of JVM reuse, where Hadoop reuses an existing JVM for new mappers.
As everybody knows, JVM reuse is disabled in YARN/MR2, so a new JVM/container is launched for each mapper, and launching a new container takes a few extra seconds. You can imagine how badly this overhead can impact the performance of jobs that have more than 100K mappers.
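To give a rough sense of the scale, here is a back-of-envelope sketch in Python. The 3-second launch cost and the 500 concurrent containers are assumed numbers for illustration only, not measurements from our cluster, and the function name is just something made up for this post.

```python
import math

def container_launch_overhead(num_mappers, launch_overhead_sec, concurrent_containers):
    """Estimate the cost of starting a fresh container for every mapper.

    Returns (total_container_seconds, added_wall_clock_seconds).
    Simplifying assumption: mappers run in full "waves" of
    concurrent_containers tasks, and each wave pays the launch cost once.
    """
    # Cluster time spent purely on launching containers.
    total = num_mappers * launch_overhead_sec
    # Number of waves needed to run every mapper.
    waves = math.ceil(num_mappers / concurrent_containers)
    # Extra wall-clock time the job takes because of the launches.
    added_wall_clock = waves * launch_overhead_sec
    return total, added_wall_clock

# Assumed inputs: 100K mappers, 3 s launch cost, 500 containers at a time.
total, added = container_launch_overhead(100_000, 3, 500)
print(total, added)  # 300000 600
```

Under these assumptions the launches alone burn 300,000 container-seconds (about 83 hours of cluster time) and add roughly 10 minutes of wall-clock time to a job whose mappers each need less than 10 seconds of real work.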
We cannot use uber tasks, since uber mode is only meant for small jobs and our number of mappers is huge.
Does Cloudera have a solution for this? I think it's really a bad idea to retire JVM reuse in YARN. At the very least, it could be kept available with the default set to disabled.
Please do let me know if you find any workaround or alternative solution to this problem. I am also looking for a solution on this topic.