Created 07-27-2016 08:09 PM
We did some preliminary tests, and it seems distcp with -strategy dynamic improves performance by a substantial amount on our workload. The documentation I can find says that it improves performance with MOST workloads, but I can't find any clear guidance on which workloads it would perform poorly with.
1. If it is so much better in most situations, why isn't -strategy dynamic the Hadoop default?
2. What are the potential downsides to using it by default? Is there any use case where -strategy uniform would perform better?
Created 07-27-2016 08:49 PM
For smaller distcp jobs, I think setup time with the dynamic strategy will be longer than with the uniform size strategy. And if all maps are running at similar speeds, you won't gain much from the dynamic strategy and you still pay the extra setup time.
However, not all maps run at similar speeds. With the dynamic strategy, slower-running maps process less data and faster-running maps process more. I don't have an exact data size at which one works better than the other, but in general, on larger datasets and on heterogeneous clusters (where not all workers have the same hardware), the dynamic strategy has the advantage.
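To make the trade-off above concrete, here is a rough toy model in Python. It is not how DistCp is implemented; it only contrasts a static up-front split with a shared work queue. The function names, the `setup_cost` parameter, and all the numbers are assumptions made up for illustration.

```python
import heapq

def uniform_makespan(chunks, speeds):
    """Static split: chunks are dealt out round-robin before any map starts."""
    loads = [0.0] * len(speeds)
    for i, size in enumerate(chunks):
        loads[i % len(speeds)] += size
    # The job ends when the slowest-finishing map is done.
    return max(load / speed for load, speed in zip(loads, speeds))

def dynamic_makespan(chunks, speeds, setup_cost=0.0):
    """Work queue: whichever map becomes free first takes the next chunk.

    setup_cost is a hypothetical stand-in for the extra up-front work.
    """
    # Heap of (time at which the map becomes free, map index).
    free_at = [(setup_cost, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = setup_cost
    for size in chunks:
        t, i = heapq.heappop(free_at)
        t += size / speeds[i]
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish

chunks = [1.0] * 40                  # 40 equal-sized units of work (made up)
mixed = [1.0, 1.0, 1.0, 0.25]        # one map is 4x slower than the rest
equal = [1.0, 1.0, 1.0, 1.0]         # all maps run at the same speed

print("mixed speeds:", uniform_makespan(chunks, mixed),
      dynamic_makespan(chunks, mixed, setup_cost=2.0))
print("equal speeds:", uniform_makespan(chunks, equal),
      dynamic_makespan(chunks, equal, setup_cost=2.0))
```

In this model the static split is held back by the one slow map, while the queue lets the fast maps absorb its share. With equal speeds the two finish the copy in the same time, so the queue only adds its setup cost.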
Created 07-28-2016 12:35 PM
Interesting. I was seeing significant speedups even with maps running on symmetrical data nodes. So the only downside is that the initial setup time is greater with the dynamic strategy.