When to use distcp -strategy dynamic and why isn't it the Hadoop default?

Contributor

We did some preliminary tests, and it seems distcp with -strategy dynamic improves performance by a substantial amount on our workload. The documentation I could find says it improves performance with MOST workloads, but I can't find any clear guidance on which workloads it would perform poorly with.

1. If it is so much better in most situations, why isn't -strategy dynamic the Hadoop default?

2. What are the potential downsides to using it by default? Is there any use case where -strategy uniformsize (the default) would perform better?

1 ACCEPTED SOLUTION

Guru

For smaller distcp jobs, I think the setup time with the dynamic strategy will be longer than with the uniform-size strategy. And if all maps run at similar speeds, you won't gain much from the dynamic strategy and you still pay the extra setup time.

However, not all maps run at similar speeds. With the dynamic strategy, slower maps end up processing less data and faster maps process more. I don't have an exact data size at which one strategy beats the other, but in general the dynamic strategy has the advantage on larger datasets and on heterogeneous clusters (where not all workers have the same hardware).
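
To make the reasoning above concrete, here is a rough toy model in Python. It is not DistCp's actual implementation: the function names, the chunk size, and the mapper speeds are all made up for illustration, data is treated as freely divisible, and setup cost is ignored. It only compares a fixed equal split against mappers pulling small chunks as they become free.

```python
import heapq


def uniform_makespan(total_mb, speeds):
    """Static split: every mapper gets an equal share of the data up front.

    The job finishes when the slowest mapper finishes its share.
    """
    share = total_mb / len(speeds)
    return max(share / speed for speed in speeds)


def dynamic_makespan(total_mb, speeds, chunk_mb):
    """Dynamic split: whichever mapper is free first pulls the next chunk."""
    # Cut the data into fixed-size chunks, plus one smaller remainder chunk.
    chunks = [chunk_mb] * int(total_mb // chunk_mb)
    remainder = total_mb - chunk_mb * len(chunks)
    if remainder > 0:
        chunks.append(remainder)

    # Min-heap of (time at which the mapper becomes free, mapper index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)

    finish = 0.0
    for size in chunks:
        t, i = heapq.heappop(free_at)   # earliest-free mapper takes the chunk
        t += size / speeds[i]           # time it needs to copy this chunk
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish


# Hypothetical cluster: three slow mappers (10 MB/s) and one fast (30 MB/s).
mixed = [10, 10, 10, 30]
# Hypothetical cluster where every mapper runs at the same speed.
same = [10, 10, 10, 10]

print("mixed, uniform:", uniform_makespan(1200, mixed))
print("mixed, dynamic:", dynamic_makespan(1200, mixed, chunk_mb=10))
print("same,  uniform:", uniform_makespan(1200, same))
print("same,  dynamic:", dynamic_makespan(1200, same, chunk_mb=10))
```

In this model the mixed cluster finishes noticeably sooner with the dynamic split, because the fast mapper keeps pulling chunks instead of idling, while the identical-speed cluster shows no gain. Any extra setup time for the dynamic strategy would be added on top of these numbers.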

Contributor

Interesting, I was seeing significant speedups even with maps running on symmetrical data nodes. So the only downside is that the initial setup time is greater with the dynamic strategy.
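
Since no exact crossover point is given in this thread, the practical way to decide is to time both strategies on your own data. Below is a minimal sketch of such a comparison. The helper names, the default map count, and the example paths are hypothetical; only the `-strategy` and `-m` options are standard DistCp flags.

```python
import subprocess
import time


def distcp_command(strategy, src, dst, maps=20):
    """Build the argument list for one distcp run with the given strategy."""
    if strategy not in ("uniformsize", "dynamic"):
        raise ValueError("strategy must be 'uniformsize' or 'dynamic'")
    return ["hadoop", "distcp", "-strategy", strategy, "-m", str(maps), src, dst]


def time_distcp(strategy, src, dst, maps=20):
    """Run distcp and return the wall-clock time in seconds.

    Requires the `hadoop` CLI on PATH; raises CalledProcessError if the
    copy fails.
    """
    start = time.monotonic()
    subprocess.run(distcp_command(strategy, src, dst, maps), check=True)
    return time.monotonic() - start


# Example usage (placeholder paths; use a fresh destination for each run so
# that both strategies copy the same amount of data):
#   t_uniform = time_distcp("uniformsize", "hdfs://nn1:8020/src", "hdfs://nn2:8020/dst_a")
#   t_dynamic = time_distcp("dynamic", "hdfs://nn1:8020/src", "hdfs://nn2:8020/dst_b")
```

Running the same comparison on a small job and on a large one should show whether the extra setup time matters for your workload.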