When to use distcp -strategy dynamic and why isn't it the Hadoop default?

We did some preliminary tests, and distcp with -strategy dynamic seems to improve performance by a substantial amount on our workload. The documentation I can find says that it improves performance with MOST workloads, but I can't find any clear guidance on which workloads it would perform poorly with.

1. If it is so much better in most situations, why isn't -strategy dynamic the Hadoop default?

2. What are the potential downsides to using it by default? Is there any use case where -strategy uniform would perform better?

1 ACCEPTED SOLUTION

Re: When to use distcp -strategy dynamic and why isn't it the Hadoop default?

For smaller distcp jobs, I think the setup time with the dynamic strategy will be longer than with the uniform size strategy. And if all maps run at similar speeds, you won't gain much from the dynamic strategy and will still pay the extra setup time.

However, not all maps run at similar speeds. With the dynamic strategy, slower maps end up processing less data and faster maps process more. I don't have an exact data size at which one works better than the other, but in general, on larger datasets and on heterogeneous clusters (where not all workers have the same hardware), the dynamic strategy has the advantage.
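To make that trade-off concrete, here is a rough toy model in Python. It is not DistCp code: the function names, the equal-sized chunks, the round-robin static split, and the `setup_cost` parameter are all assumptions made for illustration. It only compares a fixed up-front split against maps pulling the next chunk whenever they are free.

```python
import heapq


def uniform_makespan(chunk_sizes, speeds):
    """Static split (toy model): chunks are dealt out round-robin up front."""
    load = [0.0] * len(speeds)
    for i, size in enumerate(chunk_sizes):
        load[i % len(speeds)] += size
    # The job finishes when the slowest map finishes its fixed share.
    return max(work / speed for work, speed in zip(load, speeds))


def dynamic_makespan(chunk_sizes, speeds, setup_cost=0.0):
    """Work pulling (toy model): a map takes the next chunk once it is free.

    setup_cost is a hypothetical fixed delay standing in for the extra
    setup time mentioned above.
    """
    # Min-heap of (time the map becomes free, map index).
    free_at = [(setup_cost, m) for m in range(len(speeds))]
    heapq.heapify(free_at)
    finish = setup_cost
    for size in chunk_sizes:
        t, m = heapq.heappop(free_at)
        t += size / speeds[m]
        finish = max(finish, t)
        heapq.heappush(free_at, (t, m))
    return finish


chunks = [1.0] * 12

# Heterogeneous cluster: one map runs at a quarter of the speed.
mixed = [1.0, 1.0, 1.0, 0.25]
print(uniform_makespan(chunks, mixed))   # 12.0
print(dynamic_makespan(chunks, mixed))   # 4.0

# Identical maps: the strategies tie, so any setup cost is a pure loss.
same = [1.0, 1.0, 1.0, 1.0]
print(uniform_makespan(chunks, same))                   # 3.0
print(dynamic_makespan(chunks, same, setup_cost=0.5))   # 3.5
```

In this model the static split waits on the slow map, while the pulling scheme lets the fast maps absorb most of the chunks. With identical maps the two tie, and whatever setup cost you assume is all that separates them.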

Re: When to use distcp -strategy dynamic and why isn't it the Hadoop default?

Interesting, I was seeing significant speedups even with maps running on symmetrical data nodes. So the only downside is that the initial setup time is greater with the dynamic strategy.
