Support Questions

gary_cameron · ‎07-27-2016

We did some preliminary tests and it seems distcp with -strategy dynamic improves performance by a substantial amount on our workload. The documentation I can find says that it improves performance with MOST workloads, but I can't find any clear guidance on which workloads it would perform poorly with.

1. If it is so much better in most situations, why isn't -strategy dynamic the Hadoop default?

2. What are the potential downsides to using it by default? Is there any use case where the uniform strategy (-strategy uniformsize) would perform better?

ravi1 · ‎07-27-2016

For smaller distcp jobs, I think setup time with the dynamic strategy will be longer than with the uniform size strategy. And if all maps run at similar speeds, you won't gain much from the dynamic strategy and you still pay the extra setup time.

However, not all maps run at similar speeds. With the dynamic strategy, slower maps end up processing less data and faster maps process more. I don't have an exact data size at which one works better than the other, but in general, on larger datasets and on heterogeneous clusters (where not all workers have the same hardware), the dynamic strategy has the advantage.
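
To make the trade-off concrete, here is a toy simulation of the two ideas. It is only a sketch and not how DistCp is actually implemented: the function names, the chunk size, the mapper speeds, and the `setup_cost` parameter are all assumptions made up for illustration. The static version hands each mapper an equal share of bytes up front; the dynamic version lets whichever mapper is free pull the next chunk.

```python
import heapq


def uniform_makespan(file_sizes, speeds):
    """Static split: every mapper is assigned roughly equal bytes up front."""
    loads = [0.0] * len(speeds)
    # Greedy balancing by bytes only; mapper speed is unknown at planning time.
    for size in sorted(file_sizes, reverse=True):
        lightest = loads.index(min(loads))
        loads[lightest] += size
    # The job finishes when the slowest mapper finishes its fixed share.
    return max(load / speed for load, speed in zip(loads, speeds))


def dynamic_makespan(file_sizes, speeds, files_per_chunk=2, setup_cost=0.0):
    """Work pulling: an idle mapper grabs the next chunk of files.

    setup_cost is a hypothetical stand-in for the extra setup time
    mentioned above; it delays the start of every mapper.
    """
    chunks = [
        sum(file_sizes[i:i + files_per_chunk])
        for i in range(0, len(file_sizes), files_per_chunk)
    ]
    # Min-heap of (time the mapper becomes free, mapper index).
    free_at = [(setup_cost, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = setup_cost
    for chunk in chunks:
        t, i = heapq.heappop(free_at)
        t += chunk / speeds[i]
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish


files = [100] * 40  # 40 equally sized files (made-up numbers)

# One mapper runs at a quarter of the speed of the others.
mixed = [1.0, 1.0, 1.0, 0.25]
print(uniform_makespan(files, mixed), dynamic_makespan(files, mixed))

# Identical mappers: the dynamic version only adds its setup cost.
same = [1.0, 1.0, 1.0, 1.0]
print(uniform_makespan(files, same), dynamic_makespan(files, same, setup_cost=50.0))
```

In this model the static split is held back by the slow mapper, while with identical mappers the pulling version gains nothing and only pays the assumed setup cost. Real results depend on file size distribution, chunking, and cluster load, so treat the numbers as illustrative only.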

gary_cameron · ‎07-28-2016

Interesting, I was seeing significant speedups even with maps running on symmetrical data nodes. So the only downside is that the initial setup time is greater with the dynamic strategy.
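
Since results like this vary by workload, one way to settle it is to time both strategies on the same copy. The helper below is a minimal sketch that only builds the command line; the function name and the example paths are hypothetical, while `-strategy {dynamic|uniformsize}` and `-m` are the documented DistCp options.

```python
def distcp_command(src, dst, strategy="uniformsize", maps=None):
    """Build the argv list for a distcp run with the given copy strategy."""
    if strategy not in ("dynamic", "uniformsize"):
        raise ValueError("strategy must be 'dynamic' or 'uniformsize'")
    cmd = ["hadoop", "distcp", "-strategy", strategy]
    if maps is not None:
        # -m caps the number of simultaneous map tasks.
        cmd += ["-m", str(maps)]
    return cmd + [src, dst]


# Hypothetical source and target paths, for illustration only.
for strategy in ("uniformsize", "dynamic"):
    print(" ".join(distcp_command("hdfs://nn1/data/src", "hdfs://nn2/data/dst", strategy, maps=20)))
```

Running each printed command against a fresh target directory and comparing wall-clock times (for example with `time`) gives a like-for-like comparison for a specific dataset and cluster.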

Cloudera Community

When to use distcp -strategy dynamic and why isn't it the Hadoop default?