Distcp copying between major versions


Hello All.

What problems can arise when copying data between two clusters running different major versions if you use hdfs://... instead of webhdfs://... for the source? For example:

hadoop distcp hdfs://<namenode>:<port> hdfs://<namenode>

Example from the documentation:

Copying between major versions

Run the distcp command on the cluster that runs the higher version of Cloudera, which should be the destination cluster. Use the following syntax:

hadoop distcp webhdfs://<namenode>:<port> hdfs://<namenode>

Note the webhdfs prefix for the remote cluster, which should be your source cluster. You must use webhdfs when the clusters run different major versions. When clusters run the same version, you can use the hdfs protocol for better performance.

For example, the following command copies data from a Cloudera source cluster named example-source to another Cloudera version destination cluster named example-dest:

hadoop distcp webhdfs://example-source.cloudera.com:8020 hdfs://example-dest.cloudera.com

1 REPLY


Hello @vit 

Thank you for reaching out to the Cloudera Community.

With the hdfs:// protocol, the DistCp client talks to the NameNode and DataNodes directly using Hadoop's internal Remote Procedure Call (RPC) mechanism and native data transfer protocol. This path is highly optimized for performance within a single cluster version, but it is not guaranteed to be compatible between major versions, so a copy between clusters on different major versions can fail with protocol or version mismatch errors.

The webhdfs:// protocol avoids these problems because it is not based on the internal, version-specific RPC system. Instead, it uses a standardized REST API that communicates over HTTP/S.
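To make the REST point concrete, here is a minimal sketch of how a WebHDFS request URL is put together. The helper name webhdfs_url, the host nn.example.com, and the path are placeholders for illustration only. Port 9870 is the default NameNode HTTP port on Hadoop 3 (50070 on Hadoop 2), so check what your cluster actually uses.

```shell
# Build a WebHDFS REST URL of the form
#   http://<host>:<port>/webhdfs/v1<path>?op=<OPERATION>
# The helper name and all argument values used below are hypothetical.
webhdfs_url() {
  host="$1"   # NameNode host name
  port="$2"   # NameNode HTTP port (not the RPC port)
  path="$3"   # absolute HDFS path, starting with /
  op="$4"     # WebHDFS operation, e.g. LISTSTATUS or GETFILESTATUS
  printf 'http://%s:%s/webhdfs/v1%s?op=%s\n' "$host" "$port" "$path" "$op"
}

# Listing a directory over plain HTTP needs a reachable NameNode,
# so the call is left commented out here:
# curl -s "$(webhdfs_url nn.example.com 9870 /user/data LISTSTATUS)"
```

Because the request is an ordinary HTTP call with a stable URL layout, the client does not depend on the RPC wire format of the remote cluster's Hadoop version.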

This is why Cloudera's documentation (and general Hadoop best practice) insists on using webhdfs:// when running distcp between clusters of different major versions.
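As a rough illustration of that rule, the sketch below picks the source scheme by comparing the major version of each cluster. The function name source_scheme and the version strings are made up for the example, and this is only one way to encode the guidance, not something shipped with Hadoop or Cloudera.

```shell
# Pick the DistCp source scheme from two version strings.
# Same major version      -> hdfs    (native protocol, better performance)
# Different major version -> webhdfs (REST over HTTP/S)
# The function name and the version strings are hypothetical.
source_scheme() {
  src_major="${1%%.*}"   # keep everything before the first dot
  dst_major="${2%%.*}"
  if [ "$src_major" = "$dst_major" ]; then
    echo hdfs
  else
    echo webhdfs
  fi
}

# Run on the destination (higher version) cluster, for example:
# hadoop distcp "$(source_scheme 2.6.0 3.1.1)://<namenode>:<port>" hdfs://<namenode>
```

Remember that the port after the source NameNode has to match the chosen scheme: the NameNode HTTP port for webhdfs, the RPC port for hdfs.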

Hope this helps. If you run into any issues while performing distcp, please don't hesitate to reach out to Cloudera Support by raising a ticket through the MyCloudera portal.