Created 07-03-2025 12:58 AM
Hello All.
What problems can arise when copying data between two clusters with different major versions if you use hdfs://... instead of webhdfs://...? For example:
hadoop distcp hdfs://<namenode>:<port> hdfs://<namenode>
Example from the documentation:
Run the distcp command on the cluster that runs the higher version of Cloudera, which should be the destination cluster. Use the following syntax:
hadoop distcp webhdfs://<namenode>:<port> hdfs://<namenode>
Note the webhdfs prefix for the remote cluster, which should be your source cluster. You must use webhdfs when the clusters run different major versions. When clusters run the same version, you can use the hdfs protocol for better performance.
For example, the following command copies data from a Cloudera source cluster named example-source to another Cloudera version destination cluster named example-dest:
hadoop distcp webhdfs://example-source.cloudera.com:8020 hdfs://example-dest.cloudera.com
Created 08-13-2025 07:24 AM
Hello @vit
Thank you for reaching out to the Cloudera Community.
With the hdfs:// scheme, the DistCp client communicates directly with the NameNode and DataNodes using Hadoop's internal Remote Procedure Call (RPC) and data-transfer protocols. These protocols are highly optimized for performance within a single cluster version, but they are not guaranteed to be compatible between major versions.
The webhdfs:// protocol avoids these problems because it is not based on the internal, version-specific RPC system. Instead, it uses a standardized REST API that communicates over HTTP/S.
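To illustrate the REST side, a WebHDFS request is an ordinary HTTP URL of the form http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>. The sketch below only builds and prints such a URL and does not contact any cluster. The helper name webhdfs_url, the path /user/data, and port 9870 (the default NameNode HTTP port in Hadoop 3) are assumptions for illustration, so substitute the values for your own cluster.

```shell
#!/bin/sh
# Hypothetical helper: build a WebHDFS REST URL for a given operation.
# Arguments: host, HTTP port, absolute HDFS path, WebHDFS operation name.
webhdfs_url() {
    host=$1; port=$2; path=$3; op=$4
    printf 'http://%s:%s/webhdfs/v1%s?op=%s\n' "$host" "$port" "$path" "$op"
}

# Print the URL that would list the (assumed) directory /user/data
# on the source cluster from the documentation example.
webhdfs_url example-source.cloudera.com 9870 /user/data LISTSTATUS
```

The printed URL can then be passed to any HTTP client, for example curl -i "<url>", which is a quick way to check that the source cluster's WebHDFS endpoint is reachable before starting a copy.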
This is why Cloudera's documentation (and general Hadoop best practice) insists on using webhdfs:// when running distcp between clusters of different major versions.
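The rule can be sketched as a small dry-run script: compare the major versions and pick the source scheme accordingly. The helper names source_scheme and distcp_command and the version strings 6.3.4 and 7.1.9 are made up for illustration; the script only prints the command and does not run distcp.

```shell
#!/bin/sh
# Hypothetical helper: choose the source URI scheme from two version strings.
# Same major version -> hdfs (native RPC); different -> webhdfs (REST over HTTP).
source_scheme() {
    src_major=${1%%.*}   # text before the first dot, e.g. "6" from "6.3.4"
    dst_major=${2%%.*}
    if [ "$src_major" = "$dst_major" ]; then
        echo hdfs
    else
        echo webhdfs
    fi
}

# Dry run: print the distcp command instead of executing it.
# Arguments: source version, destination version, source address, destination address.
distcp_command() {
    scheme=$(source_scheme "$1" "$2")
    echo "hadoop distcp ${scheme}://$3 hdfs://$4"
}

# Illustrative versions only; the placeholders are left for you to fill in.
distcp_command 6.3.4 7.1.9 '<namenode>:<port>' '<namenode>'
```

Keep in mind that the port has to match the scheme: hdfs:// uses the NameNode RPC port, while webhdfs:// uses the NameNode HTTP(S) port.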
Hope this helps. If you run into any issues while performing distcp, please don't hesitate to reach out to Cloudera Support by raising a ticket through the MyCloudera portal.
Created 09-04-2025 04:38 AM
Thank you very much for the detailed answer!
Created 09-04-2025 06:32 AM
Hello @vit
I'm glad you got the answer you were looking for. Could you please mark it with "Accept as Solution" as well?