When to use webhdfs and hdfs

Hi all,

I am confused about when to use these protocols. I am trying to migrate data between clusters with the distcp command and got stuck here.

1. Which one will be faster?

2. Can we use one protocol at the source and the other at the destination (I mean a combination of both)?

3. When should we use webhdfs in particular?

4. Will there be any difference in transfer speed between these protocols?

5. Which port numbers are needed for each (somewhere I saw commands with 50070 and 80020; when to use which)?

If there is any document or URL on this topic, please share.

Thanks

Kishore


Re: When to use webhdfs and hdfs

Master Guru

> 1. Which one will be faster?

The native protocol of HDFS is hdfs:// and this is the fastest type (purely TCP, with efficient data packet transfers). Other protocols such as webhdfs:// or the deprecated hftp:// add overheads due to their HTTP usage that make them slower overall.

> 2. Can we use one protocol at source and other at destination (I mean combination of both)
> 3. When should we use webhdfs in particular

Yes to (2).
See http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_admin_distcp_da... for (3).

Rule of thumb is:
- Use webhdfs:// for the source when it's a different major version (such as a CDH4 source to a CDH5 target).
- Use hdfs:// otherwise, when the major version is the same (such as between any CDH 5.x releases).
- Prefer webhdfs:// over hftp://, unless it's a very old version (pre CDH3u5) that has no WebHDFS support.
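As a rough illustration of the rule of thumb above, here is a minimal shell sketch. The function name `choose_source_scheme` and its arguments are hypothetical, and it only covers the first two rules (the hftp:// fallback for very old sources is left out).

```shell
# Hypothetical helper: pick the DistCp source URI scheme from the major
# versions of the source and target clusters (rule of thumb only).
choose_source_scheme() {
  src_major="$1"
  dst_major="$2"
  if [ "$src_major" -eq "$dst_major" ]; then
    echo "hdfs"      # same major version: use the native protocol
  else
    echo "webhdfs"   # different major versions: use the HTTP-based protocol
  fi
}

choose_source_scheme 4 5   # CDH4 source, CDH5 target
choose_source_scheme 5 5   # CDH 5.x on both sides
```

The first call prints the scheme for a cross-version copy and the second for a same-version copy.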

> 4. Will there be any difference in transfer speed between these protocols.

Yes. This is also a repeat of (1), which I've answered above.

> 5. What will be the port numbers needed in using these (somewhere I saw commands with 50070 and 80020, when to use what)

Follow the CDH5 ports guide at http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_ports_cdh5.h... to find the right ports for your environment. The statements below use the default ports.


HDFS native protocol transfers require every host in the cluster running the DistCp job (usually the target) to be able to reach the source's port 8020 on the NameNode(s) and ports 50010/1004 and 50020 on all DataNodes.
WebHDFS or HFTP (HTTP-based) transfers require every host in the cluster running the DistCp job (usually the target) to be able to reach the source's port 50070 on the NameNode(s) and port 50075/1006 on all DataNodes.
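The port requirements above can be summarized in a small lookup, sketched here as a shell function. The name `required_source_ports` is hypothetical, and the values are only the defaults quoted above; check the ports guide for your own environment.

```shell
# Hypothetical helper: list the source-cluster ports that every host of the
# cluster running DistCp must be able to reach (default ports quoted above).
required_source_ports() {
  case "$1" in
    hdfs)
      echo "NameNode 8020"
      echo "DataNode 50010/1004"
      echo "DataNode 50020"
      ;;
    webhdfs|hftp)
      echo "NameNode 50070"
      echo "DataNode 50075/1006"
      ;;
    *)
      echo "unknown protocol: $1" >&2
      return 1
      ;;
  esac
}

required_source_ports hdfs
required_source_ports webhdfs
```

A list like this can serve as a checklist when asking for firewall rules between the two clusters.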

Re: When to use webhdfs and hdfs

Hi Harsha,

Thanks for the quick reply; the answers are clear.

I am transferring data from an insecure cluster to a secure cluster. From the link you provided I can see that we need to use either hdfs or webhdfs. The insecure cluster is on 5.3.x and the secure cluster is on 5.4.x, so I am using webhdfs at the source and webhdfs at the destination. Is this the best way to do it?

You are suggesting webhdfs to hdfs, but in your first point you said hdfs uses TCP and will be faster than https, right?

Please correct me if I am going wrong anywhere.

Thanks in advance.

Kishore

Re: When to use webhdfs and hdfs

Hi Harsh,

I forgot to describe my problem in the previous post.

I am running distcp with the following command:

time hadoop distcp -p -strategy dynamic -m 40 webhdfs://<Source Name node IP>:50070/<path to file>   hdfs://<Name service of destination cluster>/<path to file>

 

But I am getting this error: ERROR [main] org.apache.hadoop.tools.util.RetriableCommand: Failure in Retriable command: Copying  webhdfs://<Source Name node IP>:50070/<path to file>   hdfs://<Name service of destination cluster>/<path to file>

 

I am executing this command on the destination cluster.

I have the following doubts to clear up before re-executing the command:

1. I am not able to use the name service for the source location instead of <Source Name node IP>:50070 when executing the command.

2. Do we need to specify port 8022 when using hdfs at the destination with a name service?

Thanks

Kishore

Re: When to use webhdfs and hdfs

Master Guru
Normally you can always use hdfs:// at the destination (as long as the job is run on the destination).

If your source is insecure, you will need to set the property mentioned in the documentation linked earlier:

"""
To enable the fallback configuration, for copying between a secure cluster and an insecure one, add the following to the HDFS core-default.xml, by using an advanced configuration snippet if you use Cloudera Manager, or editing the file directly otherwise.

<property>
<name>ipc.client.fallback-to-simple-auth-allowed</name>
<value>true</value>
</property>
"""
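For an ad-hoc run, the same property can usually be passed on the command line through Hadoop's generic -D option instead of editing the config file. The sketch below only prints such a command rather than running it; the function name `build_distcp_cmd`, the host name, and the paths are hypothetical placeholders.

```shell
# Hypothetical helper: print (not run) a DistCp command that enables the
# simple-auth fallback for this one job via the generic -D option.
build_distcp_cmd() {
  src="$1"
  dst="$2"
  echo "hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true $src $dst"
}

# Placeholder source NameNode host and destination name service.
build_distcp_cmd "webhdfs://source-nn.example.com:50070/data/in" "hdfs://destns/data/in"
```

Run the printed command from the secure (destination) cluster, as with the other DistCp jobs in this thread.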

Using a name service ID for the source requires that name service's configuration to be defined in the client configs of the destination cluster. CM does not do this automatically today, but it is not difficult to do by hand. For most ad-hoc runs, passing the active NN directly works well enough.

The port 8022 is not for client/end-user/job usage, and need not be specified.

Re: When to use webhdfs and hdfs

Thank you for the detailed explanation.

Re: When to use webhdfs and hdfs

Hi Harsha, 

 

Can you please look into this as well?

 

https://community.cloudera.com/t5/Batch-SQL-Apache-Hive/Unauthorized-user-is-able-to-create-in-hive-...

 

Thanks

Kishore
