
Co-located client

Explorer

In production setups, are files loaded into HDFS from one particular machine?

If so, and that machine were also a DataNode, wouldn't it be identified as a co-located client, thus preventing data distribution across the cluster?

Or is the standard practice to load the files from the NameNode host?

Or what other practice is commonly used for loading files into HDFS?

Appreciate the insights.

1 ACCEPTED SOLUTION

Super Guru

I think there is confusion over how we are defining co-located clients, partly because of the way I first understood and answered your question. Your edge nodes are different from what Hadoop calls a co-located client, which is probably the term you read somewhere.

When a MapReduce job runs, it spawns a number of mappers, and each mapper usually reads the data on the node it is running on (the local data). Each of these mappers is a client, or a co-located client when it is reading local data. However, these mappers were not taking advantage of the fact that the data they read is local to where they run (in some cases a mapper might read data from a remote machine, which is inefficient when it happens). The mappers were using the same TCP protocol to read local data as they would use to read remote data. It was determined that performance could be improved by about 20-30% simply by making these mappers read data blocks directly off the disk when they are aware the data is local. This change was made to capture that improvement. If you would like more details, please see the following JIRA:

https://issues.apache.org/jira/browse/HDFS-347

If you scroll down in this JIRA you will see a design document, which should clear up any remaining confusion.
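For reference, the short-circuit read path from HDFS-347 is switched on by client-side configuration. Below is a minimal sketch of setting it programmatically, assuming the standard dfs.client.read.shortcircuit / dfs.domain.socket.path property names; the socket path shown is just a common choice, not a requirement:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShortCircuitConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // HDFS-347: when the client and the block replica are on the same
        // machine, read the block file directly from local disk instead of
        // streaming it from the DataNode over TCP.
        conf.setBoolean("dfs.client.read.shortcircuit", true);

        // UNIX domain socket the DataNode and client use to pass file
        // descriptors; this path is a common choice, adjust to your install.
        conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");

        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println("Connected to " + fs.getUri());
        }
    }
}
```

In practice these two properties are usually set once in hdfs-site.xml rather than in code, and the DataNodes need the native Hadoop library for the file-descriptor passing to work.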


14 REPLIES

Super Guru

In production, you would have "edge nodes" where the client programs are installed, and those programs talk to the cluster. But even if you put data on the local file system of a DataNode and then copy it into HDFS, that will not prevent data distribution. The client file lives in the local file system (XFS, ext4), which is unrelated to HDFS (well, not exactly, but it is as far as your question is concerned).

Standard practice is to use an edge node, not the NameNode.
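To make the edge-node workflow concrete, here is a minimal sketch of loading a local file into HDFS with the Java FileSystem API; the NameNode address and both paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadIntoHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address; on a real edge node this usually comes from
        // the core-site.xml on the classpath instead of being set in code.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Copy from the edge node's local disk into HDFS. The NameNode
            // chooses the block placements, so the data is distributed
            // across the cluster no matter which machine runs this client.
            fs.copyFromLocalFile(new Path("/data/local/input.csv"),
                                 new Path("/user/me/input.csv"));
        }
    }
}
```

The command-line equivalent is hdfs dfs -put, which uses the same client configuration.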

Explorer

If moving files into HDFS from a DataNode will not prevent distribution, then when does the co-located client dynamic come into play?

Also, is the edge node that you mention a DataNode? If not, is it simply a machine with Hadoop software installed to facilitate interaction with HDFS?

Appreciate the feedback.

Super Guru

is it simply a machine with Hadoop software installed to facilitate interaction with HDFS?

Yes.

Explorer

Hello, any response on whether the 'edge node' is a DataNode?

Appreciate the feedback.

Super Guru

whether the 'edge node' is a DataNode?

No. If you want, you can put edge processes, such as the client configs needed to run client programs, on the same node as a DataNode, but that does not make the DataNode an edge node. Ideally this is not recommended, but if you have a very small cluster then sure, there is no problem with that.

Explorer

So what is required for the edge node to connect to the cluster: Hadoop software, core-site.xml, hdfs-site.xml, ... and what else?

Appreciate the clarification.

Explorer

Can I have a response on what is required for the edge node to connect to the cluster, please?

Appreciate the feedback.

You can also do this securely via a REST API over HTTP from any node:

1. WebHDFS: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_hdfs_admin_tools/content/ch12.html

2. HttpFS, if you plan on using WebHDFS in a High Availability cluster (Active and Standby NameNodes).

You can also implement Knox as a single, secure REST access point (with different port numbers) for:

- Ambari
- WebHDFS
- HCatalog
- HBase
- Oozie
- Hive
- YARN / Resource Manager
- Storm

http://hortonworks.com/apache/knox-gateway/

http://hortonworks.com/hadoop-tutorial/securing-hadoop-infrastructure-apache-knox/
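To illustrate the WebHDFS option, here is a minimal sketch that lists an HDFS directory over plain HTTP using only the JDK; the host, port, and path are placeholders (50070 was the default NameNode HTTP port in HDP 2.x):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsList {
    public static void main(String[] args) throws Exception {
        // WebHDFS REST call: GET /webhdfs/v1/<path>?op=LISTSTATUS
        URL url = new URL(
            "http://namenode.example.com:50070/webhdfs/v1/user/me?op=LISTSTATUS");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // The response body is JSON describing each entry under the path.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}
```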

Explorer

Thanks Binu.

Explorer

By the way, what is the actual advantage of a co-located client? It only stores the first block of the file at the client/DataNode, right? The rest of the blocks are distributed across HDFS. So what is the big advantage of storing the first block? Does that really help performance?

Appreciate the insights.

Super Guru

Starting from version 2.1, you should see better read performance for co-located clients. For writes, not so much. It's the reads that will be faster, because the client is on the same machine as the data block.

If my answer helped, please accept.
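If you want to verify the effect, the HDFS client exposes per-stream read statistics. A minimal sketch, assuming the opened stream is an HdfsDataInputStream and using a placeholder path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataInputStream;

public class ReadLocality {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration());
             FSDataInputStream in = fs.open(new Path("/user/me/input.csv"))) {

            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // Drain the stream so the statistics cover the whole file.
            }

            if (in instanceof HdfsDataInputStream) {
                HdfsDataInputStream hin = (HdfsDataInputStream) in;
                // Local and short-circuit byte counts show how much of the
                // read was served from this machine rather than the network.
                System.out.println("total bytes read:    "
                        + hin.getReadStatistics().getTotalBytesRead());
                System.out.println("local bytes read:    "
                        + hin.getReadStatistics().getTotalLocalBytesRead());
                System.out.println("short-circuit bytes: "
                        + hin.getReadStatistics().getTotalShortCircuitBytesRead());
            }
        }
    }
}
```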

Explorer

But are people going to access only one block? Big Data itself implies processing thousands of blocks.

So why would faster access to one single block matter?

Super Guru

I think I understand your point. See my new answer, which is marked above as the accepted solution.

