
Co-located client

Rising Star

In production setups, are files loaded into HDFS from a particular machine?

If so, and if that machine were also a datanode, wouldn't it be identified as a co-located client, thus preventing data distribution across the cluster?

Or is the standard practice to load the files from the NameNode host?

Or what other practices are commonly used for loading files into HDFS?

Appreciate the insights.

14 REPLIES

Super Guru

In production, you would have "edge nodes" where client programs are installed and talk to the cluster. But even if you put data on a datanode's local file system and then copy it into HDFS, that will not prevent data distribution. The client file lives in the local file system (XFS, ext4), which is unrelated to HDFS (well, not exactly, but as far as your question is concerned).

Standard practice is to use an edge node, not the NameNode host.
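
For illustration, a minimal sketch of this load path using Hadoop's FileSystem Java API. The file paths are hypothetical, and the cluster's client configs (core-site.xml, hdfs-site.xml) are assumed to be on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath, so the
        // client knows where the NameNode is.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: the source file lives on the edge node's own
        // local file system (ext4/XFS), not in HDFS.
        Path localFile = new Path("/data/incoming/events.csv");
        Path hdfsDir = new Path("/user/etl/landing/");

        // The NameNode, not the submitting machine, decides which datanodes
        // receive each block replica.
        fs.copyFromLocalFile(localFile, hdfsDir);
        fs.close();
    }
}
```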

Rising Star

If moving files into HDFS from a datanode will not prevent distribution, then when does the co-located client dynamic come into play?

Also, is the edge node that you mention a datanode? If not, is it simply a machine with Hadoop software to facilitate interaction with HDFS?

Appreciate the feedback.

Super Guru

is it simply a machine with Hadoop software to facilitate interaction with HDFS?

Yes.

Rising Star

Hello, any response on whether the 'edge node' is a datanode?

Appreciate the feedback.

Super Guru

whether the 'edge node' is a datanode?

No. You can, if you want, put edge processes (client configs and client programs) on the same node as a datanode, but that doesn't make the datanode an edge node. Ideally this is not recommended, but if you have a very small cluster, then sure, no problem with that.
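
To make the "co-located client" behavior concrete: HDFS's default block placement policy writes the first replica to the local datanode when the writing client happens to run on one; the remaining replicas still go to other nodes, so distribution is skewed rather than prevented. A sketch (using the same hypothetical path as above) for inspecting where a file's block replicas actually landed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockPlacement {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file written earlier; ask the NameNode which hosts
        // hold replicas of each block.
        Path file = new Path("/user/etl/landing/events.csv");
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        // If the file was written from a datanode, that host will appear
        // among the replica locations of every block; the other replicas
        // will be spread across the cluster.
        for (BlockLocation block : blocks) {
            System.out.println(String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```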

Rising Star

So what is required for the edge node to connect to the cluster: Hadoop software, core-site.xml, hdfs-site.xml, ... and what else?

Appreciate the clarification.
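
For reference, the essentials are the Hadoop client libraries, network access to the NameNode and datanode ports, and the cluster's client configs (core-site.xml, hdfs-site.xml); a Kerberized cluster also needs credentials. A minimal sketch, with a hypothetical hostname and port, setting fs.defaultFS in code in place of core-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class EdgeNodeConnect {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent of the fs.defaultFS entry in core-site.xml; the
        // hostname and port (8020 is a common NameNode RPC port) are
        // assumptions for this sketch.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to " + fs.getUri());
        fs.close();
    }
}
```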

Rising Star

Can I have a response on what is required for the edge node to connect to the cluster, please?

Appreciate the feedback.

Binu

You can also securely do this via a REST API over HTTP from any node:

1. WebHDFS: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_hdfs_admin_tools/content/ch12.html

2. HttpFS: use this instead if you plan on using WebHDFS in a High Availability cluster (Active and Standby NameNodes)

You can also implement Knox for a single and secure REST access point (with different port numbers) for:

- Ambari
- WebHDFS
- HCatalog
- HBase
- Oozie
- Hive
- YARN
- Resource Manager
- Storm

http://hortonworks.com/apache/knox-gateway/

http://hortonworks.com/hadoop-tutorial/securing-hadoop-infrastructure-apache-knox/
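
As a sketch of what a raw WebHDFS call looks like (the NameNode hostname and user are hypothetical; 50070 is the default NameNode HTTP port in HDP 2.x, and clusters without Kerberos accept a user.name query parameter for pseudo authentication):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsList {
    public static void main(String[] args) throws Exception {
        // WebHDFS maps HDFS paths under /webhdfs/v1/ and takes the
        // operation as a query parameter.
        URL url = new URL("http://namenode.example.com:50070"
                + "/webhdfs/v1/user/etl?op=LISTSTATUS&user.name=etl");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // The response is a JSON FileStatuses document; print it raw.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}
```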

Rising Star

Thanks Binu.