Community Articles

Find and share helpful community-sourced technical articles.
Announcements
Celebrating as our community reaches 100,000 members! Thank you!
avatar

Introduction

Motivation

In many production clusters, the servers (or VMs) have multiple physical Network Interface Cards (NICs) and/or multiple Virtual Network Interfaces (VIFs) and are configured to participate in multiple IP address spaces (“networks”). This may be done for a variety of reasons, including but not limited to the following:

  • to improve availability or bandwidth through redundancy
  • to partition traffic for security, bandwidth, or other management reasons
  • etc.

In many of these “multi-homed” clusters it is desirable to configure the various Hadoop-related services to “listen” on all the network interfaces rather than on just one. Clients, on the other hand, must be given a specific target to which they should initiate connections.

Note that NIC Bonding (also known as NIC Teaming or Link Aggregation), where two or more physical NICs are made to look like a single network interface, is not included in this topic. The settings discussed below are usually not applicable to a NIC bonding configuration which handles multiplexing and failover transparently while presenting only one “logical network” to applications. However, there is a hybrid case where multiple physical NICs are bonded for redundancy and bandwidth, then multiple VIFs are layered on top of them for partitioning of data flows. The multi-homing parameters then relate to the VIFs.

This Document

The purpose of this document is to gather in one place all the many parameters for all the different Hadoop-related components, related to configuring multi-homing support for services and clients. In general, to support multi-homing on the services you care about, you must set these parameters in those components and any components upon which they depend. Almost all components depend on Hadoop Core, HDFS and Yarn, so these are given first, along with Security related parameters.

This document assumes you have HDP version 2.3 or later. Most but not all of the features are available in 2.1 and 2.2 also.

Vocabulary

  • Server and service are carefully distinguished in this document. Servers are host machines or VMs. Services are Hadoop component processes running on servers.
  • Client: a process that initiates communication with a service. Client processes may be on a host outside of the Hadoop cluster, or on a server that is part of the Hadoop cluster. Sometimes Hadoop services are clients of other Hadoop services.
  • DNS and rDNS lookup: “Distributed Name Service” (alt. “Domain Name Service”) - The service that converts hostnames to IP addresses for clients wishing to communicate with said hostname; this is often referred to as “DNS lookup”. If properly configured, the service can also provide “reverse-DNS” (rDNS) lookup, which converts IP addresses back into hostnames. It is common for different networks to have different DNS Servers, and the same DNS or rDNS query to different name servers may get different responses. This is used for Hadoop Multi-Homing to provide a level of indirection associating different IP addresses assigned to a multi-homed server for clients on different networks.
  • Round-trip-able DNS: Within a single network’s DNS service, if one starts with a valid IP address and performs an rDNS lookup to get a hostname, then DNS lookup on that hostname should return the original IP address; or alternatively, if one starts with a valid hostname and performs a DNS lookup on it to get an IP address, then rDNS lookup on that IP address should return the original hostname.
  • FQDN: “Fully Qualified Domain Name” - the long form of hostnames, of the form “shortname.domainname”, eg, “host234.lab1.hortonworks.com”. This contrasts with the “short form” hostname, which in this example would be just “host234”.

Use Cases

To understand the semantics of configuration parameters for multi-homing, it is useful to consider a few practical use cases for why it is desirable to have multi-homed servers and how Hadoop can utilize them. Here are four use cases that cover most of the concepts, although this is not an exhaustive list:

  1. intra-cluster network separate from external client-server network
  2. dedicated high-bandwidth network separate from general-use network
  3. dedicated high-security or Hadoop administrator network separate from medium-security general-use network
  4. sysop/non-Hadoop network separate from Hadoop network(s)

The independent variables in each use case are:

  1. which network interfaces are available on servers and client hosts, within the cluster and external to the cluster; and
  2. which interfaces are preferred for service/service and client/service communications for each service.

<Use cases to be expanded in a later version of this document.>

Theory of Operations

In each use case, the site architects will determine which network interfaces are available on servers and client hosts, and which interfaces are preferred for service/service and client/service communications of various types. We must then provide configuration information that specifies for the cluster:

  1. How clients will find the server hostname of each service
  2. Whether each service will bind to one or all network interfaces on its server
  3. How services will self-identify their server to Kerberos

Over time, the Hadoop community evolved an understanding of the configuration practices that make these multi-homing use cases manageable:

  1. Configuration files should be shared among clients and services, as much as possible. This is highly desirable to minimize management overhead, and the desire to share common config files has guided a lot of the design choices for multi-homing support in Hadoop.
  2. Names are important. In multi-homed cluster configuration parameters, servers shouldalways be identified or “addressed” by hostname, not IP address.[1] This enables the next several bullet points.
  3. FQDNs are strongly preferred. It is recommended that all hostname strings be represented as Fully Qualified Domain Names, unless short hostnames are used throughout with rigorous consistency, for both clients and servers.
  4. Each network should have DNS service available, with forward and reverse name lookup implemented “round-trip-able” for all Hadoop servers and client hosts expected to use that network. It is common and allowable for the same hostname to resolve to various of the host’s multiple IP addresses, depending on which DNS service (on which network) is used.
  5. DNS services are a level of indirection for deriving both IP address and hostname info from the parameters specified in shared configuration files or from advertised values, to resolve to different networks for different querying processes. This allows, for example, external clients to find and talk to Datanodes on an external (client/server) network, while Datanodes find and talk to each other on an internal (intra-cluster) network.
  6. Each server should have one consistent hostname on all interfaces. Theoretically, DNS allows a server to have a different hostname per network interface. But it is required[2] for multi-homed Hadoop clusters that each server have only one hostname, consistent among all network interfaces used by Hadoop. (Network interfaces excluded from use by Hadoop may allow other hostnames.)
  7. If /etc/hosts files are used in lieu of DNS, then the above constraint to have the same hostname on all interfaces means that each /etc/hosts file can only specify one interface per server. Thus, clients at least are likely to have different /etc/hosts files than servers, and the allowable choices for use of the different networks may be constrained. If these constraints are not acceptable, provide DNS services.

Prior to 2014 there were a variety of parameters proposed for specific issues related to multi-homing. These were incremental band-aids with inconsistent naming and incomplete semantics. In mid-2014 (for hadoop-2.6.0) the Hadoop Core community resolved to address Multi-Homing completely, with consistent semantics and usages. Other components also evolved multi-homing support, with a variety of parameter name and semantics choices.

In general, for every service one must consider three sets of properties, related to the three key questions at the beginning of this section:

1. “Hostname Properties” relate to either specifying or deriving the server hostname by which clients will find each service, and the appropriate network for accessing it.

  • In some services (eg, HDFS Namenode and YARN ResourceManager) there may just be a parameter in a shared config file that hard-wires a specified server name. This approach is only practical for singleton services like masters, otherwise the config files would have to be edited differently on every node, instead of shared.
  • Many services (eg, Datanode, HBase services) use generic parameters or other means to perform a self-name-lookup via reverse-DNS, followed by registration (with Master or ZK) where the hostname gets advertised to clients.
  • Sometimes the hostname is neither specified nor advertised. Clients just need to know somehow, external to the config system, "based on human interaction".
  • Each client does a DNS lookup on the server hostname to get the IP address (and the associated network) on which to communicate with that service. So this client aspect of the network architecture decisions are expressed in the DNS server configuration rather than the Hadoop configuration info.

2. “Bind Address Properties” provides optional additional addressing information which is used only by the service itself, and only in multi-homed servers. By default, services will bind only to the default network interface [3] on each server. The Bind Address Properties can cause the service to bind to a non-default network interface if desired, or most commonly the bind address can be set to “0.0.0.0”, causing the service to bind toall available network interfaces on its server.

  • The bind address usually takes the form of an IP address rather than a network interface name, to be type-compatible with the conventional “0.0.0.0” notation. However, some services allow specifying a network interface name.
  • The default interface choice, if bind address is unspecified, varies for different services. Choices include:
    • the default network interface
    • all interfaces
    • some other specific interface defined by other parameters.

3. “Kerberos Host Properties” provides the information needed to resolve the“_HOST” substitution in each service’s own host-dependent Kerberos principal name. This is separate from the “Hostname Property” because not all services advertise their hostname to clients, and/or may need to use a different means to determine hostname for security purposes.

  • Usually the Kerberos Host substitution name is derived from another self-name-lookup, similar to hostname from #1, but possibly on a different DNS service.
  • Sometimes the hostname info from #1 is used for Kerberos also.
  • Absent other information, the Kerberos Host will be assigned the default hostname from the default network interface, obtained from the Java call, InetAddress.getLocalHost().getCanonicalHostName()

Different services evolved different parameters, or combinations of parameters, and parameter naming conventions, but all services need to somehow present the above three property sets.

The “listener” port number for each service must also be managed, but this property does not change in single- vs multi-home clusters. So we will not discuss it further here, except to mention that in HDFS and YARN it is conjoined with the “address” parameters (Hostname Property), while in other services such as HBase, it is in separate “port” parameters.

Parameters

HDFS

Service Parameters

For historical reasons, in HDFS and YARN most of the parameters related to Hostname Properties end in “-address” or “.address”. Indeed, in non-multi-homed clusters it is acceptable and was once common to simply use IP addresses rather than hostname strings. In multi-homed clusters, however, all server addressing must be hostname-based. In HDFS and YARN most of the parameters related to Bind Address Properties end in “-bind-host” or “.bind-host”.

The four Namenode services each have a dfs.namenode.${service}-address parameter, where ${service} is rpc, servicerpc, http, or https. The Namenode server’s hostname should be assigned here, to establish the Hostname Property for each service. This information is provided to clients via shared config files.

For each, there is an optional corresponding dfs.namenode.${service}-bind-host parameter to establish the Bind Address Properties. It may be set to the server’s IP address on a particular interface, or to “0.0.0.0” to cause the Namenode services to listen on all available network interfaces. If unspecified, the default interface will be used. These parameters are documented at: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html

  • dfs.namenode.rpc-bind-host -- controls binding for dfs.namenode.rpc-address
  • dfs.namenode.servicerpc-bind-host -- controls binding for dfs.namenode.servicerpc-address
  • dfs.namenode.http-bind-host -- controls binding for dfs.namenode.http-address
  • dfs.namenode.https-bind-host -- controls binding for dfs.namenode.https-address

The Datanodes also present various services; however, their Hostname Properties are established via self-name-lookup as follows:

  1. If present, the hadoop.security.dns.interface and hadoop.security.dns.nameserver parameters are used in the same way as documented in the Security Parameters section below, to determine the Datanode’s hostname. (If the rDNS lookup fails, it will fall back to /etc/hosts.) These two parameters are relatively new, available in Apache Hadoop 2.7.2 and later, and back-ported to HDP-2.2.9 and later, and HDP-2.3.4 and later.
  2. Otherwise if dfs.datanode.dns.interface and dfs.datanode.dns.nameserver parameters are present, they are used in the same way to determine the Datanode’s hostname. (If the rDNS lookup fails there, it willnot fallback to /etc/hosts.) This pair of parameters are (or will shortly be) deprecated, but many field installations currently use them.
  3. If neither set of parameters is present, then the Datanode calls Java InetAddress.getLocalHost().getCanonicalHostName() to determine its hostname.

The Datanode registers its hostname with the Namenode, and the Namenode also captures the Datanode’s IP address from which it registers. (This will be an IP address on the network interface the Datanode determined it should use to communicate with the Namenode.) The Namenode then provides both hostname and IP address to clients, for particular Datanodes, whenever a client operation is needed. Which one the client uses is determined by the dfs.client.use.datanode.hostname and dfs.datanode.use.datanode.hostname parameter values; see below.

Datanodes also have a set of dfs.datanode.${service}-address parameters. However, confusingly, these relate only to the Bind Address Property, not the Hostname Property, for Datanodes. They default to “0.0.0.0”, and it is okay to leave them that way unless it is desired to bind to only a single network interface.

Both Namenodes and Datanodes resolve their Kerberos _HOST bindings by using the hadoop.security.dns.interface and hadoop.security.dns.nameserver parameters, rather than their various *-address parameters. See below under “Security Parameters” for documentation.

Clients probably need to use Hostnames when connecting to Datanodes

By defaultHDFS clients connect to Datanodes using an IP address provided by the Namenode. In a multi-homed network, depending on the network configuration, the Datanode IP address known to the Namenode may be unreachable by the clients. The fix is letting clients perform their own DNS resolution of the Datanode hostname. The following setting enables this behavior. The reason this is optional is to allow skipping DNS resolution for clients on networks where the feature is not required (and DNS might not be a reliable/low-latency service).

  • dfs.client.use.datanode.hostname should be set to true when this this behavior is desired.

Datanodes may need to use Hostnames when connecting to other Datanodes:

More rarely, the Namenode-resolved IP address for a Datanode may be unreachable from other Datanodes. The fix is to force Datanodes to perform their own DNS resolution for inter-Datanode connections. The following setting enables this behavior.

  • dfs.datanode.use.datanode.hostname should be set to true when this this behavior is desired.

Client interface specification

Usually only one network is used for external clients to connect to HDFS servers. However, if the client can “see” the HDFS servers on more than one interface, you can control which interface or interfaces are used for HDFS communication by setting this parameter:

<property>
  <name>dfs.client.local.interfaces</name>
  <description>   A comma separated list of network interface names to use for data transfer
    between the client and datanodes. When creating a connection to read from or
    write to a datanode, the client chooses one of the specified interfaces at random
    and binds its socket to the IP of that interface. Individual names may be
    specified as either an interface name (eg "eth0"), a subinterface name (eg
    "eth0:0"), or an IP address (which may be specified using CIDR notation to match
    a range of IPs).
  </description>
</property>

Yarn & MapReduce

Service Parameters

The yarn and job history “*.address” parameters specify hostname to be used for each service and sub-service. This information is provided to clients via shared config files.

The optional yarn and job history “*.bind-host” parameters control bind address for sets of related sub-services as follows. Setting them to 0.0.0.0 causes the corresponding services to listen on all available network interfaces.

  • yarn.resourcemanager.bind-host -- controls binding for:
    • yarn.resourcemanager.address
    • yarn.resourcemanager.webapp.address
    • yarn.resourcemanager.webapp.https.address
    • yarn.resourcemanager.resource-tracker.address
    • yarn.resourcemanager.scheduler.address
    • yarn.resourcemanager.admin.address
  • yarn.nodemanager.bind-host-- controls binding for:
    • yarn.nodemanager.address
    • yarn.nodemanager.webapp.address
    • yarn.nodemanager.webapp.https.address
    • yarn.nodemanager.localizer.address
  • yarn.timeline-service.bind-host -- controls binding for:
    • yarn.timeline-service.address
    • yarn.timeline-service.webapp.address
    • yarn.timeline-service.webapp.https.address
  • mapreduce.jobhistory.bind-host -- controls binding for:
    • mapreduce.jobhistory.address
    • mapreduce.jobhistory.admin.address
    • mapreduce.jobhistory.webapp.address
    • mapreduce.jobhistory.webapp.https.address

Yarn and job history services resolve their Kerberos _HOST bindings by using the hadoop.security.dns.interface and hadoop.security.dns.nameserver parameters, rather than their various *-address parameters. See below under “Security Parameters” for documentation.

Security Parameters

Parameters for _HOST resolution

Each secure host needs to be able to perform _HOST resolution within its own parameter values, to correctly manage Kerberos principals. For this it must determine its own name. By default it will use the canonical name, which is associated with its first network interface. If it is desired to use the hostname associated with a different interface, use these two parameters in core-site.xml.

Core Hadoop services perform a self-name-lookup using these parameters. Many other Hadoop-related services define their own means of determining the Kerberos host substitution name.

These two parameters are relatively new, available in Apache Hadoop 2.7.2 and later, and back-ported to HDP-2.2.9 and later, and HDP-2.3.4 and later. In previous versions of HDFS, the dfs.datanode.dns.{interface,nameserver} parameter pair were used.

<property>
  <name>hadoop.security.dns.interface</name>
  <description>
    The name of the Network Interface from which the service should determine
    its host name for Kerberos login. e.g. “eth2”. In a multi-homed environment,
    the setting can be used to affect the _HOST substitution in the service
    Kerberos principal. If this configuration value is not set, the service
    will use its default hostname as returned by
    InetAddress.getLocalHost().getCanonicalHostName().
    Most clusters will not require this setting.
  </description>
</property>
<property>
  <name>hadoop.security.dns.nameserver</name>
  <description>
    The hostname or IP address of the DNS name server which a service Node
    should use to determine its own hostname for Kerberos Login, via a
    reverse-DNS query on its own IP address associated with the interface
    specified in hadoop.security.dns.interface.
    Requires hadoop.security.dns.interface.
    Most clusters will not require this setting.
  </description>
</property>

Parameters for Security Token service host resolution

In secure multi-homed environments, the following parameter will need to be set to false (it is true by default) on both cluster servers and clients (see HADOOP-7733), in core-site.xml. If it is not set correctly, the symptom will be inability to submit an application to YARN from an external client (with error “client host not a member of the Hadoop cluster”), or even from an in-cluster client if server failover occurs.

<property>
  <name>hadoop.security.token.service.use_ip</name>
  <description>
    Controls whether tokens always use IP addresses. DNS changes will not
    be detected if this option is enabled. Existing client connections that
    break will always reconnect to the IP of the original host. New clients
    will connect to the host's new IP but fail to locate a token.
    Disabling this option will allow existing and new clients to detect
    an IP change and continue to locate the new host's token.
    Defaults to true, for efficiency in single-homed systems.
    If setting to false, be sure to change in both clients and servers.
    Most clusters will not require this setting.
  </description>
</property>

Secure Hostname Lookup when /etc/hosts is used instead of DNS service

To prevent a security loophole described in http://ietf.org/rfc/rfc1535.txt, the Hadoop Kerberos libraries in Secure environments with hadoop.security.token.service.use_ip set to false will sometimes do a DNS lookup on a special form of the hostname FQDN that has a “.” appended. For instance, instead of using the FQDN “host1.lab.mycompany.com” it will use the form “host1.lab.mycompany.com.”. [sic] This is called the “absolute rooted” form of the hostname, sometimes written FQDN+".". For an explanation of the need for this form, see the above referenced RFC 1535.

If DNS services are set up as required for multi-homed Hadoop clusters, no special additional setup is needed for this concern, because all modern DNS servers already understand the absolute rooted form of hostnames. However, /etc/hosts processing of name lookup does not. If DNS is not available and /etc/hosts is being used in lieu, then the following additional setup is required:

In the /etc/hosts file of all cluster servers and client hosts, for the entries related to all hostnames of cluster servers, add an additional alias consisting of the FQDN+".". For example, a line in /etc/hosts that would normally look like

39.0.64.1 selena1.labs.mycompany.com selena1 hadoopvm8-1

should be extended to look like

39.0.64.1 selena1.labs.mycompany.com selena1.labs.mycompany.com. selena1 hadoopvm8-1

If the absolute rooted hostname is not included in /etc/hosts when required, the symptom will be “Unknown Host” error (see https://wiki.apache.org/hadoop/UnknownHost#UsingHostsFiles).

HBase

The documentation of multi-homing related parameters in the HBase docs is largely lacking, and in some cases incorrect. The following is based on the most up-to-date information available as of January 2016.

Credit: Josh Elser worked out the correct information for this document, by studying the source code. Thanks, Josh!

Service Parameters

Both HMaster and RegionServers advertise their hostname with the cluster’s Zookeeper (ZK) nodes. These ZK nodes are found using the hbase.zookeeper.quorum parameter value. The ZK server names found here are looked up via DNS on the default interface, by all HBase ZK clients. (Note the DNS server on the default interface may return an address for the ZK node that is on some other network interface.)

The HMaster determines its IP address on the interface specified in the hbase.master.dns.interface parameter, then does a reverse-DNS query on the DNS server at hbase.master.dns.interface. However, it does not use the resulting hostname directly. Instead, it does a forward-DNS lookup of this name on the default (or bind address) network interface, followed by a reverse-DNS lookup of the resulting IP address on the same interface’s DNS server. The resulting hostname is what the HMaster actually advertises with Zookeeper.

  • This sequence may be an attempt to correctly handle environments that don’t adhere to the “one consistent hostname on all interfaces” rule.
  • Regardless, if the “one consistent hostname on all interfaces” rule has been followed, it should be easy to predict the resulting Hostname Property value.

Each RegionServer obtains the HMaster’s hostname from Zookeeper, and looks it up with DNS on the default network. It then registers itself with the HMaster. When it does so, the HMaster captures the IP address of the RegionServer, and does a reverse-DNS lookup on the default (or HMaster bind address) interface to obtain a hostname for the RegionServer. It then delivers this hostname back to the RegionServer, which in turn advertises it on Zookeeper.

  • Thus, no config parameters are relevant to the RegionServer’s Hostname Property.
  • Again, it is important that each RegionServer’s hostname be the same on all interfaces if different clients are to use different interfaces, because the HMaster looks it up on only one.

Regardless of the above determination of hostname, the Bind Address Property may change the IP address binding of each service, to a non-default interface, or to all interfaces (0.0.0.0). The parameters for Bind Address Property are as follows, for each service:

HMaster RPC hbase.master.ipc.address
HMaster WebUI hbase.master.info.bindAddress
RegionServer RPC hbase.regionserver.ipc.address
RegionServer WebUI hbase.regionserver.info.bindAddress
REST RPC hbase.rest.host
REST WebUI hbase.rest.info.bindAddress
Thrift1 RPC hbase.regionserver.thrift.ipaddress
Thrift2 RPC hbase.thrift.info.bindAddress

Notes:

  • The WebUI services do not advertise their hostnames separately from the RPC services.
  • The REST and Thrift services do not advertise their hostnames at all; their clients must simply know “from human interaction” where to find the corresponding servers.

Finally, services and clients use a dns.{interface,nameserver} parameter pair to determine their Kerberos Host substitution names, as follows. See hadoop.security.dns.interface and hadoop.security.dns.nameserver in the “Security Parameters” section for documentation of the semantics of these parameter pairs.

HMaster RPC hbase.master.dns.{interface,nameserver}
HMaster WebUI N/A
RegionServer RPC hbase.regionserver.dns.{interface,nameserver}
RegionServer WebUI N/A
REST RPC N/A
REST WebUI N/A
Thrift1 RPC hbase.thrift.dns.{interface,nameserver}
Thrift2 RPC hbase.thrift.dns.{interface,nameserver}
HBase Client hbase.client.dns.{interface,nameserver}

Lastly, there is a bit of complexity with ZK node names, if the HBase-embedded ZK quorum is used. When HBase is instantiating an embedded set of ZK nodes, it is necessary to correlate the node hostnames with the integer indices in the “zoo.cfg” configuration. As part of this, ZK has a hbase.zookeeper.dns.{interface,nameserver} parameter pair. Each ZK node uses these in the usual way to determine its own hostname, which is then fed into the embedded ZK configuration process.

Superseded Parameters

Superseded by hadoop.security.dns.interface and hadoop.security.dns.nameserver, in Apache Hadoop 2.7.2 and later (back-ported to HDP-2.2.9 and later, and HDP-2.3.4 and later):

  • dfs.datanode.dns.interface
  • dfs.datanode.dns.nameserver

Superceded by Namenode HA:

  • dfs.namenode.secondary.http-address and https-address
    • only used in non-HA systems that have two NN masters yet have not configured HA[4]

Superceded hdfs/yarn/mapreduce parameters used by older systems:

  • yarn.nodemanager.webapp.bind-host: "0.0.0.0",
  • yarn.nodemanager.webapp.https.bind-host: "0.0.0.0",
    • just use yarn.nodemanager.bind-host: "0.0.0.0"
  • yarn.resourcemanager.admin.bind-host: "0.0.0.0",
  • yarn.resourcemanager.resource-tracker.bind-host: "0.0.0.0",
  • yarn.resourcemanager.scheduler.bind-host: "0.0.0.0",
    • just use yarn.resourcemanager.bind-host: "0.0.0.0"
  • yarn.timeline-service.webapp.bind-host: "0.0.0.0",
  • yarn.timeline-service.webapp.https.bind-host: "0.0.0.0",
    • just use yarn.timeline-service.bind-host: "0.0.0.0"
  • mapreduce.jobhistory.admin.bind-host: "0.0.0.0",
  • mapreduce.jobhistory.webapp.bind-host: "0.0.0.0",
  • mapreduce.jobhistory.webapp.https.bind-host: "0.0.0.0",
    • just use mapreduce.jobhistory.bind-host: "0.0.0.0"

Older historical items superceded:

  • dfs.namenode.master.name
    • predecessor of namenode bind-host parameters, and/or rule of “one consistent hostname on all interfaces”.
  • dfs.datanode.hostname
    • predecessor of “one consistent hostname on all interfaces” rule.

Note

Regarding the blog at: http://hortonworks.com/blog/multihoming-on-hadoop-yarn-clusters/

In “Taking a Test Drive” section of this blog it suggests using EC2 with Public and Private networks. We now know this doesn’t work, since by default in AWS, Public addresses are just NAT’ed to Private. Instead, one must use VPC capability with 2 Private networks. Should fix this document.


[1] Unless you really, really know what you’re doing, and want to work through every little nuance of the semantics of all the parameters. About the only situation in which you might get away with using IP addresses for server addressing is if ALL services are being bound to specific single interfaces, not using 0.0.0.0 at all.

[2] Having one consistent hostname on all interfaces is not a strict technical requirement of Hadoop. But it is a major simplifying assumption, without which the Hadoop administrator is likely to land in a morass of very complex and mutually conflicting parameter choices. For this reason, Hortonworks only supports using one consistent hostname per server on all interfaces used by Hadoop.

[3] In non-virtual Linux systems, the default network interface is typically “eth0” or “eth0:0”, but this conventional naming can be changed in system network configuration or with network virtualization.

[4] Not sure why the resources would be committed for a secondary NN, but not set up HA. However, this is still a supported configuration.

18,176 Views
Comments
avatar
Explorer

This is a great article.

I had trouble with Ambari server outside the private network (where all the master and data nodes are). There are two approaches to fix this. Ambari server can be brought into the private network (recommended) or the network on which agents are listening to, should be made primary. Basically, the hostname that the ambari server resolves to, should be what the agents are communicating to.

avatar
Contributor

The article doesn't indicate this, so for reference, the listed HDFS settings do not exist by default. These settings, as shown below, need to go into hdfs-site.xml, which is done in Ambari by adding fields under "Custom hdfs-site".

dfs.namenode.rpc-bind-host=0.0.0.0
dfs.namenode.servicerpc-bind-host=0.0.0.0
dfs.namenode.http-bind-host=0.0.0.0
dfs.namenode.https-bind-host=0.0.0.0

Additionally, I found that after making this change, both NameNodes under HA came up as stand-by; the article at https://community.hortonworks.com/articles/2307/adding-a-service-rpc-port-to-an-existing-ha-cluste.h... got me the missing step of running a ZK format.

I have not tested the steps below against a Production cluster and if you foolishly choose to follow these steps, you do so at a very large degree of risk (you could lose all of the data in your cluster). That said, this worked for me in a non-Prod environment:

01) Note the Active NameNode.

02) In Ambari, stop ALL services except for ZooKeeper.

03) In Ambari, make the indicated changes to HDFS.

04) Get to the command line on the Active NameNode (see Step 1 above).

05) At the command line you opened in Step 4, run: `sudo -u hdfs hdfs zkfc -formatZK`

06) Start the JournalNodes.

07) Start the zKFCs.

08) Start the NameNodes, which should come up as Active and Standby. If they don't, you're on your own (see the "high risk" caveat above).

09) Start the DataNodes.

10) Restart / Refresh any remaining HDFS components which have stale configs.

11) Start the remaining cluster services.

It would be great if HWX could vet my procedure and update the article accordingly (hint, hint).