Member since
07-30-2019
111
Posts
186
Kudos Received
35
Solutions
09-08-2021
02:43 PM
This whole series is really insightful and helpful!
... View more
02-27-2018
10:20 PM
2 Kudos
Building Apache Tez with Apache Hadoop 2.8.0 or later fails due to client/server jar separation in Hadoop [1]. The build fails with the following error: [ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /src/tez/tez-api/src/test/java/org/apache/tez/client/TestTezClientUtils.java:[48,30] cannot find symbol
symbol: class DistributedFileSystem
location: package org.apache.hadoop.hdfs
[ERROR] /src/tez/tez-api/src/test/java/org/apache/tez/client/TestTezClientUtils.java:[680,50] cannot find symbol
symbol: class DistributedFileSystem
location: class org.apache.tez.client.TestTezClientUtils
[ERROR] /src/tez/ecosystem/tez/tez-api/src/test/java/org/apache/tez/common/TestTezCommonUtils.java:[62,42] cannot access org.apache.hadoop.hdfs.DistributedFileSystem To get Tez to compile successfully, you will need to use the new hadoop28 profile introduced by TEZ-3690 [2]. E.g. here is how you compile Tez with Apache Hadoop 3.0.0: mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true -Phadoop28 -Dhadoop.version=3.0.0 References: 1. HDFS-6200: Create a separate jar for hdfs-client 2. TEZ-3690: Tez on hadoop 3 build failed due to hdfs client/server jar separation.
... View more
Labels:
06-22-2017
07:46 PM
7 Kudos
The HDFS NameNode ensures that each block is sufficiently replicated. When it detects the loss of a DataNode, it instructs remaining nodes to maintain adequate replication by creating additional block replicas.
For each lost replica, the NameNode picks a (source, destination) pair where the source is an available DataNode with another replica of the block and the destination is the target for the new replica. The re-replication work can be massively parallelized in large clusters since the replica distribution is randomized.
In this article, we estimate a lower bound for the recovery time. Simplifying Assumptions
The maximum IO bandwidth of each disk is 100MB/s (reads + writes). This is true for the vast majority of clusters that use spinning disks. The aggregate IO capacity of the cluster is limited by disk and not the network. This is not always true but helps us establish lower bounds without discussing network topologies. Block replicas are uniformly distributed across the cluster and disk usage is uniform. True if the HDFS balancer was run recently.
Theoretical Lower Bound
Let's assume the cluster has n nodes. Each each node has p disks, and the usage of each disk is c TeraBytes. The data usage of each node is thus (p ⋅ c) TB.
The amount of data data transfer needed for recovery is twice the capacity of the lost DataNode as each replica must be read once from a source disk and written once to the target disk.
Data transfer during recovery = 2 ⋅ (Node Capacity)
= (2 ⋅ p ⋅ c) TB
= (2 ⋅ p ⋅ c ⋅ 1,000,000) MB
The re-replication rate is the limited by the available aggregate IO bandwidth in the cluster: Cluster aggregate IO bandwidth = (Disk IO bandwidth) ⋅ (Number of disks)
= (100 ⋅ n ⋅ p) MB/s Thus Minimum Recovery Time = (Data transfer during recovery) / (Cluster aggregate IO bandwidth)
= (2 ⋅ p ⋅ c ⋅ 1,000,000) / (100 ⋅ n ⋅ p)
= (20,000 ⋅ c/n) seconds.
where: c = Mean usage of each disk in TB.
n = Number of DataNodes in the cluster. This is the absolute best case with no other load, no network bandwidth limits, and a perfectly efficient scheduler.
E.g. In a 100 node cluster where each disk has 4TB of data, recovery from the loss of a DataNode must take at least (20,000 ⋅ 4) / 100 = 800 seconds or approximately 13 minutes.
Clearly, the cluster size bounds the recovery time. Disk capacities being equal, a 1000 node cluster can recover 10x faster than a 100 node cluster. A More Practical Lower Bound
The theoretical lower bound assumes that block re-replications can be instantaneously scheduled across the cluster. It also assumes that all cluster IO capacity is available for re-replication whereas in practice application reads and writes also consume IO capacity. The NameNode schedules 2 outbound replication streams per DataNode, per heartbeat interval to throttle re-replication traffic. This throttle allows DataNodes to remain responsive to applications. The throttle can be adjusted via the configuration setting dfs.namenode.replication.max-streams. Let's call this m and the heartbeat interval h.
Also let's assume the mean block size in the cluster is b MB. Then: Re-replication Rate = Blocks Replicated cluster-wide per heartbeat interval
= (n ⋅ m/h) Blocks/s
The total number of blocks to be re-replicated is the capacity of the lost node divided by the mean block size. Number of Blocks Lost = (p ⋅ c) TB / b MB
= (p ⋅ c ⋅ 1,000,000/b).
Thus:
Recovery Time = (Number of Blocks Lost) / (Re-replication Rate)
= (p ⋅ c ⋅ 1,000,000) / (b ⋅ n ⋅ m/h)
= (p ⋅ c ⋅ h ⋅ 1,000,000) / (b ⋅ n ⋅ m) seconds.
where: p = Number of disks per node.
c = Mean usage of each disk in TB.
h = Heartbeat interval (default = 3 seconds).
b = Mean block size in MB.
n = Number of DataNodes in the cluster.
m = dfs.namenode.replication.max-streams (default = 2)
Simplifying by plugging in the defaults for h and m, we get
Minimum Recovery Time (seconds) = (p ⋅ c ⋅ 1,500,000) / (b ⋅ n)
E.g. in the same cluster, assuming the mean block size is 128MB and each node has 8 disks, the practical lower bound on recovery time will be 3,750 seconds or ~1 hour. Reducing the Recovery Time
The recovery time can be reduced by:
Increasing dfs.namenode.replication.max-streams. However, setting this value too high can affect cluster performance. Note that increasing this value beyond 4 must be evaluated carefully and also requires changing the safeguard upper limit via dfs.namenode.replication.max-streams-hard-limit.
Using more nodes with smaller disks. Total cluster capacity remaining the same, a cluster with more nodes and smaller disks will recover faster.
Avoiding predominantly small blocks.
... View more
Labels:
07-26-2016
12:16 AM
1 Kudo
You may run into slow Hadoop service start on your OS X development laptop. You can check this by opening up your service logs and looking for large (5-10 second) gaps between successive log entries at startup. Diagnosis It often manifests as test failures for MiniDFSCluster-based tests that use short timeouts (<10 seconds). Here is an example from a NameNode log file with a 5 second stall at startup. 2016-07-25 14:57:37,982 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2016-07-25 14:57:43,060 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). Another 5 second stall during NameNode startup. 2016-07-25 14:57:48,790 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Append Enabled: true
2016-07-25 14:57:53,914 INFO org.apache.hadoop.util.GSet: Computing capacity for map INodeMap Resolution If you see this behavior you are likely running into an OS X bug. The fix is to put all your entries for localhost on one line as described in this StackOverflow answer. i.e. Make sure your /etc/hosts file has something like this: # Replace myhostname with the hostname of your laptop.
#
127.0.0.1 localhost myhostname myhostname.local myhostname.Home Instead of this: 127.0.0.1 localhost myhostname.local
127.0.0.1 myhostname myhostname.Home
Root Cause The root cause of this problem appears to be a long delay when looking up the local host name with InetAddress.getLocalHost. The following code is a minimal repro of this problem on affected systems. import java.net.*;
class Lookup {
public static void main(String[] args) throws Exception {
System.out.println(InetAddress.getLocalHost().getCanonicalHostName());
}
} This program can take over 5 seconds to execute on an affected machine. Verified on OS X 10.10.5 with Oracle JDK 1.8.0_91 and 1.7.0_79.
... View more
Labels:
06-23-2017
04:29 PM
Thanks for the the very useful article. Will there be any follow ups coming? I am particularly interested in the changes that came about from https://issues.apache.org/jira/browse/HDFS-8818 and how settings like dfs.balancer.moverThreads needs to be increased from default when balancing a large number of unbalanced nodes. (e.g the setting used in the comment in HDFS-8188 herehttps://issues.apache.org/jira/browse/HDFS-8818?focusedCommentId=15997429&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15997429)
... View more
11-08-2018
06:28 PM
Hi @Arpit Agarwal, In HADOOP-10597 the earliest version mentioned with the fix is 2.7.4. Is the HDP version recommended in the post correct? I'm running HDP 2.6.1 with Hadoop 2.7.3 and I would like to confirm if this parameter can be enabled or not. Thanks!
... View more
07-07-2016
04:13 AM
16 Kudos
Introduction
This article continues where part 1 left off. It describes a few more configuration settings that can be enabled in CDP, HDP, CDH, or Apache Hadoop clusters to help the NameNode scale better.
Audience
This article is for Hadoop administrators who are familiar with HDFS and its components. If you are using Ambari or Cloudera Manager you should know how to manage services and configurations. It is assumed that you have read part 1 of the article.
RPC Handler Count
The Hadoop RPC server consists of a single RPC queue per port and multiple handler (worker) threads that dequeue and process requests. If the number of handlers is insufficient, then the RPC queue starts building up and eventually overflows. You may start seeing task failures and eventually job failures and unhappy users.
It is recommended that the RPC handler count be set to 20 * log2(Cluster Size) with an upper limit of 200.
e.g. for a 250 node cluster you should initialize this to 20 * log2(250) = 160. The RPC handler count can be configured with the following setting in hdfs-site.xml.
<property>
<name>dfs.namenode.handler.count</name>
<value>160</value>
</property>
This heuristic is from the excellent Hadoop Operations book. If you are using Ambari to manage your cluster this setting can be changed via a slider in the Ambari Server Web UI. If you're using Cloudera Manager you can search for property name "dfs.namenode.service.handler.count" under the HDFS configuration page and adjust the value.
Service RPC Handler Count
Pre-requisite: If you have not enabled the Service RPC port already, please do so first as described here.
There is no precise calculation for the Service RPC handler count however the default value of 10 is too low for most production clusters. We have often seen this initialized to 50% of the dfs.namenode.handler.count in busy clusters and this value works well in practice.
e.g. for the same 250 node cluster you would initialize the service RPC handler count with the following setting in hdfs-site.xml.
<property>
<name>dfs.namenode.service.handler.count</name>
<value>80</value>
</property>
DataNode Lifeline Protocol
The Lifeline protocol is a feature recently added by the Apache Hadoop Community (see Apache HDFS Jira HDFS-9239). It introduces a new lightweight RPC message that is used by the DataNodes to report their health to the NameNode. It was developed in response to problems seen in some overloaded clusters where the NameNode was too busy to process heartbeats and spuriously marked DataNodes as dead.
For a non-HA cluster, the feature can be enabled with the following setting in hdfs-site.xml (replace mynamenode.example.com with the hostname or IP address of your NameNode). The port number can be different too.
<property>
<name>dfs.namenode.lifeline.rpc-address</name>
<value>mynamenode.example.com:8050</value>
</property>
For an HA cluster, the lifeline RPC port can be enabled with settings like the following, replacing mycluster, nn1 and nn2 appropriately.
<property>
<name>dfs.namenode.lifeline.rpc-address.mycluster.nn1</name>
<value>mynamenode1.example.com:8050</value>
</property>
<property>
<name>dfs.namenode.lifeline.rpc-address.mycluster.nn2</name>
<value>mynamenode2.example.com:8050</value>
</property>
Additional lifeline protocol settings are documented in the HDFS-9239 release note but these can be left at their default values for most clusters.
Note: Changing the lifeline protocol settings requires a restart of the NameNodes, DataNodes and ZooKeeper Failover Controllers to take full effect. If you have NameNode HA setup, you can restart the NameNodes one at a time followed by a rolling restart of the remaining components to avoid cluster downtime.
Conclusion
That is it for Part 2. In Part 3 of this article, we will explore how to enable two new HDFS features to help your NameNode scale better.
... View more
Labels:
10-04-2018
09:53 AM
@Elias Abacioglu You can refer below guidance for configuring service port. https://community.hortonworks.com/articles/223817/how-do-you-enable-namenode-service-rpc-port-withou.html
... View more
10-28-2015
05:22 PM
4 Kudos
Configuring a separate service RPC port can improve the responsiveness of the NameNode by allowing DataNode and client requests to be processed via separate RPC queues. Adding a service RPC port to an HA cluster with automatic failover via ZKFCs requires a workaround step. The steps to configure a service RPC port are as follows: Add the following settings to hdfs-site.xml. <property>
<name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
<value>nn1.example.com:8040</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.mycluster.nn2</name>
<value>nn2.example.com:8040</value>
</property> 2. Restart NameNodes. 3. Stop the ZKFC processes on both NameNodes 4. Run the following command to reset the ZKFC state in ZooKeeper hdfs zkfc -formatZK Without this step you will see the following exception after ZKFC restart. java.lang.RuntimeException: Mismatched address stored in ZK for NameNode 5. Restart the ZKFCs.
... View more
Labels: