Member since: 09-28-2015
Posts: 51
Kudos Received: 32
Solutions: 17
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1424 | 04-13-2018 11:36 PM
 | 3935 | 04-13-2018 11:03 PM
 | 1188 | 04-13-2018 10:56 PM
 | 3686 | 04-10-2018 03:12 PM
 | 4695 | 02-13-2018 07:23 PM
04-19-2017
07:12 PM
This is likely another instance of the overflow fixed by HDFS-11608 (https://issues.apache.org/jira/browse/HDFS-11608), which occurs when the block size is set too large (> 2 GB).
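As an illustrative sketch of why 2 GB is the boundary: that figure is the maximum value of a signed 32-bit integer, and the 3 GB block size below is a hypothetical example, not a value from the original thread.

```shell
# Largest value a signed 32-bit int can hold; block sizes above this
# trip the kind of overflow fixed by HDFS-11608.
max_int32=$(( (1 << 31) - 1 ))            # 2147483647 bytes, ~2 GB
block_size=$(( 3 * 1024 * 1024 * 1024 ))  # hypothetical 3 GB dfs.blocksize
echo "$max_int32"
if [ "$block_size" -gt "$max_int32" ]; then
  echo "block size exceeds 32-bit range"
fi
```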
02-22-2017
07:02 PM
1 Kudo
Can you try running "export HADOOP_ROOT_LOGGER=TRACE,console" before running "hdfs dfs -ls /"? That will reveal more end-to-end RPC traces pointing to the root cause.
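Put together, the session looks like this (the guard on `hdfs` is only so the snippet is safe to paste on a machine without a Hadoop client installed):

```shell
# Raise Hadoop client logging to TRACE for this shell session only.
export HADOOP_ROOT_LOGGER=TRACE,console
# The next client command then prints end-to-end RPC traces to the console.
command -v hdfs >/dev/null 2>&1 && hdfs dfs -ls / || true
```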
08-19-2016
08:14 PM
spaceConsumed = length * replicationFactor
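For example, with an assumed 128 MB file stored at replication factor 3 (both values are illustrative):

```shell
# spaceConsumed = length * replicationFactor (all values in bytes)
length=$(( 128 * 1024 * 1024 ))   # a 128 MB file (illustrative)
replication=3
space_consumed=$(( length * replication ))
echo "$space_consumed"            # 402653184 bytes, i.e. 384 MB
```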
08-10-2016
07:52 PM
1 Kudo
Based on the error below, check that your (single) datanode is running. If it is, ensure it is not listed in the file referenced by dfs.hosts.exclude in hdfs-site.xml and that it has enough space to store block files.

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: org.apache.hadoop.ipc.RemoteException: File /email/headers/.506170560796063 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
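One way to confirm whether an exclude file is configured is to pull the dfs.hosts.exclude value out of hdfs-site.xml. The sketch below writes a sample config to /tmp so it is self-contained; in practice, point `conf` at your real /etc/hadoop/conf/hdfs-site.xml (the sample path and value are illustrative).

```shell
# Sample hdfs-site.xml for illustration; use your real config file in practice.
conf=/tmp/hdfs-site-sample.xml
cat > "$conf" <<'EOF'
<configuration>
  <property>
    <name>dfs.hosts.exclude</name>
    <value>/etc/hadoop/conf/dfs.exclude</value>
  </property>
</configuration>
EOF
# Extract the exclude-file path; any datanode listed in that file is
# excluded from block placement, producing errors like the one above.
grep -A1 '<name>dfs.hosts.exclude</name>' "$conf" \
  | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
```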
07-11-2016
10:46 PM
2 Kudos
@Felix Albani You will need to provide the configuration file location with the --config parameter, as Ambari does. E.g. hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start datanode
07-06-2016
08:40 PM
4 Kudos
We have seen many incidents of an overloaded HDFS namenode due to 1) misconfigurations or 2) "bad" MR jobs or Hive queries that create a large number of RPC requests in a short period of time. Quite a few features have been introduced in HDP 2.3/2.4 to protect the HDFS namenode. This article summarizes the deployment steps for these features, along with an incomplete list of known issues and possible solutions.
- Enable Async Audit Logging
- Dedicated Service RPC Port
- Dedicated Lifeline RPC Port for HA
- Enable FairCallQueue on Client RPC Port
- Enable RPC Client Backoff on Client RPC Port
- Enable RPC Caller Context to track the "bad" jobs
- Enable Response time based backoff with DecayedRpcScheduler
- Check JMX for namenode client RPC call queue length and average queue time
- Check JMX for namenode DecayRpcScheduler when FCQ is enabled
- NNtop (HDFS-6982)

1. Enable Async Audit Logging

Enable async audit logging by setting "dfs.namenode.audit.log.async" to true in hdfs-site.xml. This can minimize the impact of audit log I/O on namenode performance.

<property>
  <name>dfs.namenode.audit.log.async</name>
  <value>true</value>
</property>

2. Dedicated Service RPC Port

Configuring a separate service RPC port can improve the responsiveness of the NameNode by allowing DataNode and client requests to be processed via separate RPC queues. DataNodes and all other services should connect to the new service RPC address, while clients connect to the well-known address specified by dfs.namenode.rpc-address. Adding a service RPC port to an HA cluster with automatic failover via ZKFCs (with or without Kerberos) requires some additional steps, as follows:

1. Add the following settings to hdfs-site.xml.

<property>
<name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
<value>nn1.example.com:8040</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.mycluster.nn2</name>
<value>nn2.example.com:8040</value>
</property>

2. If the cluster is not Kerberos enabled, skip this step. If the cluster is Kerberos enabled, create two new hdfs_jaas.conf files for nn1 and nn2 and copy them to /etc/hadoop/conf/hdfs_jaas.conf, respectively.

nn1:

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  useTicketCache=false
  keyTab="/etc/security/keytabs/nn.service.keytab"
  principal="nn/c6401.ambari.apache.org@EXAMPLE.COM";
};

nn2:

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  useTicketCache=false
  keyTab="/etc/security/keytabs/nn.service.keytab"
  principal="nn/c6402.ambari.apache.org@EXAMPLE.COM";
};

Add the following to hadoop-env.sh:

export HADOOP_NAMENODE_OPTS="-Dzookeeper.sasl.client=true -Dzookeeper.sasl.client.username=zookeeper -Djava.security.auth.login.config=/etc/hadoop/conf/hdfs_jaas.conf -Dzookeeper.sasl.clientconfig=Client ${HADOOP_NAMENODE_OPTS}"

3. Restart the NameNodes.

4. Restart the DataNodes so they connect to the new NameNode service RPC port instead of the NameNode client RPC port.

5. Stop the ZKFC processes on both NameNodes.

6. Run the following command to reset the ZKFC state in ZooKeeper:

hdfs zkfc -formatZK

Known issues:

1. Without step 6, you will see the following exception after ZKFC restart:

java.lang.RuntimeException: Mismatched address stored in ZK for NameNode

2. Without step 2 in a Kerberos-enabled HA cluster, you will see the following exception when running step 6:
16/03/23 03:30:53 INFO ha.ActiveStandbyElector: Recursively deleting /hadoop-ha/hdp64ha from ZK...
16/03/23 03:30:53 ERROR ha.ZKFailoverController: Unable to clear zk parent znode
java.io.IOException: Couldn't clear parent znode /hadoop-ha/hdp64ha
at org.apache.hadoop.ha.ActiveStandbyElector.clearParentZNode(ActiveStandbyElector.java:380)
at org.apache.hadoop.ha.ZKFailoverController.formatZK(ZKFailoverController.java:267)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:212)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:442)
at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:183)
Caused by: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /hadoop-ha/hdp64ha
at org.apache.zookeeper.KeeperException.create(KeeperException.java:125)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
at org.apache.zookeeper.ZKUtil.deleteRecursive(ZKUtil.java:54)
at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:375)
at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:372)
at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1041)
at org.apache.hadoop.ha.ActiveStandbyElector.clearParentZNode(ActiveStandbyElector.java:372)
... 11 more

3. Dedicated Lifeline RPC Port for HA

HDFS-9311 allows using a separate RPC address to isolate health checks and liveness monitoring from the client RPC port, which could be exhausted by "bad" jobs. Here is an example of configuring this feature in an HA cluster.

<property>
<name>dfs.namenode.lifeline.rpc-address.mycluster.nn1</name>
<value>nn1.example.com:8050</value>
</property>
<property>
<name>dfs.namenode.lifeline.rpc-address.mycluster.nn2</name>
<value>nn2.example.com:8050</value>
</property>
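To watch the effect of these protections, the checklist above mentions polling the NameNode's JMX for the RPC call queue length. The snippet below parses a hard-coded sample of that JSON so it runs anywhere; in practice you would feed it something like `curl -s 'http://nn1.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020'` (host, port, bean name, and the metric values are illustrative assumptions).

```shell
# Sample of the JMX JSON a NameNode returns for its client RPC port.
sample='{"beans":[{"name":"Hadoop:service=NameNode,name=RpcActivityForPort8020","CallQueueLength":12,"RpcQueueTimeAvgTime":3.5}]}'
# Pull out CallQueueLength; a persistently high value means client
# requests are backing up in the RPC queue.
echo "$sample" | sed -n 's/.*"CallQueueLength":\([0-9]*\).*/\1/p'
```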
06-07-2016
09:06 PM
For misconfigurations like the cases above, you will find an INFO-level log like the following: "The configured checkpoint interval is 0 minutes. Using an interval of XX (e.g., 60) minutes that is used for deletion instead"
06-07-2016
09:01 PM
1 Kudo
Yes. When fs.trash.checkpoint.interval is 0 or unset, fs.trash.interval is used as the checkpoint interval.
Also, fs.trash.checkpoint.interval should always be set smaller than fs.trash.interval. If it is not, fs.trash.interval is used as the checkpoint interval, just as in the case above.
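The fallback rules above can be sketched as ordinary shell arithmetic (values are in minutes; 360 and 0 are illustrative, not defaults):

```shell
fs_trash_interval=360           # e.g. 6 hours
fs_trash_checkpoint_interval=0  # 0 or unset
# If the checkpoint interval is 0/unset, or larger than fs.trash.interval,
# fs.trash.interval is used as the effective checkpoint interval.
if [ "$fs_trash_checkpoint_interval" -eq 0 ] \
   || [ "$fs_trash_checkpoint_interval" -gt "$fs_trash_interval" ]; then
  effective=$fs_trash_interval
else
  effective=$fs_trash_checkpoint_interval
fi
echo "$effective"   # 360
```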
06-02-2016
11:43 PM
2 Kudos
This looks like a network issue preventing your datanodes from handling the replication workload. Can you check the ifconfig output for the MTU of all datanodes and ensure it is configured consistently? Below is a short list from a tutorial on network best practices by @mjohnson, which could help you troubleshoot: https://community.hortonworks.com/articles/8563/typical-hdp-cluster-network-configuration-best-pra.html

- Make certain all members of the HDP cluster have passwordless SSH configured. Basic heartbeats (typically 3x/second) and administrative commands generated by the Hadoop cluster are infrequent and transfer only small amounts of data, except in extremely large cluster deployments.
- Keep in mind that NAS disks will require more network utilization than plain old disk drives resident on the data node.
- Make certain both fully qualified host names and host aliases are defined and resolvable by all nodes within the cluster.
- Ensure the network interface is consistently defined for all members of the Hadoop cluster (i.e., MTU settings should be consistent).
- Look into defining the MTU for all interfaces on the cluster to support jumbo frames (typically MTU=9000), but only do this if all nodes and switches support this functionality. Inconsistent or undefined MTU configurations can produce serious network problems.
- Disable Transparent Huge Page compaction for all nodes on the cluster.
- Make certain all of the HDP cluster's network connections are monitored for collisions and lost packets. Have the network administration team tune the network as required to address any issues identified as part of the network monitoring.
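A quick way to compare MTU settings across datanodes is to read them from sysfs on each host (a Linux-only sketch; run it on every node and diff the output):

```shell
# Print "<interface> <mtu>" for every network interface on this host.
for ifc in /sys/class/net/*; do
  printf '%s %s\n' "$(basename "$ifc")" "$(cat "$ifc/mtu")"
done
```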
11-12-2015
09:38 PM
2 Kudos
You can use the hotswap feature introduced by HDFS-1362 to replace slave-node disks without decommissioning/recommissioning (or restarting). Ambari may not support this yet, but you can always do it with the hdfs command line. More details can be found from this link.
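The command-line flow looks roughly like the following. This is a sketch: the datanode address and port are hypothetical, `dfsadmin -reconfig` requires an HDFS release that includes HDFS-1362, and the `command -v` guard only keeps the snippet harmless on machines without a Hadoop client.

```shell
# 1) Edit dfs.datanode.data.dir in the datanode's hdfs-site.xml to add or
#    remove the disk (the new path is up to you).
# 2) Ask the running datanode to pick up the change without a restart:
DN=dn1.example.com:50020   # hypothetical datanode IPC address
command -v hdfs >/dev/null 2>&1 && hdfs dfsadmin -reconfig datanode "$DN" start || true
# 3) Poll until the reconfiguration reports completion:
command -v hdfs >/dev/null 2>&1 && hdfs dfsadmin -reconfig datanode "$DN" status || true
```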