Member since: 09-28-2015

51 Posts | 32 Kudos Received | 17 Solutions

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
|  | 1758 | 04-13-2018 11:36 PM |
|  | 4509 | 04-13-2018 11:03 PM |
|  | 1612 | 04-13-2018 10:56 PM |
|  | 4154 | 04-10-2018 03:12 PM |
|  | 5794 | 02-13-2018 07:23 PM |
			
    
	
		
		
04-19-2017 07:12 PM

It is likely another instance of HDFS-11608, where the block size is set too large (> 2GB). The overflow issue was recently fixed by https://issues.apache.org/jira/browse/HDFS-11608.
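
As a quick sanity check (a sketch, not from the original answer; the file path is a placeholder), you can confirm that the block size in use stays below the 2 GB limit implicated in HDFS-11608:

```bash
# Print the client-side default block size in bytes; it should be well below
# 2147483648 (2 GB), the signed-int limit behind the HDFS-11608 overflow.
hdfs getconf -confKey dfs.blocksize

# Check the block size actually recorded for an existing file
# (/tmp/somefile is a placeholder path).
hdfs dfs -stat "%o" /tmp/somefile
```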
						
					
02-22-2017 07:02 PM (1 Kudo)

Can you try "export HADOOP_ROOT_LOGGER=TRACE,console" before running "hdfs dfs -ls /"? That will reveal more end-to-end RPC-related traces for finding the root cause.
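
For example, a sketch of that debugging session (the log file name is just a placeholder):

```bash
# Turn on TRACE logging for the HDFS client, run the failing command,
# and keep a copy of the output for later analysis.
export HADOOP_ROOT_LOGGER=TRACE,console
hdfs dfs -ls / 2>&1 | tee hdfs-ls-trace.log

# Revert to the default log level afterwards.
unset HADOOP_ROOT_LOGGER
```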
						
					
08-19-2016 08:14 PM

							 spaceConsumed = length * replicationFactor 
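
To illustrate the formula (a sketch; the path and numbers are made-up examples), hdfs dfs -du reports both values side by side on recent Hadoop releases:

```bash
# First column: file length; second column (on newer releases): space consumed,
# i.e. length * replication factor.
hdfs dfs -du /data/file.bin
# Example output for a 128 MB file with replication factor 3:
# 134217728  402653184  /data/file.bin

# The replication factor itself can be read with -stat.
hdfs dfs -stat "%r" /data/file.bin
```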
						
					
08-10-2016 07:52 PM (1 Kudo)

Based on the error below, you should first check whether your (single) datanode is running. If it is, ensure it is not listed in the dfs.hosts.exclude file referenced from hdfs-site.xml and that it has enough space to store block files.

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: org.apache.hadoop.ipc.RemoteException: File /email/headers/.506170560796063 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
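
A sketch of those checks from the command line (the exclude-file path is a placeholder; use whatever dfs.hosts.exclude points to):

```bash
# 1. Confirm the datanode is alive and registered with the namenode.
hdfs dfsadmin -report | grep -A1 "Live datanodes"

# 2. See whether an exclude file is configured and whether the datanode is listed in it.
hdfs getconf -confKey dfs.hosts.exclude
cat /etc/hadoop/conf/dfs.exclude    # placeholder path

# 3. Verify the datanode still has free space for block files.
hdfs dfsadmin -report | grep "DFS Remaining"
```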
						
					
07-11-2016 10:46 PM (2 Kudos)

@Felix Albani You will need to provide the configuration file location with the --config parameter, just as Ambari does. E.g.:

hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start datanode
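
For completeness, a sketch assuming the same HDP client configuration path as above:

```bash
# Verify the DataNode JVM came up after the start command.
jps | grep -i datanode

# The matching stop command takes the same configuration directory.
hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf stop datanode
```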
						
					
07-06-2016 08:40 PM (4 Kudos)

We have seen many incidents of an overloaded HDFS namenode due to 1) misconfigurations or 2) "bad" MR jobs or Hive queries that generate a large number of RPC requests in a short period of time. Quite a few features have been introduced in HDP 2.3/2.4 to protect the HDFS namenode. This article summarizes the deployment steps for these features, along with an incomplete list of known issues and possible solutions.

- Enable Async Audit Logging
- Dedicated Service RPC Port
- Dedicated Lifeline RPC Port for HA
- Enable FairCallQueue on Client RPC Port
- Enable RPC Client Backoff on Client RPC port
- Enable RPC Caller Context to track the "bad" jobs
- Enable Response time based backoff with DecayedRpcScheduler
- Check JMX for namenode client RPC call queue length and average queue time
- Check JMX for namenode DecayRpcScheduler when FCQ is enabled
- NNtop (HDFS-6982)

1. Enable Async Audit Logging

Enable async audit logging by setting "dfs.namenode.audit.log.async" to true in hdfs-site.xml. This can minimize the impact of audit log I/O on namenode performance.

<property>
  <name>dfs.namenode.audit.log.async</name>
  <value>true</value>
</property>
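
A small verification sketch (not part of the original article): hdfs getconf reads the value from the local configuration files, so run it on the namenode host after deploying the change.

```bash
# Should print "true" once the property above is in hdfs-site.xml;
# the namenode must be restarted for the setting to take effect.
hdfs getconf -confKey dfs.namenode.audit.log.async
```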

2. Dedicated Service RPC Port

Configuring a separate service RPC port can improve the responsiveness of the NameNode by allowing DataNode and client requests to be processed via separate RPC queues. DataNodes and all other services then connect to the new service RPC address, while clients continue to connect to the well-known address specified by dfs.namenode.rpc-address.

Adding a service RPC port to an HA cluster with automatic failover via ZKFCs (with or without Kerberos) requires the following additional steps.

1. Add the following settings to hdfs-site.xml:

<property>
  <name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8040</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8040</value>
</property>

2. If the cluster is not Kerberos enabled, skip this step. If the cluster is Kerberos enabled, create two new hdfs_jaas.conf files for nn1 and nn2 and copy them to /etc/hadoop/conf/hdfs_jaas.conf on the respective hosts.

nn1:

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  useTicketCache=false
  keyTab="/etc/security/keytabs/nn.service.keytab"
  principal="nn/c6401.ambari.apache.org@EXAMPLE.COM";
};

nn2:

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  useTicketCache=false
  keyTab="/etc/security/keytabs/nn.service.keytab"
  principal="nn/c6402.ambari.apache.org@EXAMPLE.COM";
};

Then add the following to hadoop-env.sh:

export HADOOP_NAMENODE_OPTS="-Dzookeeper.sasl.client=true -Dzookeeper.sasl.client.username=zookeeper -Djava.security.auth.login.config=/etc/hadoop/conf/hdfs_jaas.conf -Dzookeeper.sasl.clientconfig=Client ${HADOOP_NAMENODE_OPTS}"

3. Restart the NameNodes.

4. Restart the DataNodes so that they connect to the new NameNode service RPC port instead of the NameNode client RPC port.

5. Stop the ZKFC processes on both NameNodes.

6. Run the following command to reset the ZKFC state in ZooKeeper:

hdfs zkfc -formatZK

Known issues:

1. Without step 6 you will see the following exception after ZKFC restart:

java.lang.RuntimeException: Mismatched address stored in ZK for NameNode

2. Without step 2 in a Kerberos-enabled HA cluster, you will see the following exception when running step 6:

16/03/23 03:30:53 INFO ha.ActiveStandbyElector: Recursively deleting /hadoop-ha/hdp64ha from ZK...
16/03/23 03:30:53 ERROR ha.ZKFailoverController: Unable to clear zk parent znode
java.io.IOException: Couldn't clear parent znode /hadoop-ha/hdp64ha
  at org.apache.hadoop.ha.ActiveStandbyElector.clearParentZNode(ActiveStandbyElector.java:380)
  at org.apache.hadoop.ha.ZKFailoverController.formatZK(ZKFailoverController.java:267)
  at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:212)
  at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
  at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
  at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:360)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
  at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:442)
  at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
  at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:183)
Caused by: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /hadoop-ha/hdp64ha
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:125)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
  at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
  at org.apache.zookeeper.ZKUtil.deleteRecursive(ZKUtil.java:54)
  at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:375)
  at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:372)
  at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1041)
  at org.apache.hadoop.ha.ActiveStandbyElector.clearParentZNode(ActiveStandbyElector.java:372)
  ... 11 more

3. Dedicated Lifeline RPC Port for HA

HDFS-9311 allows using a separate RPC address to isolate health checks and liveness monitoring from the client RPC port, which could be exhausted by "bad" jobs. Here is an example of configuring this feature in an HA cluster:

<property>
  <name>dfs.namenode.lifeline.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8050</value>
</property>
<property>
  <name>dfs.namenode.lifeline.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8050</value>
</property>
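
A verification sketch for sections 2 and 3 (the hostnames and ports match the examples above; the ss utility is an assumption about the OS tooling):

```bash
# Confirm the configured service RPC and lifeline addresses are visible in the configuration.
hdfs getconf -confKey dfs.namenode.servicerpc-address.mycluster.nn1
hdfs getconf -confKey dfs.namenode.lifeline.rpc-address.mycluster.nn1

# On nn1.example.com, check that the namenode is listening on the new ports
# (8040 and 8050 in this example).
ss -ltn | grep -E ':(8040|8050)'
```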
						
					
06-07-2016 09:06 PM

For misconfigurations like the cases above, you will find an INFO-level log message like the one below:

"The configured checkpoint interval is 0 minutes. Using an interval of XX (e.g., 60) minutes that is used for deletion instead"
						
					
06-07-2016 09:01 PM (1 Kudo)

Yes, when fs.trash.checkpoint.interval=0 or when fs.trash.checkpoint.interval is not set, fs.trash.interval is used as the checkpoint interval.

Also, fs.trash.checkpoint.interval should always be set smaller than fs.trash.interval. If it is not, fs.trash.interval is used as the checkpoint interval, similar to the case above.
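
To make the relationship concrete, a sketch of reading back the effective values (the 360/60 minute numbers are illustrative, not from this thread; both properties live in core-site.xml and are expressed in minutes):

```bash
# fs.trash.interval: how long deleted files are kept in .Trash (minutes).
# fs.trash.checkpoint.interval: how often the trash checkpointer runs (minutes);
# 0 or unset means "use fs.trash.interval", and it should not exceed fs.trash.interval.
hdfs getconf -confKey fs.trash.interval              # e.g. 360
hdfs getconf -confKey fs.trash.checkpoint.interval   # e.g. 60
```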
						
					
06-02-2016 11:43 PM (2 Kudos)

This looks like a network issue with your datanodes handling the replication workload. Can you check the ifconfig output for the MTU on all of the datanodes and ensure it is configured consistently?

Below is a short list from a tutorial by @mjohnson on network best practices, which could help with troubleshooting: https://community.hortonworks.com/articles/8563/typical-hdp-cluster-network-configuration-best-pra.html

- Make certain all members of the HDP cluster have passwordless SSH configured.
- Basic heartbeat (typically 3x/second) and administrative commands generated by the Hadoop cluster are infrequent and transfer only small amounts of data, except in extremely large cluster deployments.
- Keep in mind that NAS disks will require more network utilization than plain old disk drives resident on the data node.
- Make certain both fully qualified host names and host aliases are defined and resolvable by all nodes within the cluster.
- Ensure the network interface is consistently defined for all members of the Hadoop cluster (i.e. MTU settings should be consistent).
- Look into defining MTU for all interfaces on the cluster to support jumbo frames (typically MTU=9000), but only do this if all nodes and switches support this functionality. Inconsistent or undefined MTU configurations can produce serious problems with the network.
- Disable Transparent Huge Page compaction for all nodes on the cluster.
- Make certain all of the HDP cluster's network connections are monitored for collisions and lost packets, and have the network administration team tune the network as required to address any issues identified as part of the network monitoring.
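
A sketch of the MTU consistency check suggested above (the host list and the interface name eth0 are placeholders for your environment):

```bash
# Print the MTU of the data interface on every datanode; all values should match.
for host in dn1.example.com dn2.example.com dn3.example.com; do
  echo -n "$host: "
  ssh "$host" "ip -o link show eth0 | grep -o 'mtu [0-9]*'"
done
```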
						
					
11-12-2015 09:38 PM (2 Kudos)

You can use the hotswap capability introduced by HDFS-1362 to replace slave node disks without decommission/recommission (restart). Ambari may not support this yet, but you can always do it with the hdfs command line. More details can be found at this link.
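
A sketch of the command-line flow (the mount point, hostname, and IPC port are placeholders; dfsadmin -reconfig is the standard hot-swap mechanism):

```bash
# 1. Edit dfs.datanode.data.dir in the datanode's hdfs-site.xml, e.g. remove
#    /grid/2/hadoop/hdfs/data from the comma-separated list of data directories.

# 2. Ask the datanode to reload its data directories without a restart
#    (dn1.example.com:8010 stands for the datanode's dfs.datanode.ipc.address).
hdfs dfsadmin -reconfig datanode dn1.example.com:8010 start

# 3. Poll until the reconfiguration task reports that it has finished.
hdfs dfsadmin -reconfig datanode dn1.example.com:8010 status
```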
						
					