Member since 08-08-2017
1652 Posts | 30 Kudos Received | 11 Solutions
        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1910 | 06-15-2020 05:23 AM |
|  | 15411 | 01-30-2020 08:04 PM |
|  | 2045 | 07-07-2019 09:06 PM |
|  | 8090 | 01-27-2018 10:17 PM |
|  | 4554 | 12-31-2017 10:12 PM |
08-20-2024 02:37 PM
We have a cluster with 12 Kafka machines and 3 ZooKeeper servers, all on Linux. The Kafka version is 2.7 (broker and controller are co-hosted in the same PID).

As is well known, Kafka has two important logs: **server.log** and **controller.log**.

About **controller.log**: when we look at this log we can see the words "`Shutdown completed`":

```
[2024-08-20 21:42:01,582] INFO [ControllerEventThread controllerId=1001] Shutdown completed (kafka.controller.ControllerEventManager$ControllerEventThread)
```

Our first thought on seeing the "`Shutdown completed`" messages was that this message is "bad", and we wondered why the controller stopped. But when we look at all the machines, most of them have this `Shutdown completed` (`kafka.controller.ControllerEventManager$ControllerEventThread`) message.

**But on the other hand**, only one controller should be active among all brokers, so maybe the "`Shutdown completed`" messages only indicate that the controllers that are not active are in a standby state, and are therefore in the `Shutdown completed` state?

For example, here is the log from one broker machine:

```
[2024-08-20 21:23:18,084] DEBUG [Controller id=1001] Broker 1007 was elected as controller instead of broker 1001 (kafka.controller.KafkaController)
org.apache.kafka.common.errors.ControllerMovedException: Controller moved to another broker. Aborting controller startup procedure
[2024-08-20 21:33:51,281] DEBUG [Controller id=1001] Broker 1005 was elected as controller instead of broker 1001 (kafka.controller.KafkaController)
org.apache.kafka.common.errors.ControllerMovedException: Controller moved to another broker. Aborting controller startup procedure
[2024-08-20 21:42:01,581] INFO [ControllerEventThread controllerId=1001] Shutting down (kafka.controller.ControllerEventManager$ControllerEventThread)
[2024-08-20 21:42:01,582] INFO [ControllerEventThread controllerId=1001] Shutdown completed (kafka.controller.ControllerEventManager$ControllerEventThread)
[2024-08-20 21:42:01,582] INFO [ControllerEventThread controllerId=1001] Stopped (kafka.controller.ControllerEventManager$ControllerEventThread)
[2024-08-20 21:42:01,582] DEBUG [Controller id=1001] Resigning (kafka.controller.KafkaController)
[2024-08-20 21:42:01,583] DEBUG [Controller id=1001] Unregister BrokerModifications handler for Set() (kafka.controller.KafkaController)
[2024-08-20 21:42:01,604] INFO [PartitionStateMachine controllerId=1001] Stopped partition state machine (kafka.controller.ZkPartitionStateMachine)
[2024-08-20 21:42:01,608] INFO [ReplicaStateMachine controllerId=1001] Stopped replica state machine (kafka.controller.ZkReplicaStateMachine)
[2024-08-20 21:42:01,608] INFO [Controller id=1001] Resigned (kafka.controller.KafkaController)
[2024-08-20 21:43:45,196] INFO [ControllerEventThread controllerId=1001] Starting (kafka.controller.ControllerEventManager$ControllerEventThread)
[2024-08-20 21:43:45,208] DEBUG [Controller id=1001] Broker 1005 has been elected as the controller, so stopping the election process. (kafka.controller.KafkaController)
[2024-08-20 21:52:28,400] DEBUG [Controller id=1001] Broker 1001 was elected as controller instead of broker 1001 (kafka.controller.KafkaController)
org.apache.kafka.common.errors.ControllerMovedException: Controller moved to another broker. Aborting controller startup procedure   <---- LOG ENDS HERE
```

So the question is: can we ignore messages such as `INFO [ControllerEventThread controllerId=1001] Shutdown completed (kafka.controller.ControllerEventManager$ControllerEventThread)`, or is something perhaps wrong with the Kafka controller?
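One way to confirm which broker currently holds the controller role is to read the `/controller` znode in ZooKeeper. A minimal sketch, run from the Kafka installation directory, where `zk1:2181` is a placeholder for one of the three ZooKeeper servers:

```bash
# Read the /controller znode; its JSON payload names the active controller.
# zk1:2181 is a placeholder for one of your ZooKeeper servers.
bin/zookeeper-shell.sh zk1:2181 get /controller
# Example output: {"version":1,"brokerid":1005,"timestamp":"..."}
# Brokers whose id is NOT the one listed here are expected to log
# "Shutdown completed" for their ControllerEventThread after losing an election.
```

If the znode consistently names a single broker while the other brokers only log lost elections and resignations, the `Shutdown completed` lines match normal single-active-controller behavior rather than a fault.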
Labels: Apache Kafka
03-20-2024 11:46 PM

1 Kudo
Thank you for the response.

But look at this as well:

```
2024-03-18 19:31:52,673 WARN datanode.DataNode (BlockReceiver.java:receivePacket(701)) - Slow BlockReceiver write data to disk cost:756ms (threshold=300ms), volume=/data/sde/hadoop/hdfs/data
2024-03-18 19:35:15,334 WARN datanode.DataNode (BlockReceiver.java:receivePacket(701)) - Slow BlockReceiver write data to disk cost:377ms (threshold=300ms), volume=/data/sdc/hadoop/hdfs/data
2024-03-18 19:51:57,774 WARN datanode.DataNode (BlockReceiver.java:receivePacket(701)) - Slow BlockReceiver write data to disk cost:375ms (threshold=300ms), volume=/data/sdb/hadoop/hdfs/data
```

As you can see, the warning also appears on local disks, not only across the network. In any case, we already checked the network, including the switches, and we did not find a problem.

Do you think it could be a tuning issue in the HDFS parameters, or are there parameters that could help?
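To see whether the slow local writes concentrate on one disk, it can help to count the warnings per volume and then watch device latency while they occur. A sketch, assuming the DataNode log path below (the path is an assumption; adjust it to your layout):

```bash
# Count "Slow BlockReceiver write data to disk" warnings per HDFS volume.
LOG=/var/log/hadoop/hdfs/hadoop-hdfs-datanode-$(hostname -s).log   # assumed path
grep 'Slow BlockReceiver write data to disk' "$LOG" \
  | grep -o 'volume=[^ ,]*' | sort | uniq -c | sort -rn

# Watch extended device statistics for a few samples; sustained high await
# on one /data/sd* device points at that disk rather than at HDFS tuning.
iostat -x 5 3
```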
03-19-2024 07:12 AM
We have a Hadoop cluster with `487` data-node machines (each data-node machine also includes the NodeManager service). All machines are physical (DELL), and the OS is RHEL 7.9.

Each data-node machine has 12 disks, and each disk is 12 TB in size.

The Hadoop cluster was installed from HDP packages (previously under Hortonworks and now under Cloudera).

Users are complaining about slowness in the Spark applications that run on the data-node machines, and after investigation we saw the following warnings in the data-node logs:

```
2024-03-18 17:41:30,230 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 401ms (threshold=300ms), downstream DNs=[172.87.171.24:50010, 172.87.171.23:50010]
2024-03-18 17:41:49,795 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 410ms (threshold=300ms), downstream DNs=[172.87.171.26:50010, 172.87.171.31:50010]
2024-03-18 18:06:29,585 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 303ms (threshold=300ms), downstream DNs=[172.87.171.34:50010, 172.87.171.22:50010]
2024-03-18 18:18:55,931 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 729ms (threshold=300ms), downstream DNs=[172.87.11.27:50010]
```

From the log above we can see the warning `Slow BlockReceiver write packet to mirror took xxms`, together with the downstream data-node machines, such as `172.87.171.23`, `172.87.171.24`, etc.

From my understanding, exceptions such as `Slow BlockReceiver write packet to mirror` may indicate a delay in writing the block to the OS cache or disk.

So I am trying to collect the possible reasons for this warning, and here they are:

1. A delay in writing the block to the OS cache or disk
2. The cluster is at or near its resource limits (memory, CPU, or disk)
3. Network issues between machines

From my verification I do not see a **disk**, **CPU**, or **memory** problem; we checked all the machines. From the network side I do not see any special issues relevant to the machines themselves, and we also used iperf3 to check the bandwidth between one machine and another.

Here is an example between `data-node01` and `data-node03` (from my understanding, and please correct me if I am wrong, the bandwidth looks OK).

From data-node01:

```
iperf3 -i 10 -s
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.00  sec  7.90 GBytes  6.78 Gbits/sec
[  5]  10.00-20.00  sec  8.21 GBytes  7.05 Gbits/sec
[  5]  20.00-30.00  sec  7.25 GBytes  6.23 Gbits/sec
[  5]  30.00-40.00  sec  7.16 GBytes  6.15 Gbits/sec
[  5]  40.00-50.00  sec  7.08 GBytes  6.08 Gbits/sec
[  5]  50.00-60.00  sec  6.27 GBytes  5.39 Gbits/sec
[  5]  60.00-60.04  sec  35.4 MBytes  7.51 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-60.04  sec  0.00 Bytes   0.00 bits/sec   sender
[  5]   0.00-60.04  sec  43.9 GBytes  6.28 Gbits/sec  receiver
```

From data-node03:

```
iperf3 -i 1 -t 60 -c 172.87.171.84
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   792 MBytes  6.64 Gbits/sec    0   3.02 MBytes
[  4]   1.00-2.00   sec   834 MBytes  6.99 Gbits/sec   54   2.26 MBytes
[  4]   2.00-3.00   sec   960 MBytes  8.05 Gbits/sec    0   2.49 MBytes
[  4]   3.00-4.00   sec   896 MBytes  7.52 Gbits/sec    0   2.62 MBytes
[  4]   4.00-5.00   sec   790 MBytes  6.63 Gbits/sec    0   2.70 MBytes
[  4]   5.00-6.00   sec   838 MBytes  7.03 Gbits/sec    4   1.97 MBytes
[  4]   6.00-7.00   sec   816 MBytes  6.85 Gbits/sec    0   2.17 MBytes
[  4]   7.00-8.00   sec   728 MBytes  6.10 Gbits/sec    0   2.37 MBytes
[  4]   8.00-9.00   sec   692 MBytes  5.81 Gbits/sec   47   1.74 MBytes
[  4]   9.00-10.00  sec   778 MBytes  6.52 Gbits/sec    0   1.91 MBytes
[  4]  10.00-11.00  sec   785 MBytes  6.58 Gbits/sec   48   1.57 MBytes
[  4]  11.00-12.00  sec   861 MBytes  7.23 Gbits/sec    0   1.84 MBytes
[  4]  12.00-13.00  sec   844 MBytes  7.08 Gbits/sec    0   1.96 MBytes
```

Note: the NIC cards run at `10G` speed (we checked this with ethtool). We also checked the firmware version of the NIC card:

```
ethtool -i p1p1
driver: i40e
version: 2.8.20-k
firmware-version: 8.40 0x8000af82 20.5.13
expansion-rom-version:
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
```

We also checked the kernel messages (`dmesg`), but we did not see anything special.
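Because `write packet to mirror` times the send to the downstream data-nodes in the write pipeline, per-packet latency to the DNs named in the warnings can matter even when iperf3 bandwidth looks healthy. A small sketch, assuming the warnings were copied into a local `datanode.log` file (an assumption):

```bash
# Pull the downstream DN addresses out of the warnings and probe each one;
# bandwidth can look fine while round-trip latency spikes under load.
grep 'Slow BlockReceiver write packet to mirror' datanode.log \
  | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | sort -u \
  | while read -r dn; do
      echo "== $dn =="
      ping -c 5 -q "$dn" | tail -n 2   # check avg/max RTT and packet loss
    done
```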
Labels: HDFS
02-21-2024 03:30 AM

1 Kudo
We have a Hadoop cluster with active/standby ResourceManager services. The active ResourceManager is on the master1 machine and the standby ResourceManager is on the master2 machine.

In our cluster, the YARN service that includes both ResourceManager services manages 276 NodeManager components on the worker machines.

From the Ambari Web UI alerts (Alerts for Resource Manager), we noticed the following:

```
Resource Manager Web UI
Connection failed to http://master2.jupiter.com:8088 (timed out)
```

We started to debug the issue with wget on port 8088, and we found that the process hangs on `HTTP request sent, awaiting response... No data received`.

Example from the ResourceManager machine:

```
wget --debug http://master2.jupiter.com:8088
DEBUG output created by Wget 1.14 on Linux-gnu.

URI encoding = ‘UTF-8’
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:13:42--  http://master2.jupiter.com:8088/
Resolving master2.jupiter.com (master2.jupiter.com)... 192.9.201.169
Caching master2.jupiter.com => 192.9.201.169
Connecting to master2.jupiter.com (master2.jupiter.com)|192.9.201.169|:8088... connected.
Created socket 3.
Releasing 0x0000000000a0da00 (new refcount 1).

---request begin---
GET / HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: master2.jupiter.com:8088
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Wed, 21 Feb 2024 10:13:42 GMT
Date: Wed, 21 Feb 2024 10:13:42 GMT
Pragma: no-cache
Expires: Wed, 21 Feb 2024 10:13:42 GMT
Date: Wed, 21 Feb 2024 10:13:42 GMT
Pragma: no-cache
Content-Type: text/plain; charset=UTF-8
X-Frame-Options: SAMEORIGIN
Location: http://master1.jupiter.com:8088/
Content-Length: 43
Server: Jetty(6.1.26.hwx)

---response end---
307 TEMPORARY_REDIRECT
Registered socket 3 for persistent reuse.
URI content encoding = ‘UTF-8’
Location: http://master1.jupiter.com:8088/ [following]
Skipping 43 bytes of body: [This is standby RM. The redirect url is: /
] done.
URI content encoding = None
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:13:42--  http://master1.jupiter.com:8088/
conaddr is: 192.9.201.169
Resolving master1.jupiter.com (master1.jupiter.com)... 192.9.66.14
Caching master1.jupiter.com => 192.9.66.14
Releasing 0x0000000000a0f320 (new refcount 1).
Found master1.jupiter.com in host_name_addresses_map (0xa0f320)
Connecting to master1.jupiter.com (master1.jupiter.com)|192.9.66.14|:8088... connected.
Created socket 4.
Releasing 0x0000000000a0f320 (new refcount 1).
.
.
.

---response end---
302 Found
Disabling further reuse of socket 3.
Closed fd 3
Registered socket 4 for persistent reuse.
URI content encoding = ‘UTF-8’
Location: http://master1.jupiter.com:8088/cluster [following]
] done.
URI content encoding = None
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:27:07--  http://master1.jupiter.com:8088/cluster
Reusing existing connection to master1.jupiter.com:8088.
Reusing fd 4.

---request begin---
GET /cluster HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: master1.jupiter.com:8088
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Wed, 21 Feb 2024 10:30:23 GMT
Date: Wed, 21 Feb 2024 10:30:23 GMT
Pragma: no-cache
Expires: Wed, 21 Feb 2024 10:30:23 GMT
Date: Wed, 21 Feb 2024 10:30:23 GMT
Pragma: no-cache
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked
Server: Jetty(6.1.26.hwx)

---response end---
200 OK
URI content encoding = ‘utf-8’
Length: unspecified [text/html]
Saving to: ‘index.html’

[ <=> ] 1,018,917  --.-K/s  in 0.04s

2024-02-21 10:31:31 (24.0 MB/s) - ‘index.html’ saved [1018917]
```

As we can see above, wget completed only after a very long time, around ~20 minutes, instead of completing in one or two seconds.

We can take a tcpdump as follows:

```
tcpdump -vv -s0 tcp port 8088 -w /tmp/why_8088_hang.pcap
```

But I want to understand whether there are better, simpler ways to understand why we get `HTTP request sent, awaiting response...`, and whether this may be related to the ResourceManager service.
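A lighter-weight probe than a full tcpdump is curl's per-phase timing output, which separates DNS resolution, TCP connect, and time to first byte; a sketch that follows the same standby-to-active redirect chain wget walked:

```bash
# -L follows the standby -> active RM redirect; -w reports where time went.
curl -sS -L -o /dev/null \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  http://master2.jupiter.com:8088/
# A small connect time with a very large ttfb/total means the socket opened
# quickly but the RM (Jetty) was slow to answer, which points at the service
# rather than at the network path to port 8088.
```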
Labels: Apache YARN
02-15-2024 09:02 AM

1 Kudo
We have an HDP cluster with 152 worker machines, `worker1.duplex.com` .. `worker152.duplex.com`, where all machines are installed with RHEL 7.9.

We are trying to delete the last host, `worker152.duplex.com`, from the Ambari server, or actually from the PostgreSQL DB, as follows.

First we need to find the `host_id`:

```sql
select host_id from hosts where host_name='worker152.duplex.com';
```

and the `host_id` is:

```
 host_id
---------
      51
(1 row)
```

Now we delete this `host_id` (51):

```sql
delete from execution_command where task_id in (select task_id from host_role_command where host_id in (51));
delete from host_version where host_id in (51);
delete from host_role_command where host_id in (51);
delete from serviceconfighosts where host_id in (51);
delete from hoststate where host_id in (51);
delete from kerberos_principal_host WHERE host_id='worker152.duplex.com';
delete from hosts where host_name in ('worker152.duplex.com');
delete from alert_current where history_id in ( select alert_id from alert_history where host_name in ('worker152.duplex.com'));
```

Now we verify that `host_id` 51, which represented the host `worker152.duplex.com`, no longer exists, with the following check:

```
ambari=> select host_name, public_host_name from hosts;
        host_name         |     public_host_name
--------------------------+--------------------------
 worker1.duplex.com
 .
 .
 .
 worker151.duplex.com
```

As we can see above, the host `worker152.duplex.com` does not exist any more, and that's fine; indeed it seems that the host `worker152.duplex.com` was deleted from the PostgreSQL DB.

Now we restart the `ambari-server` in order for the change to take effect (this also restarts the PostgreSQL service):

```
ambari-server restart
Using python /usr/bin/python
Restarting ambari-server
Waiting for server stop...
Ambari Server stopped
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start.........................
Server started listening on 8080

DB configs consistency check: no errors and warnings were found.
```

After the Ambari server started, we were surprised, because `host_id` 51, i.e. the host `worker152.duplex.com`, still exists:

```
ambari=> select host_name, public_host_name from hosts;
        host_name         |     public_host_name
--------------------------+--------------------------
 worker1.duplex.com
 .
 .
 .
 worker152.duplex.com
```

We do not understand why this host came back, even though we deleted its record. We also tried to delete the historical data as follows, but this did not help:

```
ambari-server db-purge-history --cluster-name hadoop7 --from-date 2024-01-01
Using python /usr/bin/python
Purge database history...
Ambari Server configured for Embedded Postgres. Confirm you have made a backup of the Ambari Server database [y/n]yes
ERROR: The database purge historical data cannot proceed while Ambari Server is running. Please shut down Ambari first.
Ambari Server 'db-purge-history' completed successfully.
```

1. Why did the host return after the `ambari-server` restart?
2. What is wrong with our deletion process?

PostgreSQL version:

```
postgres=# SHOW server_version;
 server_version
----------------
 9.2.24
(1 row)
```

Links:

https://www.andruffsolutions.com/removing-old-host-data-from-ambari-server-and-tuning-the-database/
https://community.cloudera.com/t5/Support-Questions/how-to-remove-old-registered-hosts-from-DB/m-p/217524/highlight/true
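For comparison, host removal is normally done through Ambari's REST API rather than by editing the database behind a running server, so that Ambari's in-memory state and PostgreSQL stay consistent. A sketch, assuming the server answers on `ambari.duplex.com:8080` with `admin:admin` credentials (both are placeholders):

```bash
# Stop/delete all components on the host first, then delete the host itself.
# ambari.duplex.com and admin:admin are placeholders for your environment.
curl -u admin:admin -H 'X-Requested-By: ambari' -X DELETE \
  "http://ambari.duplex.com:8080/api/v1/clusters/hadoop7/hosts/worker152.duplex.com"
# Rows deleted directly from PostgreSQL can come back, e.g. when the host's
# still-running ambari-agent re-registers on its next heartbeat.
```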
Labels: Hortonworks Data Platform (HDP)
02-04-2024 10:59 AM

1 Kudo
You can balance the data-node disk usage by decommissioning and recommissioning, but if you have only 2 data-nodes then that is a problem; it is better to do this with at least 3 data-nodes in the cluster.
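A minimal sketch of the decommission/recommission cycle, assuming the NameNode's `dfs.hosts.exclude` property points at `/etc/hadoop/conf/dfs.exclude` (both the property wiring and the path are assumptions, and `worker3.example.com` is a hypothetical host):

```bash
# 1. Add the data-node to the excludes file and ask the NameNode to re-read it.
echo 'worker3.example.com' >> /etc/hadoop/conf/dfs.exclude
sudo -u hdfs hdfs dfsadmin -refreshNodes

# 2. Wait until the node reports "Decommissioned" before going further.
sudo -u hdfs hdfs dfsadmin -report | grep -A 1 'worker3.example.com'

# 3. Remove it from the excludes file and refresh again to recommission;
#    blocks are then re-replicated back, spreading data across its disks.
sed -i '/worker3.example.com/d' /etc/hadoop/conf/dfs.exclude
sudo -u hdfs hdfs dfsadmin -refreshNodes
```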
02-04-2024 10:43 AM

1 Kudo
Let's say I copy the fsimage from the active to the standby namenode, and we then still have a problem starting the namenode. Can I do the steps as already mentioned in that case?
02-03-2024 02:20 PM

1 Kudo
We have an HDP Hadoop cluster with two name-node services (one active name-node, while the second is the standby name-node).

Due to an unexpected electricity failure, the standby name-node failed to start with the following exception, while the active name-node started successfully:

```
2024-02-02 08:47:11,497 INFO common.Storage (Storage.java:tryLock(776)) - Lock on /hadoop/hdfs/namenode/in_use.lock acquired by nodename 36146@master1.delax.com
2024-02-02 08:47:11,891 INFO namenode.FSImage (FSImage.java:loadFSImageFile(745)) - Planning to load image: FSImageFile(file=/hadoop/hdfs/namenode/current/fsimage_0000000052670667141, cpktTxId=0000000052670667141)
2024-02-02 08:47:11,897 ERROR namenode.FSImage (FSImage.java:loadFSImage(693)) - Failed to load image from FSImageFile(file=/hadoop/hdfs/namenode/current/fsimage_0000000052670667141, cpktTxId=0000000052670667141)
java.io.IOException: Premature EOF from inputStream
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:204)
	at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:221)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:898)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:882)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:755)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:686)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:303)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1077)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:697)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:761)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1001)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:985)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1710)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1778)
2024-02-02 08:47:12,238 WARN namenode.FSNamesystem (FSNamesystem.java:loadFromDisk(726)) - Encountered exception loading fsimage
java.io.IOException: Failed to load FSImage file, see error(s) above for more info.
```

We can see the exception `Failed to load image from FSImageFile` above, and it seems to be the result of the machine failing because of the unexpected shutdown.

As I understand it, one of the options to recover the standby name-node could be the following procedure:

1. Put the active NN in safe mode:
   ```
   sudo -u hdfs hdfs dfsadmin -safemode enter
   ```
2. Do a saveNamespace operation on the active NN:
   ```
   sudo -u hdfs hdfs dfsadmin -saveNamespace
   ```
3. Leave safe mode:
   ```
   sudo -u hdfs hdfs dfsadmin -safemode leave
   ```
4. Log in to the standby NN.
5. Run the command below on the standby namenode to fetch the latest fsimage that we saved in the steps above:
   ```
   sudo -u hdfs hdfs namenode -bootstrapStandby -force
   ```

We would be glad to receive any suggestions, or to hear whether the suggestion above is good enough for our problem.
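Before (and after) recovering, the damage can be confirmed by checking the fsimage against the `.md5` companion file the NameNode writes next to it; a short sketch, run in the standby's image directory (assuming the standard `.md5` companion is present):

```bash
cd /hadoop/hdfs/namenode/current
# A truncated image (matching the "Premature EOF" error) fails its checksum;
# rerunning the check after -bootstrapStandby confirms the fetched copy is intact.
md5sum -c fsimage_0000000052670667141.md5
ls -l fsimage_0000000052670667141
```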
Labels: HDFS, Hortonworks Data Platform (HDP)
02-03-2024 02:17 PM

2 Kudos
Can the following procedure also help?

Put the active NN in safe mode:

```
sudo -u hdfs hdfs dfsadmin -safemode enter
```

Do a saveNamespace operation on the active NN:

```
sudo -u hdfs hdfs dfsadmin -saveNamespace
```

Leave safe mode:

```
sudo -u hdfs hdfs dfsadmin -safemode leave
```

Log in to the standby NN.

Run the command below on the standby namenode to fetch the latest fsimage that was saved in the steps above:

```
sudo -u hdfs hdfs namenode -bootstrapStandby -force
```
02-22-2023 08:39 AM
We have an HDP cluster, version 2.6.5.

When we look at the name-node logs we can see the following warnings:

```
2023-02-20 15:58:31,377 WARN  server.Journal (Journal.java:journal(398)) - Sync of transaction
2023-02-20 16:00:39,037 WARN  server.Journal (Journal.java:journal(398)) - Sync of transaction
2023-02-20 16:01:43,962 WARN  server.Journal (Journal.java:journal(398)) - Sync of transaction range 193594954980-193594954980 took 1329ms
2023-02-20 16:02:47,129 WARN  server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595018764-193595018764 took 1321ms
2023-02-20 16:03:52,763 WARN  server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595106645-193595106646 took 1344ms
2023-02-20 16:04:56,276 WARN  server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595175233-193595175233 took 1678ms
2023-02-20 16:06:01,067 WARN  server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595252052-193595252052 took 1265ms
2023-02-20 16:07:06,447 WARN  server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595320796-193595320796 took 1273ms
```

In our HDP cluster, the HDFS service includes 2 name-node services and 3 journal-nodes; the cluster includes 736 data-node machines, and the HDFS service manages all of the data-nodes.

We want to understand the reason for the following warning, and how to avoid these messages with a proactive solution:

```
server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595018764-193595018764 took 1321ms
```
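Since this warning measures how long the JournalNode took to sync edits to its local disk, a first proactive step is to watch latency on the device backing `dfs.journalnode.edits.dir`; a sketch, assuming the edits directory is `/hadoop/hdfs/journal` (an assumption; check your configuration):

```bash
# Find the device that hosts the JournalNode edits directory (path assumed).
df /hadoop/hdfs/journal

# Watch its extended stats while the warnings appear; sync times over ~1s
# usually surface as high await/w_await on that device, suggesting a faster
# or dedicated disk for the edits dir rather than an HDFS parameter change.
iostat -x 5
```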
Labels: Ambari Blueprints