Member since 08-08-2017
1652 Posts
30 Kudos Received
11 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1484 | 06-15-2020 05:23 AM |
| | 9575 | 01-30-2020 08:04 PM |
| | 1639 | 07-07-2019 09:06 PM |
| | 6804 | 01-27-2018 10:17 PM |
| | 3824 | 12-31-2017 10:12 PM |
08-20-2024
02:37 PM
We have a cluster with 12 Kafka brokers and 3 ZooKeeper servers, all on Linux. The Kafka version is 2.7, and the broker and controller run in the same process (same PID).

Kafka has two important logs, **server.log** and **controller.log**. In **controller.log** we see the message "`Shutdown completed`":

[2024-08-20 21:42:01,582] INFO [ControllerEventThread controllerId=1001] Shutdown completed (kafka.controller.ControllerEventManager$ControllerEventThread)

Our first thought on seeing "`Shutdown completed`" was that the message is "bad" and that the controller had stopped. However, when we look at all the machines, most of them show the same message: `Shutdown completed` (`kafka.controller.ControllerEventManager$ControllerEventThread`).

**On the other hand**, only one broker should be the active controller at any time. So perhaps "`Shutdown completed`" only indicates that the controllers that are not active are in a standby state, and are therefore in the `Shutdown completed` state?

For example, here is the log from one broker machine:

[2024-08-20 21:23:18,084] DEBUG [Controller id=1001] Broker 1007 was elected as controller instead of broker 1001 (kafka.controller.KafkaController)
org.apache.kafka.common.errors.ControllerMovedException: Controller moved to another broker. Aborting controller startup procedure
[2024-08-20 21:33:51,281] DEBUG [Controller id=1001] Broker 1005 was elected as controller instead of broker 1001 (kafka.controller.KafkaController)
org.apache.kafka.common.errors.ControllerMovedException: Controller moved to another broker. Aborting controller startup procedure
[2024-08-20 21:42:01,581] INFO [ControllerEventThread controllerId=1001] Shutting down (kafka.controller.ControllerEventManager$ControllerEventThread)
[2024-08-20 21:42:01,582] INFO [ControllerEventThread controllerId=1001] Shutdown completed (kafka.controller.ControllerEventManager$ControllerEventThread)
[2024-08-20 21:42:01,582] INFO [ControllerEventThread controllerId=1001] Stopped (kafka.controller.ControllerEventManager$ControllerEventThread)
[2024-08-20 21:42:01,582] DEBUG [Controller id=1001] Resigning (kafka.controller.KafkaController)
[2024-08-20 21:42:01,583] DEBUG [Controller id=1001] Unregister BrokerModifications handler for Set() (kafka.controller.KafkaController)
[2024-08-20 21:42:01,604] INFO [PartitionStateMachine controllerId=1001] Stopped partition state machine (kafka.controller.ZkPartitionStateMachine)
[2024-08-20 21:42:01,608] INFO [ReplicaStateMachine controllerId=1001] Stopped replica state machine (kafka.controller.ZkReplicaStateMachine)
[2024-08-20 21:42:01,608] INFO [Controller id=1001] Resigned (kafka.controller.KafkaController)
[2024-08-20 21:43:45,196] INFO [ControllerEventThread controllerId=1001] Starting (kafka.controller.ControllerEventManager$ControllerEventThread)
[2024-08-20 21:43:45,208] DEBUG [Controller id=1001] Broker 1005 has been elected as the controller, so stopping the election process. (kafka.controller.KafkaController)
[2024-08-20 21:52:28,400] DEBUG [Controller id=1001] Broker 1001 was elected as controller instead of broker 1001 (kafka.controller.KafkaController)
org.apache.kafka.common.errors.ControllerMovedException: Controller moved to another broker. Aborting controller startup procedure
<---- the log ends here

So the question is: can we ignore messages such as `INFO [ControllerEventThread controllerId=1001] Shutdown completed (kafka.controller.ControllerEventManager$ControllerEventThread)`, or is something wrong with the Kafka controller?
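To put the `Shutdown completed` lines in context, it can help to list every controller election that a broker's controller.log records. The following is a minimal Python sketch that assumes the two message formats seen in the excerpt above; the function name `controller_elections` is illustrative and not part of Kafka, and the sketch only reads what the log says, it does not query ZooKeeper.

```python
import re

# Matches the two election messages that appear in the controller.log excerpt:
#   "Broker X was elected as controller instead of broker Y"
#   "Broker X has been elected as the controller, ..."
ELECTED = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \w+ \[Controller id=(?P<local>\d+)\] "
    r"Broker (?P<elected>\d+) "
    r"(?:was elected as controller instead of broker \d+"
    r"|has been elected as the controller)"
)


def controller_elections(lines):
    """Return (timestamp, local_broker_id, elected_broker_id) per election line."""
    events = []
    for line in lines:
        m = ELECTED.match(line.strip())
        if m:
            events.append((m["ts"], int(m["local"]), int(m["elected"])))
    return events
```

Running it over the controller.log of each broker and comparing the last elected broker id across machines gives a quick view of whether all brokers agree on the current controller.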
Labels: Apache Kafka
03-20-2024
11:46 PM
1 Kudo
Thank you for the response, but look at this as well:

2024-03-18 19:31:52,673 WARN datanode.DataNode (BlockReceiver.java:receivePacket(701)) - Slow BlockReceiver write data to disk cost:756ms (threshold=300ms), volume=/data/sde/hadoop/hdfs/data
2024-03-18 19:35:15,334 WARN datanode.DataNode (BlockReceiver.java:receivePacket(701)) - Slow BlockReceiver write data to disk cost:377ms (threshold=300ms), volume=/data/sdc/hadoop/hdfs/data
2024-03-18 19:51:57,774 WARN datanode.DataNode (BlockReceiver.java:receivePacket(701)) - Slow BlockReceiver write data to disk cost:375ms (threshold=300ms), volume=/data/sdb/hadoop/hdfs/data

As you can see, the warning also appears for local disks, not only across the network. In any case, we already checked the network, including the switches, and found no problem. Do you think this could be a tuning issue in the HDFS parameters, or are there parameters that could help?
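To see whether the slow disk writes cluster on particular disks, the warnings can be tallied per volume. This is a minimal Python sketch that assumes the exact message format quoted above; the function name `slow_writes_by_volume` is illustrative and not part of Hadoop.

```python
import re
from collections import defaultdict

# Format taken from the DataNode warnings quoted above.
SLOW_DISK = re.compile(
    r"Slow BlockReceiver write data to disk cost:(?P<ms>\d+)ms "
    r"\(threshold=\d+ms\), volume=(?P<volume>\S+)"
)


def slow_writes_by_volume(lines):
    """Map each volume to (number of warnings, worst latency in ms)."""
    costs = defaultdict(list)
    for line in lines:
        m = SLOW_DISK.search(line)
        if m:
            costs[m["volume"]].append(int(m["ms"]))
    return {vol: (len(ms), max(ms)) for vol, ms in costs.items()}
```

If one or two volumes account for most of the warnings on a machine, that points at those disks rather than at HDFS settings.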
03-19-2024
07:12 AM
We have a Hadoop cluster with `487` data-node machines (each data-node machine also runs the NodeManager service). All machines are physical (DELL) and run RHEL 7.9. Each data-node machine has 12 disks of 12T each. The cluster was installed from HDP packages (previously under Hortonworks, now under Cloudera).

Users are complaining about slow Spark applications running on the data-node machines. After investigating, we saw the following warnings in the data-node logs:

2024-03-18 17:41:30,230 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 401ms (threshold=300ms), downstream DNs=[172.87.171.24:50010, 172.87.171.23:50010]
2024-03-18 17:41:49,795 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 410ms (threshold=300ms), downstream DNs=[172.87.171.26:50010, 172.87.171.31:50010]
2024-03-18 18:06:29,585 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 303ms (threshold=300ms), downstream DNs=[172.87.171.34:50010, 172.87.171.22:50010]
2024-03-18 18:18:55,931 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 729ms (threshold=300ms), downstream DNs=[172.87.11.27:50010]

In the log above we can see the warning `Slow BlockReceiver write packet to mirror took xxms`, along with data-node addresses such as `172.87.171.23`, `172.87.171.24`, etc.

From my understanding, warnings such as `Slow BlockReceiver write packet to mirror` may indicate a delay in writing the block to the OS cache or disk. So I am trying to collect the possible reasons for this warning, and here they are:

1. delay in writing the block to the OS cache or disk
2. the cluster is at or near its resource limits (memory, CPU or disk)
3. network issues between machines

From my verification I do not see a **disk**, **CPU** or **memory** problem; we checked all machines. On the network side I do not see any particular issue on the machines themselves. We also used iperf3 to check the bandwidth between one machine and another. Here is an example between `data-node01` and `data-node03` (from my understanding the bandwidth looks OK; please correct me if I am wrong).

From data-node01:

iperf3 -i 10 -s
[ ID] Interval           Transfer     Bandwidth
[ 5]   0.00-10.00  sec  7.90 GBytes  6.78 Gbits/sec
[ 5]  10.00-20.00  sec  8.21 GBytes  7.05 Gbits/sec
[ 5]  20.00-30.00  sec  7.25 GBytes  6.23 Gbits/sec
[ 5]  30.00-40.00  sec  7.16 GBytes  6.15 Gbits/sec
[ 5]  40.00-50.00  sec  7.08 GBytes  6.08 Gbits/sec
[ 5]  50.00-60.00  sec  6.27 GBytes  5.39 Gbits/sec
[ 5]  60.00-60.04  sec  35.4 MBytes  7.51 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[ 5]   0.00-60.04  sec  0.00 Bytes   0.00 bits/sec   sender
[ 5]   0.00-60.04  sec  43.9 GBytes  6.28 Gbits/sec  receiver

From data-node03:

iperf3 -i 1 -t 60 -c 172.87.171.84
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[ 4]   0.00-1.00   sec   792 MBytes  6.64 Gbits/sec    0   3.02 MBytes
[ 4]   1.00-2.00   sec   834 MBytes  6.99 Gbits/sec   54   2.26 MBytes
[ 4]   2.00-3.00   sec   960 MBytes  8.05 Gbits/sec    0   2.49 MBytes
[ 4]   3.00-4.00   sec   896 MBytes  7.52 Gbits/sec    0   2.62 MBytes
[ 4]   4.00-5.00   sec   790 MBytes  6.63 Gbits/sec    0   2.70 MBytes
[ 4]   5.00-6.00   sec   838 MBytes  7.03 Gbits/sec    4   1.97 MBytes
[ 4]   6.00-7.00   sec   816 MBytes  6.85 Gbits/sec    0   2.17 MBytes
[ 4]   7.00-8.00   sec   728 MBytes  6.10 Gbits/sec    0   2.37 MBytes
[ 4]   8.00-9.00   sec   692 MBytes  5.81 Gbits/sec   47   1.74 MBytes
[ 4]   9.00-10.00  sec   778 MBytes  6.52 Gbits/sec    0   1.91 MBytes
[ 4]  10.00-11.00  sec   785 MBytes  6.58 Gbits/sec   48   1.57 MBytes
[ 4]  11.00-12.00  sec   861 MBytes  7.23 Gbits/sec    0   1.84 MBytes
[ 4]  12.00-13.00  sec   844 MBytes  7.08 Gbits/sec    0   1.96 MBytes

Note: the NICs run at `10G` speed (we checked this with ethtool). We also checked the firmware version of the NIC:

ethtool -i p1p1
driver: i40e
version: 2.8.20-k
firmware-version: 8.40 0x8000af82 20.5.13
expansion-rom-version:
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

We also checked the kernel messages (`dmesg`) but saw nothing unusual.
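With 487 data-nodes, a single iperf3 run between two machines says little about the rest of the cluster. One way to narrow things down is to count how often each downstream address appears in the `write packet to mirror` warnings. This is a minimal Python sketch that assumes the message format shown above; the function name `slow_mirror_counts` is illustrative and not part of Hadoop.

```python
import re
from collections import Counter

# Format taken from the "write packet to mirror" warnings quoted above.
SLOW_MIRROR = re.compile(
    r"Slow BlockReceiver write packet to mirror took (?P<ms>\d+)ms "
    r"\(threshold=\d+ms\), downstream DNs=\[(?P<dns>[^\]]*)\]"
)


def slow_mirror_counts(lines):
    """Count how many slow-mirror warnings mention each downstream address."""
    counts = Counter()
    for line in lines:
        m = SLOW_MIRROR.search(line)
        if m:
            for dn in m["dns"].split(","):
                counts[dn.strip()] += 1
    return counts
```

Calling `slow_mirror_counts(open(path))` on each DataNode log and summing the counters shows whether a few addresses dominate, which would suggest looking at those machines or their switch ports first.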
Labels: HDFS
02-21-2024
03:30 AM
1 Kudo
We have a Hadoop cluster with active/standby ResourceManager services. The active ResourceManager is on the master1 machine and the standby ResourceManager is on the master2 machine. In our cluster, the YARN service, which includes both ResourceManager services, manages 276 NodeManager components on the worker machines.

In the Ambari web UI alerts (Alerts for Resource Manager) we noticed the following:

Resource Manager Web UI Connection failed to http://master2.jupiter.com:8088(timed out)

We started to debug the issue with wget against port 8088 and found that the process hangs at `HTTP request sent, awaiting response... No data received`.

Example from the ResourceManager machine:

wget --debug http://master2.jupiter.com:8088
DEBUG output created by Wget 1.14 on Linux-gnu.
URI encoding = ‘UTF-8’
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:13:42-- http://master2.jupiter.com:8088/
Resolving master2.jupiter.com (master2.jupiter.com)... 192.9.201.169
Caching master2.jupiter.com => 192.9.201.169
Connecting to master2.jupiter.com (master2.jupiter.com)|192.9.201.169|:8088... connected.
Created socket 3.
Releasing 0x0000000000a0da00 (new refcount 1).
---request begin---
GET / HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: master2.jupiter.com:8088
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Wed, 21 Feb 2024 10:13:42 GMT
Date: Wed, 21 Feb 2024 10:13:42 GMT
Pragma: no-cache
Expires: Wed, 21 Feb 2024 10:13:42 GMT
Date: Wed, 21 Feb 2024 10:13:42 GMT
Pragma: no-cache
Content-Type: text/plain; charset=UTF-8
X-Frame-Options: SAMEORIGIN
Location: http://master1.jupiter.com:8088/
Content-Length: 43
Server: Jetty(6.1.26.hwx)
---response end---
307 TEMPORARY_REDIRECT
Registered socket 3 for persistent reuse.
URI content encoding = ‘UTF-8’
Location: http://master1.jupiter.com:8088/ [following]
Skipping 43 bytes of body: [This is standby RM. The redirect url is: / ] done.
URI content encoding = None
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:13:42-- http://master1.jupiter.com:8088/
conaddr is: 192.9.201.169
Resolving master1.jupiter.com (master1.jupiter.com)... 192.9.66.14
Caching master1.jupiter.com => 192.9.66.14
Releasing 0x0000000000a0f320 (new refcount 1).
Found master1.jupiter.com in host_name_addresses_map (0xa0f320)
Connecting to master1.jupiter.com (master1.jupiter.com)|192.9.66.14|:8088... connected.
Created socket 4.
Releasing 0x0000000000a0f320 (new refcount 1).
.
.
.
---response end---
302 Found
Disabling further reuse of socket 3.
Closed fd 3
Registered socket 4 for persistent reuse.
URI content encoding = ‘UTF-8’
Location: http://master1.jupiter.com:8088/cluster [following]
] done.
URI content encoding = None
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:27:07-- http://master1.jupiter.com:8088/cluster
Reusing existing connection to master1.jupiter.com:8088.
Reusing fd 4.
---request begin---
GET /cluster HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: master1.jupiter.com:8088
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Wed, 21 Feb 2024 10:30:23 GMT
Date: Wed, 21 Feb 2024 10:30:23 GMT
Pragma: no-cache
Expires: Wed, 21 Feb 2024 10:30:23 GMT
Date: Wed, 21 Feb 2024 10:30:23 GMT
Pragma: no-cache
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked
Server: Jetty(6.1.26.hwx)
---response end---
200 OK
URI content encoding = ‘utf-8’
Length: unspecified [text/html]
Saving to: ‘index.html’
[ <=> ] 1,018,917 --.-K/s in 0.04s
2024-02-21 10:31:31 (24.0 MB/s) - ‘index.html’ saved [1018917]

As we can see above, wget completed only after a very long time, around 20 minutes, instead of completing in one or two seconds.

We can capture a tcpdump with:

tcpdump -vv -s0 tcp port 8088 -w /tmp/why_8088_hang.pcap

But I want to understand whether there are better, simpler ways to find out why we get `HTTP request sent, awaiting response...`, and whether it is related to the ResourceManager service.
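Before reaching for a packet capture, it may be enough to time each phase of a single request, so that a slow TCP connect can be told apart from a server that accepts the connection but is slow to answer. This is a minimal Python sketch using only the standard library; the function name `time_http_get` and the returned field names are illustrative. It does not follow redirects, so the standby and the active ResourceManager can be measured separately.

```python
import http.client
import time


def time_http_get(host, port, path="/", timeout=30.0):
    """Time the connect, header and body phases of one GET request."""
    t0 = time.monotonic()
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.connect()
        t_connect = time.monotonic()
        conn.request("GET", path)
        resp = conn.getresponse()  # returns once the status line and headers arrive
        t_headers = time.monotonic()
        body = resp.read()
        t_done = time.monotonic()
        return {
            "status": resp.status,
            "location": resp.getheader("Location"),
            "connect_s": t_connect - t0,
            "headers_s": t_headers - t_connect,
            "body_s": t_done - t_headers,
            "bytes": len(body),
        }
    finally:
        conn.close()
```

For the hosts in this post that would be `time_http_get("master2.jupiter.com", 8088)` followed by `time_http_get("master1.jupiter.com", 8088, "/cluster")`. A small `connect_s` with a large `headers_s` would suggest the web application itself is slow to respond rather than the network path.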
Labels: Apache YARN
02-15-2024
09:02 AM
1 Kudo
We have an HDP cluster with 152 worker machines, `worker1.duplex.com` .. `worker152.duplex.com`, all running RHEL 7.9.

We are trying to delete the last host, `worker152.duplex.com`, from the Ambari server, or more precisely from the PostgreSQL DB, as follows.

First we need to find the `host_id`:

select host_id from hosts where host_name='worker152.duplex.com';

and the `host_id` is:

host_id
---------
51
(1 row)

Now we delete this `host_id` (51):

delete from execution_command where task_id in (select task_id from host_role_command where host_id in (51));
delete from host_version where host_id in (51);
delete from host_role_command where host_id in (51);
delete from serviceconfighosts where host_id in (51);
delete from hoststate where host_id in (51);
delete from kerberos_principal_host WHERE host_id='worker152.duplex.com';
delete from hosts where host_name in ('worker152.duplex.com');
delete from alert_current where history_id in ( select alert_id from alert_history where host_name in ('worker152.duplex.com'));

Now we verify that `host_id` 51, which represented the host `worker152.duplex.com`, no longer exists:

ambari=> select host_name, public_host_name from hosts;
host_name | public_host_name
--------------------------+--------------------------
worker1.duplex.com
.
.
.
worker151.duplex.com

As we can see above, the host `worker152.duplex.com` no longer appears, which is as expected; it seems the host was indeed deleted from the PostgreSQL DB.

Now we restart `ambari-server` for the change to take effect (this also restarts the PostgreSQL service):

ambari-server restart
Using python /usr/bin/python
Restarting ambari-server
Waiting for server stop...
Ambari Server stopped
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start.........................
Server started listening on 8080
DB configs consistency check: no errors and warnings were found.

After the Ambari server started, we were surprised to find that `host_id` 51, the host `worker152.duplex.com`, still exists:

ambari=> select host_name, public_host_name from hosts;
host_name | public_host_name
--------------------------+--------------------------
worker1.duplex.com
.
.
.
worker152.duplex.com

We do not understand why this host is back even though we deleted its record.

We also tried to delete historical data as follows, but this did not help:

ambari-server db-purge-history --cluster-name hadoop7 --from-date 2024-01-01
Using python /usr/bin/python
Purge database history...
Ambari Server configured for Embedded Postgres.
Confirm you have made a backup of the Ambari Server database [y/n]yes
ERROR: The database purge historical data cannot proceed while Ambari Server is running. Please shut down Ambari first.
Ambari Server 'db-purge-history' completed successfully.

1. Why did the host return after the `ambari-server` restart?
2. What is wrong with our deletion process?

PostgreSQL version:

postgres=# SHOW server_version;
server_version
----------------
9.2.24
(1 row)

Links:
https://www.andruffsolutions.com/removing-old-host-data-from-ambari-server-and-tuning-the-database/
https://community.cloudera.com/t5/Support-Questions/how-to-remove-old-registered-hosts-from-DB/m-p/217524/highlight/true
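The `db-purge-history` output above refuses to proceed while Ambari Server is running. The same kind of guard can be applied before any manual database edit. This is a minimal Python sketch that checks the PID file path printed by `ambari-server restart`; the function name `ambari_server_running` is illustrative. Whether a running server is what brought the host row back in this case is an assumption, not something the output above confirms.

```python
import os


def ambari_server_running(pid_file="/var/run/ambari-server/ambari-server.pid"):
    """Return True if the PID recorded in pid_file belongs to a live process."""
    try:
        with open(pid_file) as fh:
            pid = int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        # No PID file, or its content is not a number: treat as not running.
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A wrapper script could call `ambari_server_running()` and abort if it returns `True`, so the delete statements only run against a database that no server process is using.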
Labels: Hortonworks Data Platform (HDP)
02-04-2024
10:59 AM
1 Kudo
You can balance disk usage across the data-nodes by decommissioning and then recommissioning them, but with only 2 data-nodes that is a problem; it is better to have at least 3 data-nodes in the cluster.
02-04-2024
10:43 AM
1 Kudo
Let's say I copy the fsimage from the active to the standby namenode and we still have a problem starting the namenode. Can I then follow the steps already mentioned?
02-03-2024
02:20 PM
1 Kudo
We have an HDP Hadoop cluster with two name-node services (one active name-node and one standby name-node). Due to an unexpected power failure, the standby name-node failed to start with the following exception, while the active name-node started successfully:

2024-02-02 08:47:11,497 INFO common.Storage (Storage.java:tryLock(776)) - Lock on /hadoop/hdfs/namenode/in_use.lock acquired by nodename 36146@master1.delax.com
2024-02-02 08:47:11,891 INFO namenode.FSImage (FSImage.java:loadFSImageFile(745)) - Planning to load image: FSImageFile(file=/hadoop/hdfs/namenode/current/fsimage_0000000052670667141, cpktTxId=0000000052670667141)
2024-02-02 08:47:11,897 ERROR namenode.FSImage (FSImage.java:loadFSImage(693)) - Failed to load image from FSImageFile(file=/hadoop/hdfs/namenode/current/fsimage_0000000052670667141, cpktTxId=0000000052670667141)
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:204)
at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:221)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:898)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:882)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:755)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:686)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:303)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1077)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:697)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:761)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1001)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:985)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1710)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1778)
2024-02-02 08:47:12,238 WARN namenode.FSNamesystem (FSNamesystem.java:loadFromDisk(726)) - Encountered exception loading fsimage
java.io.IOException: Failed to load FSImage file, see error(s) above for more info.

From the exception above we can see `Failed to load image from FSImageFile`, which seems to be the result of the machine's unexpected shutdown.

As I understand it, one option for recovering the standby name-node is the following procedure:

1. Put the active NN in safemode
sudo -u hdfs hdfs dfsadmin -safemode enter
2. Do a saveNamespace operation on the active NN
sudo -u hdfs hdfs dfsadmin -saveNamespace
3. Leave safemode
sudo -u hdfs hdfs dfsadmin -safemode leave
4. Log in to the standby NN
5. Run the command below on the standby namenode to get the latest fsimage that we saved in the steps above
sudo -u hdfs hdfs namenode -bootstrapStandby -force

We would be glad to receive any suggestions, or confirmation that the procedure above is good enough for our problem.
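A `Premature EOF` while loading the image suggests the file on disk is shorter than expected. Before and after running the procedure above, the fsimage can be compared with its checksum file. This is a minimal Python sketch that assumes each `fsimage_N` has a sidecar `fsimage_N.md5` next to it containing a line of the form `<32 hex digits> *<file name>`; that layout is an assumption to verify on your own name-node directory, and the function name `fsimage_matches_md5` is illustrative.

```python
import hashlib
import re
from pathlib import Path

# Assumed sidecar format: "<md5 hex> *<file name>" (or a space instead of '*').
MD5_LINE = re.compile(r"([0-9a-f]{32}) [ *](.+)")


def fsimage_matches_md5(image_path):
    """Return True if the image's MD5 equals the digest stored in <image>.md5."""
    image = Path(image_path)
    sidecar = image.with_name(image.name + ".md5")
    m = MD5_LINE.fullmatch(sidecar.read_text().strip())
    if m is None:
        raise ValueError(f"unrecognised checksum line in {sidecar}")
    digest = hashlib.md5()
    with image.open("rb") as fh:
        # Read in 1 MiB chunks so large images are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == m.group(1)
```

For the file named in the log this would be `fsimage_matches_md5("/hadoop/hdfs/namenode/current/fsimage_0000000052670667141")`; a `False` result on the standby is consistent with a truncated image.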
Labels: HDFS, Hortonworks Data Platform (HDP)
02-03-2024
02:17 PM
2 Kudos
Can the following procedure also help?

1. Put the active NN in safemode
sudo -u hdfs hdfs dfsadmin -safemode enter
2. Do a saveNamespace operation on the active NN
sudo -u hdfs hdfs dfsadmin -saveNamespace
3. Leave safemode
sudo -u hdfs hdfs dfsadmin -safemode leave
4. Log in to the standby NN
5. Run the command below on the standby namenode to get the latest fsimage that we saved in the steps above
sudo -u hdfs hdfs namenode -bootstrapStandby -force
02-22-2023
08:39 AM
We have an HDP cluster, version 2.6.5. When we look at the name-node logs we see the following warnings:
2023-02-20 15:58:31,377 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction
2023-02-20 16:00:39,037 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction
2023-02-20 16:01:43,962 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193594954980-193594954980 took 1329ms
2023-02-20 16:02:47,129 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595018764-193595018764 took 1321ms
2023-02-20 16:03:52,763 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595106645-193595106646 took 1344ms
2023-02-20 16:04:56,276 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595175233-193595175233 took 1678ms
2023-02-20 16:06:01,067 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595252052-193595252052 took 1265ms
2023-02-20 16:07:06,447 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595320796-193595320796 took 1273ms

In our HDP cluster, the HDFS service includes 2 name-node services and 3 JournalNodes. The cluster has 736 data-node machines, all managed by the HDFS service.

We want to understand the reason for the following warning, and how to avoid these messages proactively:

server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595018764-193595018764 took 1321ms
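The warning reports how long a sync of the edit log took. A minimal Python sketch, assuming that time is dominated by flushing to the disk that holds the JournalNode edits directory, is to measure `fsync` latency on that disk directly. The function name `fsync_latencies_ms` and any directory passed to it are hypothetical; substitute the directory your JournalNode actually writes to.

```python
import os
import tempfile
import time


def fsync_latencies_ms(directory, samples=20, size=4096):
    """Write `size` bytes and fsync, `samples` times; return each fsync time in ms."""
    payload = b"\0" * size
    results = []
    # The scratch file is created inside `directory` so the same disk is exercised.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.monotonic()
            os.fsync(fd)
            results.append((time.monotonic() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return results
```

If `max(fsync_latencies_ms(journal_dir))` regularly exceeds a second while the cluster is busy, the disk under the edits directory is a likely contributor, for example because it is shared with other heavy writers.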
Labels: Ambari Blueprints