Member since: 08-08-2017
Posts: 1652
Kudos Received: 30
Solutions: 11

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2186 | 06-15-2020 05:23 AM |
| | 18807 | 01-30-2020 08:04 PM |
| | 2345 | 07-07-2019 09:06 PM |
01-11-2023
11:36 PM
Dear @Shelton, long time since we last met, and glad to see you back on my question. Since we are talking about the NodeManager, my goal is to avoid cases where the node-manager service dies or loses sync with the ResourceManager. Please forgive me, but I do not understand why you are talking about the DataNode and about excluding a DataNode from the cluster, because the question is about a different subject. As I mentioned, we want to understand the root cause of a lost NodeManager and which proactive steps we can take to avoid such problems. Additionally, as I understand it, most of these problems are the result of a bad network that breaks the connectivity between the NodeManager and the ResourceManager. So even though this behavior sometimes happens, we are trying to find the configuration that keeps the cluster stable in spite of networking or infrastructure problems. Let me know if my question is clear so we can continue our discussion, and sorry again if my first post was not clear.
01-11-2023
08:33 AM
We have a huge production Hadoop cluster with HDP version 2.6.5 and Ambari version 2.6.2.2, and all machines run RHEL 7.6. The cluster size is as follows: 425 worker machines in total (each worker includes the DataNode and NodeManager services).

From time to time we get an indication that one or two **node-manager** services are lost, which is identified in Ambari as 424/425 when the total number of NodeManagers is 425. To fix it we just restart the **node-manager**, and this action resolves the problem, so we get 425/425 again.

After some googling, we found the following parameters that maybe should be tuned better:

- yarn.client.nodemanager-connect.max-wait-ms (configured to 60000 ms; we are thinking of increasing it)
- yarn.client.nodemanager-connect.retry-interval-ms (configured to 10 seconds, i.e. 10000 ms; we are thinking of increasing it)
- yarn.nm.liveness-monitor.expiry-interval-ms (not configured yet; we are thinking of adding it with a value of 1500000 ms)

Based on the above details, I would appreciate comments or other ideas.

Background: a NodeManager is LOST when the ResourceManager has not received heartbeats from it for yarn.nm.liveness-monitor.expiry-interval-ms milliseconds (the default is 10 minutes).
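To see which NodeManagers the ResourceManager currently considers LOST, and to confirm which values are actually in effect, something like the following can be run on a ResourceManager host. This is a minimal sketch; the yarn-site.xml path assumes a standard HDP 2.6.x layout and may need adjusting.

```
# List nodes as the ResourceManager sees them; lost nodes show up with state LOST
yarn node -list -all

# Confirm the effective values of the heartbeat/retry settings on this host
# (config path assumed for HDP 2.6.x; adjust if your config dir differs)
grep -A1 -E 'yarn.client.nodemanager-connect.max-wait-ms|yarn.client.nodemanager-connect.retry-interval-ms|yarn.nm.liveness-monitor.expiry-interval-ms' \
  /etc/hadoop/conf/yarn-site.xml
```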
Labels:
- Ambari Blueprints
12-04-2022
05:58 AM
We want to find the **approach / test / CLI / API** that tells us about a HeartBeat Lost state between an Ambari agent and the Ambari server. HeartBeat Lost could be the result of a poor connection between the Ambari agent and the Ambari server, or of the Ambari server having been down for a long time, etc.

Note: in the Ambari GUI, a machine in the **HeartBeat Lost** state is usually colored yellow.

Clarification: the case described here appears even when `ambari-agent status` reports a running state, as follows:

ambari-agent status
Found ambari-agent PID: 119315
ambari-agent running.
Agent PID at: /run/ambari-agent/ambari-agent.pid
Agent out at: /var/log/ambari-agent/ambari-agent.out
Agent log at: /var/log/ambari-agent/ambari-agent.log
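One option is the Ambari REST API, which exposes a per-host state field that can be polled from a script. The sketch below assumes the server listens on port 8080, admin credentials, and a cluster named MYCLUSTER; all of these are placeholders to adjust for your environment.

```
# Query the state of every host in the cluster; HEARTBEAT_LOST indicates the
# server has stopped receiving heartbeats from that host's agent
curl -s -u admin:admin \
  "http://ambari-server:8080/api/v1/clusters/MYCLUSTER/hosts?fields=Hosts/host_state,Hosts/last_heartbeat_time" \
  | grep -E 'host_name|host_state'
```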
Labels:
- Ambari Blueprints
10-30-2022
05:49 AM
So based on the doc, it seems that we need to increase CMSInitiatingOccupancyFraction from the default of 70% to a higher value, for example 85%. Do you agree with that?
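If we go that way, the change would land in the NameNode JVM options. A hypothetical sketch of the relevant fragment in hadoop-env (managed through Ambari); only the two CMS flags are the point, the surrounding options are assumptions:

```
# Hypothetical fragment of HADOOP_NAMENODE_OPTS in hadoop-env;
# raising the threshold delays the start of concurrent CMS cycles
export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=85 \
  -XX:+UseCMSInitiatingOccupancyOnly"
```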
10-28-2022
01:50 AM
Since we have Ambari, do you mean that we need to find the GC settings in the yarn-env template?
10-28-2022
12:32 AM
We have an old Hadoop cluster based on HDP from Hortonworks, version HDP 2.6.4. The cluster includes 2 NameNode services, one standby NameNode and one active NameNode. All machines in the cluster are RHEL 7.2, and we do not see any problem at the OS level. The cluster also includes 12 worker machines (each worker includes the DataNode and NodeManager services).

The story began when we got alerts from the smoke-test script complaining about "`Detected pause in JVM or host machine`" on the standby NameNode. Based on that, we decided to increase the NameNode heap size from 60G to 100G. This setting was based on a table that shows how much memory to set according to the number of files in HDFS, and according to that table we set the NameNode heap size to 100G and then restarted the HDFS service.

After HDFS was completely restarted, we still see the messages about `Detected pause in JVM or host machine`, which is really strange because we almost doubled the NameNode heap size. So we started deeper testing, for example with `jstat`, which reports very low FGCT values; these are really good values and do not point to a NameNode heap problem (1837 is the HDFS NameNode PID):

# /usr/jdk64/jdk1.8.0_112/bin/jstat -gcutil 1837 10 10
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720
0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720

And here are the messages from the NameNode logs:

2022-10-27 14:04:49,728 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2044ms
2022-10-27 16:21:33,973 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2524ms
2022-10-27 17:31:35,333 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2444ms
2022-10-27 18:55:55,387 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2134ms
2022-10-27 19:42:00,816 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2153ms
2022-10-27 20:50:23,624 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2050ms
2022-10-27 21:07:01,240 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2343ms
2022-10-27 23:53:00,507 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2120ms
2022-10-28 00:43:30,633 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1811ms
2022-10-28 00:53:35,120 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2192ms
2022-10-28 02:07:39,660 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2353ms
2022-10-28 02:49:25,018 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1698ms
2022-10-28 03:00:20,592 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2432ms
2022-10-28 05:02:15,093 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2016ms
2022-10-28 06:52:46,672 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approxim

As we can see, a message about a pause in `JVM or host machine` appears every 1-2 hours. We checked the number of files in HDFS and it is 7 million files. So what else can we do? We can increase the NameNode heap size a little more, but my feeling is that the heap size is really enough.
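Since `jstat` looks healthy, one way to separate a real GC pause from a host-level pause (swapping, transparent huge pages, a noisy hypervisor) is to enable GC logging for the NameNode and compare its timestamps against the JvmPauseMonitor messages. This is a hedged sketch for JDK 8, assuming the options are added to HADOOP_NAMENODE_OPTS through Ambari; the log path is only an example.

```
# Hypothetical GC-logging options for the NameNode JVM (JDK 8 syntax);
# if a JvmPauseMonitor pause has no matching long GC entry in this log,
# the pause came from the host (swap, THP, hypervisor), not the heap
export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/hadoop/hdfs/namenode-gc.log"
```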
Labels:
- Apache Hadoop
09-12-2022
03:57 AM
Hi all,

We have an Ambari HDP cluster (HDP version 2.6.4) with 420 worker Linux machines (each worker includes the DataNode and NodeManager services).

Unfortunately the Ambari DB is damaged, and we do not have an Ambari DB dump, so we cannot recover the Ambari DB; in practice we have no Ambari and no Ambari GUI. But the HDFS disks on the worker machines still contain the HDFS data, and the NameNode is still working with all of its data (journal/hdfsha/current/ and namenode/current), so HDFS works without Ambari.

Given all of the above: is it possible to install a new Ambari cluster from scratch and then add the existing, working HDFS data to that cluster? Does Hortonworks / Cloudera have a procedure for this process?
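Whatever procedure ends up being used, it seems prudent to snapshot the NameNode metadata before touching anything. A minimal sketch; the directory paths are the usual HDP-style defaults and are assumptions, so adjust them to your hdfs-site.xml:

```
# Save a fresh fsimage and copy the metadata directories aside before
# any Ambari reinstall is attempted (run as the hdfs user on the active NameNode)
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave

# Offline copies of the NameNode and JournalNode directories (paths are assumptions)
tar czf /root/nn-meta-backup.tar.gz /hadoop/hdfs/namenode/current /hadoop/hdfs/journal/hdfsha/current
```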
Labels:
- HDFS
12-20-2021
01:57 PM
We have 3 Kafka brokers on Linux RHEL 7.6 (3 Linux machines). The Kafka version is 2.7.X, and the broker IDs are `1010,1011,1012`.

From the kafka-topics describe output we can see the following:

Topic: __consumer_offsets Partition: 0 Leader: none Replicas: 1011,1010,1012 Isr: 1010
Topic: __consumer_offsets Partition: 1 Leader: 1012 Replicas: 1012,1011,1010 Isr: 1012,1011
Topic: __consumer_offsets Partition: 2 Leader: 1011 Replicas: 1010,1012,1011 Isr: 1011,1012
Topic: __consumer_offsets Partition: 3 Leader: none Replicas: 1011,1012,1010 Isr: 1010
Topic: __consumer_offsets Partition: 4 Leader: 1011 Replicas: 1012,1010,1011 Isr: 1011
Topic: __consumer_offsets Partition: 5 Leader: none Replicas: 1010,1011,1012 Isr: 1010

From the ZooKeeper CLI we can see that broker `1010` is not registered:

[zk: localhost:2181(CONNECTED) 10] ls /brokers/ids
[1011, 1012]

And from the `state-change.log` we can see the following:

[2021-12-16 14:15:36,170] WARN [Broker id=1010] Ignoring LeaderAndIsr request from controller 1010 with correlation id 485 epoch 323 for partition __consumer_offsets-6 as the local replica for the partition is in an offline log directory (state.change.logger)
[2021-12-16 14:15:36,170] WARN [Broker id=1010] Ignoring LeaderAndIsr request from controller 1010 with correlation id 485 epoch 323 for partition __consumer_offsets-9 as the local replica for the partition is in an offline log directory (state.change.logger)
[2021-12-16 14:15:36,170] WARN [Broker id=1010] Ignoring LeaderAndIsr request from controller 1010 with correlation id 485 epoch 323 for partition __consumer_offsets-8 as the local replica for the partition is in an offline log directory (state.change.logger)
[2021-12-16 14:15:36,170] WARN [Broker id=1010] Ignoring LeaderAndIsr request from controller 1010 with correlation id 485 epoch 323 for partition __consumer_offsets-11 as the local replica for the partition is in an offline log directory (state.change.logger)
[2021-12-16 14:15:36,170] WARN [Broker id=1010] Ignoring LeaderAndIsr request from controller 1010 with correlation id 485 epoch 323 for partition __consumer_offsets-10 as the local replica for the partition is in an offline log directory (state.change.logger)
[2021-12-16 14:15:36,170] WARN [Broker id=1010] Ignoring LeaderAndIsr request from controller 1010 with correlation id 485 epoch 323 for partition __consumer_offsets-46 as the local replica for the partition is in an offline log directory (state.change.logger)
[2021-12-16 14:15:36,170] WARN [Broker id=1010] Ignoring LeaderAndIsr request from controller 1010 with correlation id 485 epoch 323 for partition __consumer_offsets-45 as the local replica for the partition is in an offline log directory (state.change.logger)
[2021-12-16 14:15:36,170] WARN [Broker id=1010] Ignoring LeaderAndIsr request from controller 1010 with correlation id 485 epoch 323 for partition __consumer_offsets-48 as the local replica for the partition is in an offline log directory (state.change.logger)
[2021-12-16 14:15:36,170] WARN [Broker id=1010] Ignoring LeaderAndIsr request from controller 1010 with correlation id 485 epoch 323 for partition __consumer_offsets-47 as the local replica for the partition is in an offline log directory (state.change.logger)
[2021-12-16 14:15:36,170] WARN [Broker id=1010] Ignoring LeaderAndIsr request from controller 1010 with correlation id 485 epoch 323 for partition __consumer_offsets-49 as the local replica for the partition is in an offline log directory (state.change.logger)

By `ls -ltr` we can see that `controller.log` and `state-change.log` have not been updated since `Dec 16`:

-rwxr-xr-x 1 root kafka 343477146 Dec 16 14:15 controller.log
-rwxr-xr-x 1 root kafka 207911766 Dec 16 14:15 state-change.log
-rw-r--r-- 1 root kafka 68759461 Dec 16 14:15 kafkaServer-gc.log.6.current
-rwxr-xr-x 1 root kafka 6570543 Dec 17 09:42 log-cleaner.log
-rw-r--r-- 1 root kafka 524288242 Dec 20 00:39 server.log.10
-rw-r--r-- 1 root kafka 524289332 Dec 20 01:37 server.log.9
-rw-r--r-- 1 root kafka 524288452 Dec 20 02:35 server.log.8
-rw-r--r-- 1 root kafka 524288625 Dec 20 03:33 server.log.7
-rw-r--r-- 1 root kafka 524288395 Dec 20 04:30 server.log.6
-rw-r--r-- 1 root kafka 524288237 Dec 20 05:27 server.log.5
-rw-r--r-- 1 root kafka 524289136 Dec 20 06:25 server.log.4
-rw-r--r-- 1 root kafka 524288142 Dec 20 07:25 server.log.3
-rw-r--r-- 1 root kafka 524288187 Dec 20 08:21 server.log.2
-rw-r--r-- 1 root kafka 524288094 Dec 20 10:52 server.log.1
-rw-r--r-- 1 root kafka 323361 Dec 20 19:50 kafkaServer-gc.log.0.current
-rw-r--r-- 1 root kafka 323132219 Dec 20 19:50 server.log
-rwxr-xr-x 1 root kafka 15669106 Dec 20 19:50 kafkaServer.out

What we did until now:

- we restarted all 3 ZooKeeper servers
- we restarted all Kafka brokers

But broker `1010` still appears with `Leader: none` on its partitions, and it is still not present in the ZooKeeper data.

**Additional info**

[zk: localhost:2181(CONNECTED) 11] get /controller
{"version":1,"brokerid":1011,"timestamp":"1640003679634"}
cZxid = 0x4900000b0c
ctime = Mon Dec 20 12:34:39 UTC 2021
mZxid = 0x4900000b0c
mtime = Mon Dec 20 12:34:39 UTC 2021
pZxid = 0x4900000b0c
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x27dd7cf43350080
dataLength = 57
numChildren = 0

**From kafka01**

more meta.properties
#
#Tue Nov 16 07:45:36 UTC 2021
cluster.id=D3KpekCETmaNveBJzE6PZg
version=0
broker.id=1010

**Relevant ideas**

On the topics disk we have the following files (in addition to the topic partitions):

-rw-r--r-- 1 root kafka 91 Nov 16 07:45 meta.properties
-rw-r--r-- 1 root kafka 161 Dec 15 16:04 cleaner-offset-checkpoint
-rw-r--r-- 1 root kafka 13010 Dec 15 16:20 replication-offset-checkpoint
-rw-r--r-- 1 root kafka 1928 Dec 17 09:42 recovery-point-offset-checkpoint
-rw-r--r-- 1 root kafka 80 Dec 17 09:42 log-start-offset-checkpoint

Any idea whether deleting one or more of the above files can help with our issue?
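Before deleting any checkpoint files, it may be worth confirming why broker 1010 marked its log directory offline in the first place, since an offline log dir (typically a disk or permission error) is exactly what the LeaderAndIsr warnings describe. A minimal diagnostic sketch, assuming the standard Kafka scripts are on the PATH; the log and data directory paths are placeholders:

```
# Is broker 1010 registered in ZooKeeper right now?
zookeeper-shell.sh localhost:2181 ls /brokers/ids

# Why did the log directory go offline? Look for the I/O error that preceded it
# (server.log path is a placeholder)
grep -iE 'offline log directory|IOException' /var/log/kafka/server.log | tail -n 20

# Can the broker user actually read/write its data directory? (path is a placeholder)
ls -ld /kafka/data && touch /kafka/data/.rw-test && rm /kafka/data/.rw-test
```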
Labels:
- Apache Kafka
06-30-2021
05:10 AM
Can you please share the link/doc that describes the above table?
06-28-2021
06:12 AM
1 Kudo
We have an HDP cluster with 2 ResourceManager services and 190 NodeManager services.

- HDP version - 2.6.5
- YARN version - 2.7.3
- Hadoop platform - Ambari 2.6.2.1

Each NodeManager runs on a Linux VM. Now we want to extend the node-manager machines to 220 machines.

The questions I want to ask: can the ResourceManager support 220 NodeManager services (when each node-manager service is installed on its own Linux machine)? And what is the maximum number of NodeManager services that one ResourceManager can support?
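For reference, the number of NodeManagers currently registered with the active ResourceManager can be read from its REST API, which is an easy way to watch the count while new nodes are added. A small sketch, assuming the RM web UI is on the default port 8088; rm-host is a placeholder:

```
# activeNodes / lostNodes / unhealthyNodes counters from the ResourceManager
curl -s "http://rm-host:8088/ws/v1/cluster/metrics" \
  | grep -oE '"(activeNodes|lostNodes|unhealthyNodes)":[0-9]+'
```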
Labels:
- Apache Ambari