Member since
08-08-2017
1652
Posts
30
Kudos Received
11
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1978 | 06-15-2020 05:23 AM |
| | 16111 | 01-30-2020 08:04 PM |
| | 2115 | 07-07-2019 09:06 PM |
| | 8265 | 01-27-2018 10:17 PM |
| | 4688 | 12-31-2017 10:12 PM |
01-12-2023
12:09 AM
I also want to say that restarting the NodeManager, or fully restarting the YARN service, fixed the problem. But as you know, this isn't the right solution to apply every time one of the NodeManagers dies.
01-11-2023
11:36 PM
Dear @Shelton, it has been a long time, and I am glad to see you back on one of my questions.

Since we are talking about the NodeManager, my goal is to avoid cases where the NodeManager service dies or goes out of sync with the ResourceManager. Please forgive me, but I don't understand why you are talking about DataNodes and excluding a DataNode from the cluster, because the question is on a different subject. As I mentioned, we want to understand the root cause of the lost NodeManagers and which proactive steps can prevent this kind of problem.

Additionally, as I understand it, most of these problems are the result of a bad network that breaks the connectivity between the NodeManager and the ResourceManager. Since this does happen from time to time, we are trying to set a configuration that keeps the cluster stable in spite of networking or infrastructure problems.

Let me know if my question is clear so we can continue the discussion, and sorry again if my first post was not clear.
01-11-2023
08:33 AM
We have a huge production Hadoop cluster with HDP version 2.6.5 and Ambari version 2.6.2.2. All machines run RHEL 7.6.

The cluster size is as follows: 425 worker machines in total (each worker runs the DataNode and NodeManager services).

From time to time we get an indication that one or two **node-manager** services are lost. Ambari shows this as 424/425 when the total number of NodeManagers is 425. To fix it we just restart the **node-manager**, and after that we get 425/425 again.

After some googling, we found the following parameters that may need better tuning:

- yarn.client.nodemanager-connect.max-wait-ms (configured to 60000 ms; we are thinking of increasing it)
- yarn.client.nodemanager-connect.retry-interval-ms (configured to 10 seconds, i.e. 10000 ms; we are thinking of increasing it)
- yarn.nm.liveness-monitor.expiry-interval-ms (not configured yet; we are thinking of adding it with a value of 1500000 ms, i.e. 25 minutes)

Based on the details above, I would appreciate comments or other ideas.

Background: a NodeManager is LOST when the ResourceManager has not received heartbeats from it for a duration of yarn.nm.liveness-monitor.expiry-interval-ms milliseconds (the default is 10 minutes).
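To watch for this condition without waiting for the Ambari counter to drop, one option is to poll the ResourceManager REST endpoint `/ws/v1/cluster/nodes` and list every node whose state is not `RUNNING`. The sketch below only covers the parsing step and assumes the JSON response has already been fetched. The host names in `SAMPLE` are hypothetical and are not taken from the cluster described above.

```python
import json
from typing import List, Tuple


def nodes_not_running(rm_nodes_json: str) -> List[Tuple[str, str]]:
    """Return (hostname, state) for every NodeManager that is not RUNNING.

    Expects the body of GET /ws/v1/cluster/nodes from the ResourceManager,
    which wraps the node list as {"nodes": {"node": [...]}}.
    """
    payload = json.loads(rm_nodes_json)
    nodes = (payload.get("nodes") or {}).get("node") or []
    return [
        (n.get("nodeHostName", n.get("id", "?")), n.get("state", "UNKNOWN"))
        for n in nodes
        if n.get("state") != "RUNNING"
    ]


# Hypothetical response with one healthy and one lost NodeManager.
SAMPLE = json.dumps({"nodes": {"node": [
    {"id": "worker001:45454", "nodeHostName": "worker001", "state": "RUNNING"},
    {"id": "worker002:45454", "nodeHostName": "worker002", "state": "LOST"},
]}})

if __name__ == "__main__":
    for host, state in nodes_not_running(SAMPLE):
        print(host, state)
```

Running this periodically and logging the timestamps of each LOST transition would make it easier to correlate the events with network incidents before changing any timeout.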
Labels:
Ambari Blueprints
12-04-2022
05:58 AM
We want to find the **approach / test / CLI / API** that reports Heartbeat Lost between the Ambari agent and the Ambari server.

Heartbeat Lost can be the result of a poor connection between the Ambari agent and the Ambari server, or of the Ambari server being down for a long time, etc.

Note: in the Ambari GUI, a machine in the **HeartBeat Lost** state is usually colored yellow.

Clarification: the case described here appears while `ambari-agent status` reports a running state, as follows:

    ambari-agent status
    Found ambari-agent PID: 119315
    ambari-agent running.
    Agent PID at: /run/ambari-agent/ambari-agent.pid
    Agent out at: /var/log/ambari-agent/ambari-agent.out
    Agent log at: /var/log/ambari-agent/ambari-agent.log
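One possible approach is the Ambari REST API, which exposes a `Hosts/host_state` field per host. The sketch below assumes that this field takes the value `HEARTBEAT_LOST` for such hosts and that the hosts collection is available under `/api/v1/clusters/<cluster>/hosts`. The server URL, cluster name, and host names are hypothetical. Fetching the URL (with Ambari credentials) is left out, so only the URL construction and response parsing are shown.

```python
import json
from typing import List
from urllib.parse import quote


def hosts_state_url(base_url: str, cluster: str) -> str:
    """Build the Ambari URL that returns host_state for every host in a cluster."""
    return (f"{base_url.rstrip('/')}/api/v1/clusters/{quote(cluster)}"
            "/hosts?fields=Hosts/host_state")


def heartbeat_lost_hosts(hosts_json: str) -> List[str]:
    """Return the sorted host names whose host_state is HEARTBEAT_LOST."""
    payload = json.loads(hosts_json)
    return sorted(
        item["Hosts"]["host_name"]
        for item in payload.get("items", [])
        if item.get("Hosts", {}).get("host_state") == "HEARTBEAT_LOST"
    )


# Hypothetical response body with one healthy host and one lost heartbeat.
SAMPLE = json.dumps({"items": [
    {"Hosts": {"cluster_name": "c1", "host_name": "worker01", "host_state": "HEALTHY"}},
    {"Hosts": {"cluster_name": "c1", "host_name": "worker02", "host_state": "HEARTBEAT_LOST"}},
]})

if __name__ == "__main__":
    print(hosts_state_url("http://ambari.example.com:8080", "c1"))
    print(heartbeat_lost_hosts(SAMPLE))
```

Because the check runs against the server's view of the agents, it can flag a host even when `ambari-agent status` on that host still reports the agent as running.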
Labels:
Ambari Blueprints
10-30-2022
05:49 AM
So based on the doc, it seems that we need to increase CMSInitiatingOccupancyFraction from the default 70% to a higher value, for example 85%. Do you agree with that?
10-28-2022
01:50 AM
Since we have Ambari, do you mean that we need to find the GC settings in the yarn-env template?
10-28-2022
12:32 AM
We have an old Hadoop cluster based on Hortonworks HDP 2.6.4. The cluster includes two NameNode services, one active and one standby. All machines in the cluster run RHEL 7.2, and we do not see any problem at the OS level. The cluster also includes 12 worker machines (each worker runs the DataNode and NodeManager services).

The story began when we got alerts from the smoke test script complaining about "`Detected pause in JVM or host machine`" on the standby NameNode. Based on that, we decided to increase the NameNode heap size from 60G to 100G. This setting was based on a table that shows how much memory to set according to the number of files in HDFS. We then restarted the HDFS service.

After HDFS had completely restarted, we still see the `Detected pause in JVM or host machine` messages. This is really strange, because we almost doubled the NameNode heap size.

So we started deeper testing, for example with `jstat`. From jstat we get very low FGCT values, which is good and does not point to a NameNode heap problem (1837 is the HDFS PID number):

    # /usr/jdk64/jdk1.8.0_112/bin/jstat -gcutil 1837 10 10
      S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
      0.00   1.95  32.30  34.74  97.89      -    197  173.922     2    1.798  175.720
      0.00   1.95  32.30  34.74  97.89      -    197  173.922     2    1.798  175.720
      0.00   1.95  32.30  34.74  97.89      -    197  173.922     2    1.798  175.720
      0.00   1.95  32.30  34.74  97.89      -    197  173.922     2    1.798  175.720
      0.00   1.95  32.30  34.74  97.89      -    197  173.922     2    1.798  175.720
      0.00   1.95  32.30  34.74  97.89      -    197  173.922     2    1.798  175.720
      0.00   1.95  32.30  34.74  97.89      -    197  173.922     2    1.798  175.720
      0.00   1.95  32.30  34.74  97.89      -    197  173.922     2    1.798  175.720
      0.00   1.95  32.30  34.74  97.89      -    197  173.922     2    1.798  175.720
      0.00   1.95  32.30  34.74  97.89      -    197  173.922     2    1.798  175.720

And here are the messages from the NameNode log:

    2022-10-27 14:04:49,728 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2044ms
    2022-10-27 16:21:33,973 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2524ms
    2022-10-27 17:31:35,333 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2444ms
    2022-10-27 18:55:55,387 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2134ms
    2022-10-27 19:42:00,816 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2153ms
    2022-10-27 20:50:23,624 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2050ms
    2022-10-27 21:07:01,240 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2343ms
    2022-10-27 23:53:00,507 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2120ms
    2022-10-28 00:43:30,633 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1811ms
    2022-10-28 00:53:35,120 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2192ms
    2022-10-28 02:07:39,660 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2353ms
    2022-10-28 02:49:25,018 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1698ms
    2022-10-28 03:00:20,592 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2432ms
    2022-10-28 05:02:15,093 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 2016ms
    2022-10-28 06:52:46,672 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approxim

As we can see, a message about a pause in `JVM or host machine` appears every 1-2 hours. We checked the number of files in HDFS, and it is 7 million files.

So what else can we do? We could increase the NameNode heap size a little more, but my feeling is that the heap size is already enough.
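Since YGCT and FGCT in `jstat -gcutil` are cumulative seconds, dividing each by its collection count gives an average pause per collection, which can be compared with the roughly 2-second pauses in the log. The following is a minimal sketch of that arithmetic using the header and one row from the jstat sample in this post. The helper names are made up for illustration. An average says nothing about how long any individual collection took, so this is only a hint about where to look next, not a diagnosis.

```python
from typing import Dict

# Header and one data row copied from the jstat -gcutil sample in this post.
HEADER = "S0 S1 E O M CCS YGC YGCT FGC FGCT GCT"
ROW = "0.00 1.95 32.30 34.74 97.89 - 197 173.922 2 1.798 175.720"


def parse_gcutil(header: str, row: str) -> Dict[str, str]:
    """Map each jstat column name to its (string) value."""
    return dict(zip(header.split(), row.split()))


def avg_pause_seconds(sample: Dict[str, str], count_col: str, time_col: str) -> float:
    """Average seconds per collection: cumulative time divided by event count."""
    count = int(sample[count_col])
    return float(sample[time_col]) / count if count else 0.0


if __name__ == "__main__":
    sample = parse_gcutil(HEADER, ROW)
    print("avg young GC pause (s):", round(avg_pause_seconds(sample, "YGC", "YGCT"), 3))
    print("avg full GC pause (s):", round(avg_pause_seconds(sample, "FGC", "FGCT"), 3))
```

With these numbers the young-generation average is a little under one second per collection, so it may be worth checking the GC log for individual young collections around the timestamps reported by JvmPauseMonitor, in addition to looking at FGCT.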
Labels:
Apache Hadoop
09-12-2022
03:57 AM
Hi all,

We have an Ambari HDP cluster (HDP version 2.6.4) with 420 Linux worker machines (each worker runs the DataNode and NodeManager services).

Unfortunately the Ambari DB is damaged and we do not have an Ambari DB dump, so we can't recover it. In practice this means we have no Ambari and no Ambari GUI.

However, the HDFS disks on the worker machines still hold the HDFS data, and the NameNode is still working with all its data ( journal/hdfsha/current/ and namenode/current ). So HDFS works without Ambari.

Given all this, is it possible to install a new Ambari cluster from scratch and then add the existing, working HDFS data to it? Does Hortonworks / Cloudera have a procedure for this process?
Labels:
HDFS
08-20-2022
02:23 PM
First, thank you so much for your help. I see the following example in the post:

[{"ConfigGroup":{"id":2,"cluster_name":"c1","group_name":"A config group","tag":"HDFS","description":"A config group","hosts":[{"host_name":"host1"}],"service_config_version_note":"change","desired_configs":[{"type":"hdfs-site","tag":"version1443587493807","properties":{"dfs.replication":"2","dfs.datanode.du.reserved":"1073741822"}}]}}]

I would appreciate a full example of how to run this API, using curl or the full Ambari API.

Note, about version1443587493807: is this version number a "random" number that I need to set?
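As a starting point, the request body from that example can be generated rather than typed by hand. The sketch below rests on two assumptions that should be checked against the Ambari documentation: that the tag only needs to be unique per config type, and that `version1443587493807` is simply `version` followed by an epoch timestamp in milliseconds. The function names are hypothetical, and the `id` and `service_config_version_note` fields from the quoted example are left out here.

```python
import json
import time
from typing import Dict, List, Optional


def new_config_tag(now_ms: Optional[int] = None) -> str:
    """Build a tag such as 'version1443587493807' (assumed: 'version' + epoch millis)."""
    ms = int(time.time() * 1000) if now_ms is None else now_ms
    return f"version{ms}"


def build_config_group(cluster: str, group_name: str, service_tag: str,
                       hosts: List[str], config_type: str,
                       properties: Dict[str, str], tag: str) -> str:
    """Return a JSON body shaped like the ConfigGroup example quoted in this post."""
    body = [{"ConfigGroup": {
        "cluster_name": cluster,
        "group_name": group_name,
        "tag": service_tag,
        "description": group_name,
        "hosts": [{"host_name": h} for h in hosts],
        "desired_configs": [
            {"type": config_type, "tag": tag, "properties": properties},
        ],
    }}]
    return json.dumps(body)


if __name__ == "__main__":
    print(build_config_group(
        "c1", "A config group", "HDFS", ["host1"], "hdfs-site",
        {"dfs.replication": "2"}, new_config_tag(),
    ))
```

The resulting JSON would then be sent to the config groups endpoint of the cluster (assumed to be `/api/v1/clusters/c1/config_groups`) with Ambari credentials and the `X-Requested-By` header, for example via `curl -d @file.json`.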
08-20-2022
02:13 PM
Hi smohanty, can you show me a full example (for instance with curl) of how to configure dfs.replication? You mentioned version1443587493807. How should I choose this "version number"?