Created 10-01-2017 02:39 PM
Hello
We have a 9-DataNode CentOS 6 cluster running HDP 2.6.1 & Ambari 2.5.1.
Lately we've noticed that one DataNode lost its heartbeat in Ambari. We restarted the Ambari agent on that node, but it didn't help.
We restarted the entire DataNode server, but no luck.
We did notice that:
- /usr/bin/python /usr/lib/python2.6/site-packages/ambari_agent/main.py start is always at 100% CPU (centos6), while on all the other nodes it barely uses 1%.
- The ambari-agent.log shows no errors besides a timeout while checking the JCE. It runs the following command:
"/usr/jdk64/jdk1.8.0_77/bin/java -jar /var/lib/ambari-agent/tools/jcepolicyinfo.jar -tu" and receives a timeout.
Running the command manually does return a result, but only after 6 seconds, while on the other DataNodes it takes just 1 second (hence the timeout, which is 5 seconds).
- The ambari-agent version is 2.5.1.0-159.x86_64, the same as on all the other DataNodes.
- The Java version is 1.8.0_77, the same as on all the other DataNodes.
- The NodeManager works without a problem on that DataNode and runs applications fine. It's just the heartbeat/ambari-agent issue.
- Name resolution & network connectivity between the DataNode and Ambari are working fine.
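The JCE check can be reproduced outside the agent. A minimal sketch (the JDK and jar paths in the comment are the ones from this cluster; adjust for yours) that runs a command under the same 5-second budget the agent applies, so you can compare a slow node against a healthy one:

```shell
# Run a command under a fixed timeout and report the outcome,
# the way the agent's 5-second JCE check effectively behaves.
check_with_timeout() {
    budget="$1"; shift          # $1 = timeout in seconds, rest = command
    if timeout "$budget" "$@" >/dev/null 2>&1; then
        echo "ok"
    else
        rc=$?
        # GNU timeout exits with status 124 when the deadline is hit
        if [ "$rc" -eq 124 ]; then echo "timeout"; else echo "failed"; fi
    fi
}

# What the agent effectively does on each heartbeat (paths from this cluster):
# check_with_timeout 5 /usr/jdk64/jdk1.8.0_77/bin/java \
#     -jar /var/lib/ambari-agent/tools/jcepolicyinfo.jar -tu
```

On the affected node this reports "timeout" (the check needs ~6 seconds); on the healthy nodes it reports "ok".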
Any advice will be appreciated.
Adi
Created 10-03-2017 12:08 PM
Problem solved!
In case anyone else encounters the following:
1. Servers losing heartbeat for no apparent reason
2. The Ambari agent always hogging 100% CPU
3. Running "yarn application -list" produces results, but more slowly than on other servers
4. The server is slow in general
The fix was to set the CPU profile in the BIOS to "Maximum Performance".
For some reason the server (a Dell in our case, but every BIOS out there has an equivalent setting) was set to a default CPU profile that throttles resources for low-voltage operation.
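If you suspect the same issue, you can often see the symptom from inside Linux before touching the BIOS. A small sketch using the standard cpufreq sysfs interface (no Ambari involved; paths may be absent on some VMs) that compares each core's governor and current clock against its maximum; on a box stuck in a low-power profile the current frequency sits well below the max:

```shell
# Print governor and current vs. max clock for every core that exposes
# cpufreq via sysfs (frequency values are in kHz on Linux).
report_core() {
    # $1 = core name, $2 = governor, $3 = current kHz, $4 = max kHz
    printf '%s: governor=%s cur=%skHz max=%skHz\n' "$1" "$2" "$3" "$4"
}

for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    f="$cpu/cpufreq"
    [ -d "$f" ] || continue    # skip cores without cpufreq exposed
    report_core "${cpu##*/}" \
        "$(cat "$f/scaling_governor" 2>/dev/null)" \
        "$(cat "$f/scaling_cur_freq" 2>/dev/null)" \
        "$(cat "$f/cpuinfo_max_freq" 2>/dev/null)"
done
```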
Created 10-01-2017 03:27 PM
More info:
In ambari-server.log I can see that it acknowledges the loss of heartbeat, but it seems the two do communicate:
01 Oct 2017 18:22:39,870 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:159 - Heartbeat lost from host hdp-dn01-drp.hadoop.local
01 Oct 2017 18:22:39,872 INFO [ambari-hearbeat-monitor] TopologyManager:671 - Hearbeat for host hdp-dn01-drp.hadoop.local lost thus removing it from available hosts.
01 Oct 2017 18:22:39,872 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:174 - Setting component state to UNKNOWN for component METRICS_MONITOR on hdp-dn01-drp.hadoop.local
01 Oct 2017 18:22:39,872 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:174 - Setting component state to UNKNOWN for component DRUID_MIDDLEMANAGER on hdp-dn01-drp.hadoop.local
01 Oct 2017 18:22:39,872 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:174 - Setting component state to UNKNOWN for component DRUID_HISTORICAL on hdp-dn01-drp.hadoop.local
01 Oct 2017 18:22:39,872 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:174 - Setting component state to UNKNOWN for component DATANODE on hdp-dn01-drp.hadoop.local
01 Oct 2017 18:22:39,873 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:174 - Setting component state to UNKNOWN for component NODEMANAGER on hdp-dn01-drp.hadoop.local
01 Oct 2017 18:23:23,042 WARN [qtp-ambari-agent-1563184] HeartBeatHandler:235 - Host is in HEARTBEAT_LOST state - sending register command
01 Oct 2017 18:23:30,029 INFO [qtp-ambari-agent-1563184] HeartBeatHandler:425 - agentOsType = centos6
01 Oct 2017 18:23:30,045 INFO [qtp-ambari-agent-1563184] HostImpl:329 - Received host registration, host=[hostname=hdp-dn01-drp,fqdn=hdp-dn01-drp.hadoop.local,domain=hadoop.local,architecture=x86_64,processorcount=12,physicalprocessorcount=12,osname=centos,osversion=6.8,osfamily=redhat,memory=65901056,uptime_hours=0,mounts=(available=38553776,mountpoint=/,used=10299852,percent=22%,size=51475068,device=/dev/mapper/vg_system-LogVol00,type=ext4)(available=390059,mountpoint=/boot,used=71993,percent=16%,size=487652,device=/dev/sda2,type=ext4)(available=204304,mountpoint=/boot/efi,used=276,percent=1%,size=204580,device=/dev/sda1,type=vfat)(available=1654503504,mountpoint=/grid/0,used=110677756,percent=7%,size=1859652852,device=/dev/sda5,type=ext4)(available=1713507724,mountpoint=/grid/1,used=111041964,percent=7%,size=1922198324,device=/dev/sdb1,type=ext4)(available=1716047400,mountpoint=/grid/2,used=108502288,percent=6%,size=1922198324,device=/dev/sdc1,type=ext4)(available=1716667196,mountpoint=/grid/3,used=107882492,percent=6%,size=1922198324,device=/dev/sdd1,type=ext4)(available=1709492804,mountpoint=/grid/4,used=115056884,percent=7%,size=1922198324,device=/dev/sde1,type=ext4)(available=1710666700,mountpoint=/grid/5,used=113882988,percent=7%,size=1922198324,device=/dev/sdf1,type=ext4)(available=1709508880,mountpoint=/grid/6,used=115040808,percent=7%,size=1922198324,device=/dev/sdg1,type=ext4)(available=1705253584,mountpoint=/grid/7,used=119296104,percent=7%,size=1922198324,device=/dev/sdh1,type=ext4)(available=1708647680,mountpoint=/grid/8,used=115902008,percent=7%,size=1922198324,device=/dev/sdi1,type=ext4)(available=1713886116,mountpoint=/grid/9,used=110663572,percent=7%,size=1922198324,device=/dev/sdj1,type=ext4)(available=1711301604,mountpoint=/grid/10,used=113248084,percent=7%,size=1922198324,device=/dev/sdk1,type=ext4)(available=1712490508,mountpoint=/grid/11,used=112059180,percent=7%,size=1922198324,device=/dev/sdl1,type=ext4)], registrationTime=1506871410029, agentVersion=2.5.1.0
01 Oct 2017 18:23:30,045 INFO [qtp-ambari-agent-1563184] TopologyManager:592 - TopologyManager.onHostRegistered: Entering
01 Oct 2017 18:23:30,045 INFO [qtp-ambari-agent-1563184] TopologyManager:594 - TopologyManager.onHostRegistered: host = hdp-dn01-drp.hadoop.local is already associated with the cluster or is currently being processed
01 Oct 2017 18:23:30,052 INFO [qtp-ambari-agent-1563184] HeartBeatHandler:504 - Recovery configuration set to RecoveryConfig{, type=AUTO_START, maxCount=6, windowInMinutes=60, retryGap=5, maxLifetimeCount=1024, components=null, recoveryTimestamp=1506871410051}