Ambari-Agent high cpu & Datanode without heartbeat

Hello

We have a 9-DataNode CentOS 6 cluster running HDP 2.6.1 and Ambari 2.5.1.
Lately we noticed that one DataNode lost its heartbeat in Ambari. Restarting the Ambari agent on that node didn't help.
Rebooting the entire DataNode server didn't help either.
We did notice that:
- "/usr/bin/python /usr/lib/python2.6/site-packages/ambari_agent/main.py start" is always at 100% CPU on this node, while on all other nodes it barely uses 1%.
- ambari-agent.log shows no errors besides a timeout while checking the JCE. The agent runs the following command:
"/usr/jdk64/jdk1.8.0_77/bin/java -jar /var/lib/ambari-agent/tools/jcepolicyinfo.jar -tu" and it times out.
Running the command manually does return a result, but only after 6 seconds, versus about 1 second on the other DataNodes. Since the agent's timeout is 5 seconds, that explains the timeout.

- The Ambari agent version is 2.5.1.0-159.x86_64, the same as on all other DataNodes.
- The Java version is 1.8.0_77, the same as on all other DataNodes.

- NodeManager works without a problem on that data node and it runs applications fine. It's just the heartbeat/ambari-agent issue.
- Name resolution and the network between the DataNode and the Ambari server are working fine.
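
The timing comparison described above (6 seconds on the affected node, about 1 second elsewhere, against a 5-second timeout) can be reproduced with a small script. This is only a sketch: it assumes a Python 3 interpreter is available on the host (the agent itself runs on Python 2.6), and `time_command` is a hypothetical helper name, not part of Ambari.

```python
import subprocess
import time


def time_command(cmd, timeout_s):
    """Run cmd and return (elapsed_seconds, timed_out).

    timed_out is True when the command did not finish within timeout_s;
    in that case the child process is killed and elapsed is roughly timeout_s.
    """
    start = time.monotonic()
    try:
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                       timeout=timeout_s, check=False)
        timed_out = False
    except subprocess.TimeoutExpired:
        timed_out = True
    return time.monotonic() - start, timed_out


# JCE check command copied from the post; the paths are specific to that cluster.
JCE_CMD = ["/usr/jdk64/jdk1.8.0_77/bin/java", "-jar",
           "/var/lib/ambari-agent/tools/jcepolicyinfo.jar", "-tu"]
```

Calling `time_command(JCE_CMD, 5)` on the affected node and on a healthy node should show whether the command crosses the 5-second limit only on the slow host.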

Any advice will be appreciated.

Adi

1 ACCEPTED SOLUTION

Problem solved!
In case anyone else encounters the following:
1. Servers losing heartbeat for no reason
2. The Ambari agent constantly hogging 100% CPU
3. Running "yarn application -list" produces results, but more slowly than on other servers
4. The server being slow in general

The fix was to set the CPU power profile in the BIOS to "maximum performance".
For some reason the server was set to the default CPU profile, which favors low power consumption over performance. (In our case it was a DELL server, but most server BIOSes have an equivalent setting.)
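
One way to spot a host stuck in a power-saving profile from the OS side is to compare each core's current clock with the nominal clock in its model name. The sketch below parses text in the Linux /proc/cpuinfo format. It assumes an x86 host whose "model name" ends in "@ x.xxGHz"; the function names and the 0.6 threshold are illustrative choices, not an Ambari or vendor tool. Note that idle cores legitimately clock down, so the reading is only meaningful while the host is under load.

```python
import re


def cpu_clock_report(cpuinfo_text):
    """Return a list of (nominal_mhz, current_mhz) tuples, one per logical CPU.

    nominal_mhz is None when the model name carries no "@ x.xxGHz" suffix.
    """
    report = []
    nominal = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip()
        if key == "processor":
            nominal = None  # start of a new CPU block
        elif key == "model name":
            match = re.search(r"@\s*([\d.]+)\s*GHz", value)
            nominal = float(match.group(1)) * 1000.0 if match else None
        elif key == "cpu MHz":
            report.append((nominal, float(value)))
    return report


def throttled_cpus(report, ratio=0.6):
    """Return indexes of CPUs running below ratio * nominal clock.

    The 0.6 default is an arbitrary illustrative threshold.
    """
    return [i for i, (nominal, current) in enumerate(report)
            if nominal is not None and current < ratio * nominal]
```

For example, `throttled_cpus(cpu_clock_report(open("/proc/cpuinfo").read()))` lists the cores that look throttled at that moment; comparing the output between the slow node and a healthy one is more telling than any single reading.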

2 REPLIES

More info:

In ambari-server.log I can see that the server acknowledges the loss of heartbeat, yet the agent and the server do seem to communicate:

01 Oct 2017 18:22:39,870 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:159 - Heartbeat lost from host hdp-dn01-drp.hadoop.local
01 Oct 2017 18:22:39,872 INFO [ambari-hearbeat-monitor] TopologyManager:671 - Hearbeat for host hdp-dn01-drp.hadoop.local lost thus removing it from available hosts.
01 Oct 2017 18:22:39,872 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:174 - Setting component state to UNKNOWN for component METRICS_MONITOR on hdp-dn01-drp.hadoop.local
01 Oct 2017 18:22:39,872 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:174 - Setting component state to UNKNOWN for component DRUID_MIDDLEMANAGER on hdp-dn01-drp.hadoop.local
01 Oct 2017 18:22:39,872 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:174 - Setting component state to UNKNOWN for component DRUID_HISTORICAL on hdp-dn01-drp.hadoop.local
01 Oct 2017 18:22:39,872 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:174 - Setting component state to UNKNOWN for component DATANODE on hdp-dn01-drp.hadoop.local
01 Oct 2017 18:22:39,873 WARN [ambari-hearbeat-monitor] HeartbeatMonitor:174 - Setting component state to UNKNOWN for component NODEMANAGER on hdp-dn01-drp.hadoop.local
01 Oct 2017 18:23:23,042 WARN [qtp-ambari-agent-1563184] HeartBeatHandler:235 - Host is in HEARTBEAT_LOST state - sending register command
01 Oct 2017 18:23:30,029 INFO [qtp-ambari-agent-1563184] HeartBeatHandler:425 - agentOsType = centos6
01 Oct 2017 18:23:30,045 INFO [qtp-ambari-agent-1563184] HostImpl:329 - Received host registration, host=[hostname=hdp-dn01-drp,fqdn=hdp-dn01-drp.hadoop.local,domain=hadoop.local,architecture=x86_64,processorcount=12,physicalprocessorcount=12,osname=centos,osversion=6.8,osfamily=redhat,memory=65901056,uptime_hours=0,mounts=(available=38553776,mountpoint=/,used=10299852,percent=22%,size=51475068,device=/dev/mapper/vg_system-LogVol00,type=ext4)(available=390059,mountpoint=/boot,used=71993,percent=16%,size=487652,device=/dev/sda2,type=ext4)(available=204304,mountpoint=/boot/efi,used=276,percent=1%,size=204580,device=/dev/sda1,type=vfat)(available=1654503504,mountpoint=/grid/0,used=110677756,percent=7%,size=1859652852,device=/dev/sda5,type=ext4)(available=1713507724,mountpoint=/grid/1,used=111041964,percent=7%,size=1922198324,device=/dev/sdb1,type=ext4)(available=1716047400,mountpoint=/grid/2,used=108502288,percent=6%,size=1922198324,device=/dev/sdc1,type=ext4)(available=1716667196,mountpoint=/grid/3,used=107882492,percent=6%,size=1922198324,device=/dev/sdd1,type=ext4)(available=1709492804,mountpoint=/grid/4,used=115056884,percent=7%,size=1922198324,device=/dev/sde1,type=ext4)(available=1710666700,mountpoint=/grid/5,used=113882988,percent=7%,size=1922198324,device=/dev/sdf1,type=ext4)(available=1709508880,mountpoint=/grid/6,used=115040808,percent=7%,size=1922198324,device=/dev/sdg1,type=ext4)(available=1705253584,mountpoint=/grid/7,used=119296104,percent=7%,size=1922198324,device=/dev/sdh1,type=ext4)(available=1708647680,mountpoint=/grid/8,used=115902008,percent=7%,size=1922198324,device=/dev/sdi1,type=ext4)(available=1713886116,mountpoint=/grid/9,used=110663572,percent=7%,size=1922198324,device=/dev/sdj1,type=ext4)(available=1711301604,mountpoint=/grid/10,used=113248084,percent=7%,size=1922198324,device=/dev/sdk1,type=ext4)(available=1712490508,mountpoint=/grid/11,used=112059180,percent=7%,size=1922198324,device=/dev/sdl1,type=ext4)] , registrationTime=1506871410029, agentVersion=2.5.1.0
01 Oct 2017 18:23:30,045 INFO [qtp-ambari-agent-1563184] TopologyManager:592 - TopologyManager.onHostRegistered: Entering
01 Oct 2017 18:23:30,045 INFO [qtp-ambari-agent-1563184] TopologyManager:594 - TopologyManager.onHostRegistered: host = hdp-dn01-drp.hadoop.local is already associated with the cluster or is currently being processed
01 Oct 2017 18:23:30,052 INFO [qtp-ambari-agent-1563184] HeartBeatHandler:504 - Recovery configuration set to RecoveryConfig{, type=AUTO_START, maxCount=6, windowInMinutes=60, retryGap=5, maxLifetimeCount=1024, components=null, recoveryTimestamp=1506871410051}
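
In the excerpt above, the host is declared lost at 18:22:39,870 and registers again at 18:23:30,045, about 50 seconds later. To check how often this lose-and-re-register cycle repeats over a longer log, the gaps can be extracted with a short script. This is a sketch that assumes the message wording and timestamp layout shown in the excerpt and an English locale for month names; `reconnect_gaps` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Timestamp layout as it appears in the excerpt, e.g. "01 Oct 2017 18:22:39,870".
TS = r"(\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2},\d{3})"
LOST_RE = re.compile(
    TS + r"\s+WARN\s+\[[^\]]+\] HeartbeatMonitor:\d+ - Heartbeat lost from host (\S+)")
REG_RE = re.compile(
    TS + r"\s+INFO\s+\[[^\]]+\] HostImpl:\d+ - Received host registration,"
    r"\s*host=\[hostname=[^,]+,fqdn=([^,]+),")


def parse_ts(text):
    """Parse a log timestamp; the digits after the comma are milliseconds."""
    return datetime.strptime(text, "%d %b %Y %H:%M:%S,%f")


def reconnect_gaps(log_text):
    """Return (host, seconds) pairs: time from each heartbeat loss to the
    next registration of the same host."""
    lost = [(parse_ts(ts), host) for ts, host in LOST_RE.findall(log_text)]
    registered = [(parse_ts(ts), host) for ts, host in REG_RE.findall(log_text)]
    gaps = []
    for lost_at, host in lost:
        later = [reg_at for reg_at, reg_host in registered
                 if reg_host == host and reg_at >= lost_at]
        if later:
            gaps.append((host, (min(later) - lost_at).total_seconds()))
    return gaps
```

Passing the contents of ambari-server.log to `reconnect_gaps` yields one entry per loss event; a steady stream of entries for a single host points at that host rather than at the server.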