
Modified Ambari Disk Alert Threshold Is Not Taking Effect


Re: Modified Ambari Disk Alert Threshold Is Not Taking Effect

Expert Contributor
@Jonathan Hurley: On the node where it shows 78%, here are the logs specific to HDFS disk usage:
/var/lib/ambari-agent/cache/alerts/definitions.json
"ignore_host": false,
  "name": "ambari_agent_disk_usage",
  "componentName": "AMBARI_AGENT",
  "interval": 1,
  "clusterId": 2,
  "uuid": "03d1a5aa-d3ea-41c6-b52c-187da3128f74",
  "label": "Host Disk Usage",
  "definitionId": 43,
  "source": {
  "path": "alert_disk_space.py",
  "type": "SCRIPT",
  "parameters": [
  {
  "display_name": "Minimum Free Space",
  "name": "minimum.free.space",
  "value": "5.0E9",
  "threshold": "WARNING",
  "units": "bytes",
  "type": "NUMERIC",
  "description": "The overall amount of free disk space left before an alert is triggered."
  },
  {
  "display_name": "Warning",
  "name": "percent.used.space.warning.threshold",
  "value": "0.9",
  "threshold": "WARNING",
  "units": "%",
  "type": "PERCENT",
  "description": "The percent of disk space consumed before a warning is triggered."
  },
  {
  "display_name": "Critical",
  "name": "percent.free.space.critical.threshold",
  "value": "0.95",
  "threshold": "CRITICAL",
  "units": "%",
  "type": "PERCENT",
  "description": "The percent of disk space consumed before a critical alert is triggered."
  }
  ]
  },
  "serviceName": "AMBARI",
  "scope": "HOST",
  "enabled": true,

  "description": "This host-level alert is triggered if the amount of 
disk space used goes above specific thresholds. The default threshold 
values are 90% for WARNING and 95% for CRITICAL"
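For context, the two PERCENT parameters above boil down to a percent-used comparison per mount. A quick shell sketch of that check for a single mount point (illustrative only - the actual logic lives in alert_disk_space.py, and the mount path here is just an example):

#!/usr/bin/env bash
# Illustrative re-creation of the percent-used comparison for one mount point.
# WARN/CRIT mirror the 0.9 and 0.95 values from definitions.json, as whole percents.
MOUNT="/"   # example mount point, not necessarily what the real alert inspects
WARN=90
CRIT=95
# df -P prints a "Capacity" column like "78%"; strip the % sign and compare.
used=$(df -P "$MOUNT" | awk 'NR==2 {sub("%", "", $5); print $5}')
if [ "$used" -ge "$CRIT" ]; then
    echo "CRITICAL: ${used}% used on $MOUNT"
elif [ "$used" -ge "$WARN" ]; then
    echo "WARNING: ${used}% used on $MOUNT"
else
    echo "OK: ${used}% used on $MOUNT"
fi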
==========================================
Ambari-agent Log
WARNING 2016-06-10 11:19:11,752 FileCache.py:162 - Error occurred during cache update. Error tolerate setting is set to true, so ignoring this error and continuing with current cache. Error details: Can not download file from url http://EN:8080/resources//host_scripts/.hash : <urlopen error [Errno -2] Name or service not known>
WARNING 2016-06-10 11:19:11,752 FileCache.py:162 - Error occurred during cache update. Error tolerate setting is set to true, so ignoring this error and continuing with current cache. Error details: Can not download file from url http://EN:8080/resources//stacks/HDP/2.0.6/hooks/.hash : <urlopen error [Errno -2] Name or service not known>
WARNING 2016-06-10 11:19:11,753 FileCache.py:162 - Error occurred during cache update. Error tolerate setting is set to true, so ignoring this error and continuing with current cache. Error details: Can not download file from url http://EN:8080/resources//common-services/ZOOKEEPER/3.4.5.2.0/package/.hash : <urlopen error [Errno -2] Name or service not known>
INFO 2016-06-10 11:19:20,473 Heartbeat.py:78 - Building Heartbeat: {responseId = 23210, timestamp = 1465528760473, commandsInProgress = False, componentsMapped = True}
INFO 2016-06-10 11:19:20,482 Controller.py:268 - Heartbeat response received (id = 23211)
WARNING 2016-06-10 11:19:24,683 base_alert.py:417 - [Alert][yarn_resourcemanager_webui] HA nameservice value is present but there are no aliases for {{yarn-site/yarn.resourcemanager.ha.rm-ids}}
WARNING 2016-06-10 11:19:24,694 base_alert.py:417 - [Alert][namenode_hdfs_blocks_health] HA nameservice value is present but there are no aliases for {{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}
WARNING 2016-06-10 11:19:24,700 base_alert.py:417 - [Alert][namenode_rpc_latency] HA nameservice value is present but there are no aliases for {{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}
WARNING 2016-06-10 11:19:24,702 base_alert.py:417 - [Alert][namenode_webui] HA nameservice value is present but there are no aliases for {{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}
WARNING 2016-06-10 11:19:24,703 base_alert.py:417 - [Alert][namenode_hdfs_pending_deletion_blocks] HA nameservice value is present but there are no aliases for {{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}
WARNING 2016-06-10 11:19:24,706 base_alert.py:417 - [Alert][datanode_health_summary] HA nameservice value is present but there are no aliases for {{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}
WARNING 2016-06-10 11:19:24,707 base_alert.py:417 - [Alert][namenode_hdfs_capacity_utilization] HA nameservice value is present but there are no aliases for {{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}
WARNING 2016-06-10 11:19:24,712 base_alert.py:417 - [Alert][namenode_directory_status] HA nameservice value is present but there are no aliases for {{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}
INFO 2016-06-10 11:19:30,483 Heartbeat.py:78 - Building Heartbeat: {responseId = 23211, timestamp = 1465528770483, commandsInProgress = False, componentsMapped = True}
INFO 2016-06-10 11:19:30,487 Controller.py:268 - Heartbeat response received (id = 23212)


==========================================
Alerts log
INFO 2016-06-10 11:19:00,401 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:19:00,402 logger.py:67 - call['test -w /run/user/1002'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:19:00,407 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,541 logger.py:67 - call['test -w /'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,546 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,546 logger.py:67 - call['test -w /dev'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,551 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,552 logger.py:67 - call['test -w /dev/shm'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,557 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,557 logger.py:67 - call['test -w /run'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,562 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,562 logger.py:67 - call['test -w /sys/fs/cgroup'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,567 logger.py:67 - call returned (1, '')
INFO 2016-06-10 11:20:00,568 logger.py:67 - call['test -w /data0'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,573 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,573 logger.py:67 - call['test -w /data1'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,578 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,578 logger.py:67 - call['test -w /data2'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,583 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,584 logger.py:67 - call['test -w /data3'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,589 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,589 logger.py:67 - call['test -w /data4'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,594 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,594 logger.py:67 - call['test -w /data5'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,599 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,600 logger.py:67 - call['test -w /run/user/1017'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,605 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,605 logger.py:67 - call['test -w /run/user/1014'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,610 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,610 logger.py:67 - call['test -w /run/user/1015'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,615 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,616 logger.py:67 - call['test -w /run/user/1016'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,621 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,621 logger.py:67 - call['test -w /run/user/1012'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,626 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,626 logger.py:67 - call['test -w /run/user/0'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,631 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,632 logger.py:67 - call['test -w /run/user/1009'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,637 logger.py:67 - call returned (0, '')
INFO 2016-06-10 11:20:00,637 logger.py:67 - call['test -w /run/user/1002'] {'sudo': True, 'timeout': 5}
INFO 2016-06-10 11:20:00,642 logger.py:67 - call returned (0, '')

Re: Modified Ambari Disk Alert Threshold Is Not Taking Effect

Super Collaborator

Can you attach the files in their entirety?


Re: Modified Ambari Disk Alert Threshold Is Not Taking Effect

Expert Contributor

@Jonathan Hurley - Attached the logs from the three files as a compressed text file:

amabri-alerts-log.zip


Re: Modified Ambari Disk Alert Threshold Is Not Taking Effect

Super Collaborator

OK, so the logs look fine. Something else is going on here that we're not seeing, possibly hidden in earlier logs. Could you run:

grep "03d1a5aa-d3ea-41c6-b52c-187da3128f74" /var/log/ambari-agent/ambari-agent*
grep "\[AlertScheduler\] Scheduling" /var/log/ambari-agent/ambari-agent.log  -A10 -B10
grep "ambari_agent_disk_usage" /var/log/ambari-agent/ambari-agent* -A10 -B1

And post the output of those commands here. Is there also anything suspicious in /var/log/ambari-agent/ambari-agent.out?


Re: Modified Ambari Disk Alert Threshold Is Not Taking Effect

Expert Contributor
@Jonathan Hurley: Below are the outputs.

Sorry, I was away for a few days.

[ambari@ip-172-27-3-43.ap-southeast-1.compute.internal]:/home/ambari $ grep "03d1a5aa-d3ea-41c6-b52c-187da3128f74" /var/log/ambari-agent/ambari-agent*
[ambari@ip-172-27-3-43.ap-southeast-1.compute.internal]:/home/ambari $ grep "\[AlertScheduler\] Scheduling" /var/log/ambari-agent/ambari-agent.log -A10 -B10
[ambari@ip-172-27-3-43.ap-southeast-1.compute.internal]:/home/ambari $ grep "ambari_agent_disk_usage" /var/log/ambari-agent/ambari-agent* -A10 -B1
[ambari@ip-172-27-3-43.ap-southeast-1.compute.internal]:/home/ambari $

I don't see anything suspicious in ambari-agent.out:

$ tail -100 /var/log/ambari-agent/ambari-agent.out
2016-06-07 18:35:24,670 - call['test -w /'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,676 - call returned (0, '')
2016-06-07 18:35:24,677 - call['test -w /dev'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,681 - call returned (0, '')
2016-06-07 18:35:24,682 - call['test -w /dev/shm'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,686 - call returned (0, '')
2016-06-07 18:35:24,687 - call['test -w /run'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,691 - call returned (0, '')
2016-06-07 18:35:24,692 - call['test -w /sys/fs/cgroup'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,696 - call returned (1, '')
2016-06-07 18:35:24,697 - call['test -w /data0'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,701 - call returned (0, '')
2016-06-07 18:35:24,702 - call['test -w /data1'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,706 - call returned (0, '')
2016-06-07 18:35:24,706 - call['test -w /data2'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,711 - call returned (0, '')
2016-06-07 18:35:24,711 - call['test -w /data3'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,716 - call returned (0, '')
2016-06-07 18:35:24,716 - call['test -w /data4'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,721 - call returned (0, '')
2016-06-07 18:35:24,721 - call['test -w /data5'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,726 - call returned (0, '')
2016-06-07 18:35:24,726 - call['test -w /run/user/1017'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,730 - call returned (0, '')
2016-06-07 18:35:24,730 - call['test -w /run/user/1014'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,735 - call returned (0, '')
2016-06-07 18:35:24,735 - call['test -w /run/user/1015'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,739 - call returned (0, '')
2016-06-07 18:35:24,739 - call['test -w /run/user/1016'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,744 - call returned (0, '')
2016-06-07 18:35:24,744 - call['test -w /run/user/1012'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,748 - call returned (0, '')
2016-06-07 18:35:24,748 - call['test -w /run/user/0'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,752 - call returned (0, '')
2016-06-07 18:35:24,753 - call['test -w /run/user/1009'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,757 - call returned (0, '')
2016-06-07 18:35:24,757 - call['test -w /run/user/1002'] {'sudo': True, 'timeout': 5}
2016-06-07 18:35:24,761 - call returned (0, '')

Re: Modified Ambari Disk Alert Threshold Is Not Taking Effect

Super Collaborator

Something is really wrong here - that should have shown some output containing the information you posted above. At this point, we need some clean logs to figure out what's wrong. I'd suggest this:

  • Stop the agent on the host that's having trouble: ambari-agent stop
  • Remove (or back up) the logs in /var/log/ambari-agent
  • Remove /var/lib/ambari-agent/cache/alerts/definitions.json
  • Start the agent: ambari-agent start

Wait a few minutes for the agent to heartbeat. Then ZIP up the regenerated definitions.json and the ambari-agent.log file and attach them; that should show us why the alert isn't being scheduled.
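If it helps, here are those steps as one shell session (the backup directory name is just an example, and paths assume the default agent layout; adjust if yours differ):

ambari-agent stop
mkdir -p /root/ambari-agent-log-backup            # example backup location
mv /var/log/ambari-agent/* /root/ambari-agent-log-backup/
mv /var/lib/ambari-agent/cache/alerts/definitions.json /root/ambari-agent-log-backup/
ambari-agent start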
