Member since: 10-14-2015
Posts: 165
Kudos Received: 63
Solutions: 27
04-12-2017 03:54 PM
To help you, we'd need some more information:
- Which version of HDP are you actually running currently? Is it 2.5.3.0-37?
- Can you post the entire output from the install command?
- What is the content of /usr/hdp on the host which is having trouble?
03-09-2017 11:50 PM
Although this will technically work, there is a supported way of doing this. The Falcon alert definition can specify which property to monitor to determine whether to use HTTP or HTTPS:

{
  "name": "falcon_server_webui",
  "label": "Falcon Server Web UI",
  "description": "This host-level alert is triggered if the Falcon Server Web UI is unreachable.",
  "interval": 1,
  "scope": "ANY",
  "enabled": true,
  "source": {
    "type": "WEB",
    "uri": {
      "http": "{{falcon-env/falcon_port}}",
      "https": "{{falcon-env/falcon_port}}",
      "https_property": "{{hdfs-site/falcon.enableTLS}}",
      "https_property_value": "true",
      "default_port": 15000,
      "kerberos_keytab": "{{falcon-startup.properties/*.falcon.http.authentication.kerberos.keytab}}",
      "kerberos_principal": "{{falcon-startup.properties/*.falcon.http.authentication.kerberos.principal}}",
      "connection_timeout": 5
    },
    "reporting": {
      "ok": {
        "text": "HTTP {0} response in {2:.3f}s"
      },
      "warning": {
        "text": "HTTP {0} response from {1} in {2:.3f}s ({3})"
      },
      "critical": {
        "text": "Connection failed to {1} ({3})"
      }
    }
  }
}
Falcon should respect the port regardless of plaintext vs. encryption; this way, however, the alert framework will know whether to use plaintext or TLS.
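To make the https_property mechanism concrete, here is a minimal sketch (illustrative only, not Ambari's actual implementation) of how the framework can pick a scheme from the uri block once the referenced configuration value has been resolved:

```python
# Illustrative sketch, not Ambari's real code: choose plaintext vs. TLS
# based on the alert's uri block and the resolved cluster configuration.
def choose_scheme(uri, cluster_config):
    """Return 'https' when the config value named by https_property
    matches https_property_value; otherwise fall back to 'http'."""
    prop = uri.get("https_property")            # e.g. "{{hdfs-site/falcon.enableTLS}}"
    expected = uri.get("https_property_value")  # e.g. "true"
    actual = cluster_config.get(prop)
    return "https" if actual == expected else "http"

uri = {
    "https_property": "{{hdfs-site/falcon.enableTLS}}",
    "https_property_value": "true",
}
# Pretend the resolved cluster configuration says TLS is enabled.
config = {"{{hdfs-site/falcon.enableTLS}}": "true"}
scheme = choose_scheme(uri, config)  # "https"
```

If the property is missing or set to anything other than "true", the sketch falls back to plain HTTP, which mirrors the behavior described above.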
03-09-2017 04:38 PM
The logs indicate that the MySQL port isn't open on the host; your CLI tests indicate it is. One of them has to be wrong 🙂 Can you run grep jdbc /etc/ambari-server/conf/ambari.properties and see if the DB properties look correct?
03-03-2017 01:40 PM
2 Kudos
Yes, I believe that you can. Ambari Server ships with a folder of custom action scripts in /var/lib/ambari-server/resources/custom_actions/scripts, and you can have Ambari execute these scripts on the agents. For example, when you create a new cluster, Ambari "checks the hosts" for things like memory, OS compatibility, and potential problems; that check is the check_host.py script. It's invoked like:

{
  "RequestInfo": {
    "action": "check_host",
    "context": "Check host",
    "parameters": {
      "check_execute_list": "host_resolution_check",
      "jdk_location": "http://192.168.64.1:8080/resources/",
      "threshold": "20",
      "hosts": "c6401.ambari.apache.org,c6402.ambari.apache.org,c6403.ambari.apache.org"
    }
  },
  "Requests/resource_filters": [
    {
      "hosts": "c6401.ambari.apache.org,c6402.ambari.apache.org,c6403.ambari.apache.org"
    }
  ]
}
Where "action" is the name of the script. The action is defined in /var/lib/ambari-server/resources/custom_action_definitions/system_action_definitions.xml like so:

<actionDefinition>
  <actionName>check_host</actionName>
  <actionType>SYSTEM</actionType>
  <inputs/>
  <targetService/>
  <targetComponent/>
  <defaultTimeout>60</defaultTimeout>
  <description>General check for host</description>
  <targetType>ANY</targetType>
  <permissions>HOST.ADD_DELETE_HOSTS</permissions>
</actionDefinition>
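As a quick sketch of how you might assemble and submit that request body from a script (the payload mirrors the example above; submitting it to POST /api/v1/clusters/<cluster-name>/requests is my assumption about the endpoint, so verify against your Ambari version):

```python
import json

# Illustrative only: build the request body for invoking the custom
# "check_host" action; "action" must match <actionName> in the definition.
hosts = "c6401.ambari.apache.org,c6402.ambari.apache.org,c6403.ambari.apache.org"
payload = {
    "RequestInfo": {
        "action": "check_host",
        "context": "Check host",
        "parameters": {
            "check_execute_list": "host_resolution_check",
            "jdk_location": "http://192.168.64.1:8080/resources/",
            "threshold": "20",
            "hosts": hosts,
        },
    },
    "Requests/resource_filters": [{"hosts": hosts}],
}
# This serialized body would then be POSTed (e.g. via curl) to the
# assumed endpoint /api/v1/clusters/<cluster-name>/requests.
body = json.dumps(payload)
```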
02-17-2017 12:56 PM
Yes, my example is correct. There is no way to query directly for a specific property; you can only query by bean name. For alerts, however, we use a slash as a delimiter: the metric alert strips off "VolumeFailuresTotal", retrieves the "Hadoop:service=NameNode,name=FSNamesystemState" bean, and then extracts the "VolumeFailuresTotal" metric from it.
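The slash-delimiter convention described above can be sketched in a few lines (illustrative only; split_jmx_property is a hypothetical helper, not an Ambari function):

```python
# Everything after the final "/" is the metric name; everything before
# it is the JMX bean name to query.
def split_jmx_property(prop):
    bean, _, metric = prop.rpartition("/")
    return bean, metric

bean, metric = split_jmx_property(
    "Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal"
)
# bean   -> "Hadoop:service=NameNode,name=FSNamesystemState"
# metric -> "VolumeFailuresTotal"
```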
02-16-2017 10:05 PM
Sure, you'd need to execute a POST to create the new alert:

POST api/v1/clusters/<cluster-name>/alert_definitions

{
  "AlertDefinition": {
    "component_name": "NAMENODE",
    "description": "This service-level alert is triggered if the total number of volume failures across the cluster is greater than the configured critical threshold.",
    "enabled": true,
    "help_url": null,
    "ignore_host": false,
    "interval": 2,
    "label": "NameNode Volume Failures",
    "name": "namenode_volume_failures",
    "scope": "ANY",
    "service_name": "HDFS",
    "source": {
      "jmx": {
        "property_list": [
          "Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal"
        ],
        "value": "{0}"
      },
      "reporting": {
        "ok": {
          "text": "There are {0} volume failures"
        },
        "warning": {
          "text": "There are {0} volume failures",
          "value": 1
        },
        "critical": {
          "text": "There are {0} volume failures",
          "value": 1
        },
        "units": "Volume(s)"
      },
      "type": "METRIC",
      "uri": {
        "http": "{{hdfs-site/dfs.namenode.http-address}}",
        "https": "{{hdfs-site/dfs.namenode.https-address}}",
        "https_property": "{{hdfs-site/dfs.http.policy}}",
        "https_property_value": "HTTPS_ONLY",
        "kerberos_keytab": "{{hdfs-site/dfs.web.authentication.kerberos.keytab}}",
        "kerberos_principal": "{{hdfs-site/dfs.web.authentication.kerberos.principal}}",
        "default_port": 0,
        "connection_timeout": 5,
        "high_availability": {
          "nameservice": "{{hdfs-site/dfs.internal.nameservices}}",
          "alias_key": "{{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}",
          "http_pattern": "{{hdfs-site/dfs.namenode.http-address.{{ha-nameservice}}.{{alias}}}}",
          "https_pattern": "{{hdfs-site/dfs.namenode.https-address.{{ha-nameservice}}.{{alias}}}}"
        }
      }
    }
  }
}
This will create a new METRIC alert which runs every 2 minutes.
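A rough sketch (not Ambari's actual code) of how the reporting block in that definition might be evaluated once the metric value has been fetched from JMX; the evaluate function and threshold semantics are my assumptions for illustration:

```python
# Hypothetical evaluation of a METRIC alert's reporting block.
def evaluate(value, reporting):
    # Check critical before warning, since in the definition above both
    # thresholds are 1.
    if value >= reporting["critical"]["value"]:
        state = "CRITICAL"
    elif value >= reporting["warning"]["value"]:
        state = "WARNING"
    else:
        state = "OK"
    # The {0} placeholder in "text" is filled with the metric value.
    return state, reporting[state.lower()]["text"].format(value)

reporting = {
    "ok": {"text": "There are {0} volume failures"},
    "warning": {"text": "There are {0} volume failures", "value": 1},
    "critical": {"text": "There are {0} volume failures", "value": 1},
}
state, text = evaluate(2, reporting)  # ("CRITICAL", "There are 2 volume failures")
```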
02-16-2017 01:45 PM
2 Kudos
It depends on how you want to monitor the failed disks. You can always write your own script alert in Python to monitor the various disks. However, if the NameNode exposes a JMX metric for this, you can also create a much simpler metric alert. It seems that Hadoop:service=NameNode,name=NameNodeInfo/LiveNodes contains escaped JSON describing every DataNode; metric alerts can't parse that, but there is a simpler global failed-volume metric: Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal. You could use that metric to monitor failures. If either of these approaches sounds feasible, I can try to point you in the right direction for creating the alert.
02-14-2017 06:57 PM
Currently no, there is not. I believe there is a Jira open for changing how we send arguments to a script dispatcher so they are parameterized. Some fields, such as host name and component name, are not always present, so the current model simply omits them.
02-07-2017 02:03 PM
I think this goes back to the whole "dead is bad" theory. If I recall correctly, there was a metric Ambari once monitored on HBase for "Dead RegionServers". We incorrectly assumed that "dead" meant "bad". Because of this, alerts would trigger while decommissioning a RegionServer (and not go away for a long time). In the end, it was determined that this metric wasn't really something which needed alerting on. HDFS is a little different - I believe that a DataNode is marked as stale if it hasn't reported in within 30 seconds and as dead if it hasn't reported within 1 minute. The problem here is that the NameNode takes action in this case - it begins replicating blocks when it believes a DataNode is dead. So we alert on it, since it actively causes changes to the cluster data. The NameNode actually has metrics for differentiating "dead" from "decommissioning dead":

  "NumLiveDataNodes": 3,
  "NumDeadDataNodes": 1,
  "NumDecomLiveDataNodes": 0,
  "NumDecomDeadDataNodes": 1

In the above example, Ambari won't worry about dead nodes which are known to be decommissioning, but it will worry about those which are unexpected.
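The arithmetic behind "won't worry about decommissioning dead nodes" can be sketched like this (illustrative only; unexpected_dead is a hypothetical helper, not Ambari's actual check):

```python
# Dead nodes that are also known to be decommissioning are expected,
# so only the remainder would be worth alerting on.
def unexpected_dead(metrics):
    return metrics["NumDeadDataNodes"] - metrics["NumDecomDeadDataNodes"]

metrics = {
    "NumLiveDataNodes": 3,
    "NumDeadDataNodes": 1,
    "NumDecomLiveDataNodes": 0,
    "NumDecomDeadDataNodes": 1,
}
unexpected_dead(metrics)  # 0 -> nothing to alert on in this example
```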
02-07-2017 01:47 PM
1 Kudo
Can you specify which alert is being triggered? Most likely, it's an alert based on a master service's metrics. For example, if you decommission a DataNode and place that DataNode into Maintenance Mode, then Ambari won't file alerts for it. However, if the NameNode broadcasts a metric indicating there's a problem with the liveness of the DataNodes, then Ambari will display that alert. This is because the master service is running on a separate machine and doesn't care about the maintenance mode of the affected slave. Each service is different - some understand that a decommission means the node shouldn't be marked stale, and some still report the staleness metric for a short period of time.