Member since: 10-14-2015
Posts: 165
Kudos Received: 63
Solutions: 27
04-12-2017 03:54 PM
To help you, we'd need some more information:
- Which version of HDP are you actually running currently? Is it 2.5.3.0-37?
- Can you post the entire output from the install command?
- What is the content of /usr/hdp on the host which is having trouble?
03-09-2017 11:50 PM
Although this will technically work, there is a supported way of doing this. The Falcon alert definition can specify which property to monitor to determine whether to use HTTP or HTTPS:

{
  "name": "falcon_server_webui",
  "label": "Falcon Server Web UI",
  "description": "This host-level alert is triggered if the Falcon Server Web UI is unreachable.",
  "interval": 1,
  "scope": "ANY",
  "enabled": true,
  "source": {
    "type": "WEB",
    "uri": {
      "http": "{{falcon-env/falcon_port}}",
      "https": "{{falcon-env/falcon_port}}",
      "https_property": "{{hdfs-site/falcon.enableTLS}}",
      "https_property_value": "true",
      "default_port": 15000,
      "kerberos_keytab": "{{falcon-startup.properties/*.falcon.http.authentication.kerberos.keytab}}",
      "kerberos_principal": "{{falcon-startup.properties/*.falcon.http.authentication.kerberos.principal}}",
      "connection_timeout": 5
    },
    "reporting": {
      "ok": {
        "text": "HTTP {0} response in {2:.3f}s"
      },
      "warning": {
        "text": "HTTP {0} response from {1} in {2:.3f}s ({3})"
      },
      "critical": {
        "text": "Connection failed to {1} ({3})"
      }
    }
  }
}
Falcon should respect the port regardless of plaintext vs. encryption; this way, however, the alert framework will know whether to use plaintext or TLS.
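To make the https_property mechanism concrete, here is a minimal sketch (illustrative only, not Ambari's actual implementation) of how the framework can pick a scheme from the uri block once the referenced configuration value has been resolved:

```python
# Illustrative sketch, not Ambari's real code: choose plaintext vs. TLS
# based on the alert's uri block and the resolved cluster configuration.
def choose_scheme(uri, cluster_config):
    """Return 'https' when the config value named by https_property
    matches https_property_value; otherwise fall back to 'http'."""
    prop = uri.get("https_property")            # e.g. "{{hdfs-site/falcon.enableTLS}}"
    expected = uri.get("https_property_value")  # e.g. "true"
    actual = cluster_config.get(prop)
    return "https" if actual == expected else "http"

uri = {
    "https_property": "{{hdfs-site/falcon.enableTLS}}",
    "https_property_value": "true",
}
# Pretend the resolved cluster configuration says TLS is enabled.
config = {"{{hdfs-site/falcon.enableTLS}}": "true"}
scheme = choose_scheme(uri, config)  # "https"
```

If the property is missing or set to anything other than "true", the sketch falls back to plain HTTP, which mirrors the behavior described above.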
03-09-2017 04:38 PM
The logs indicate that the MySQL port isn't open on the host; your CLI tests indicate it is. One of them has to be wrong 🙂 Can you run grep jdbc /etc/ambari-server/conf/ambari.properties and see if the DB properties look correct?
03-03-2017 01:40 PM
2 Kudos
Yes, I believe that you can. Ambari Server ships with a folder of custom action scripts in /var/lib/ambari-server/resources/custom_actions/scripts, and you can have Ambari execute these scripts on the agents. For example, when you create a new cluster, Ambari "checks the hosts" for things like memory, OS compatibility, and potential problems; that check is the check_host.py script. It's invoked like:

{
  "RequestInfo": {
    "action": "check_host",
    "context": "Check host",
    "parameters": {
      "check_execute_list": "host_resolution_check",
      "jdk_location": "http://192.168.64.1:8080/resources/",
      "threshold": "20",
      "hosts": "c6401.ambari.apache.org,c6402.ambari.apache.org,c6403.ambari.apache.org"
    }
  },
  "Requests/resource_filters": [
    {
      "hosts": "c6401.ambari.apache.org,c6402.ambari.apache.org,c6403.ambari.apache.org"
    }
  ]
}
Where "action" is the name of the script. The action is defined in /var/lib/ambari-server/resources/custom_action_definitions/system_action_definitions.xml like so:

<actionDefinition>
  <actionName>check_host</actionName>
  <actionType>SYSTEM</actionType>
  <inputs/>
  <targetService/>
  <targetComponent/>
  <defaultTimeout>60</defaultTimeout>
  <description>General check for host</description>
  <targetType>ANY</targetType>
  <permissions>HOST.ADD_DELETE_HOSTS</permissions>
</actionDefinition>
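As a quick sketch of how you might assemble and submit that request body from a script (the payload mirrors the example above; submitting it to POST /api/v1/clusters/<cluster-name>/requests is my assumption about the endpoint, so verify against your Ambari version):

```python
import json

# Illustrative only: build the request body for invoking the custom
# "check_host" action; "action" must match <actionName> in the definition.
hosts = "c6401.ambari.apache.org,c6402.ambari.apache.org,c6403.ambari.apache.org"
payload = {
    "RequestInfo": {
        "action": "check_host",
        "context": "Check host",
        "parameters": {
            "check_execute_list": "host_resolution_check",
            "jdk_location": "http://192.168.64.1:8080/resources/",
            "threshold": "20",
            "hosts": hosts,
        },
    },
    "Requests/resource_filters": [{"hosts": hosts}],
}
# This serialized body would then be POSTed (e.g. via curl) to the
# assumed endpoint /api/v1/clusters/<cluster-name>/requests.
body = json.dumps(payload)
```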
02-17-2017 12:56 PM
Yes, my example is correct. There is no way to query directly for a specific property; you can only query by bean name. For alerts, however, we use a slash as a delimiter: the metric alert strips off "VolumeFailuresTotal", retrieves the "Hadoop:service=NameNode,name=FSNamesystemState" bean, and then extracts the "VolumeFailuresTotal" metric from it.
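The slash-delimiter convention described above can be sketched in a few lines (illustrative only; split_jmx_property is a hypothetical helper, not an Ambari function):

```python
# Everything after the final "/" is the metric name; everything before
# it is the JMX bean name to query.
def split_jmx_property(prop):
    bean, _, metric = prop.rpartition("/")
    return bean, metric

bean, metric = split_jmx_property(
    "Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal"
)
# bean   -> "Hadoop:service=NameNode,name=FSNamesystemState"
# metric -> "VolumeFailuresTotal"
```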
02-16-2017 10:05 PM
Sure, you'd need to execute a POST to create the new alert:

POST api/v1/clusters/<cluster-name>/alert_definitions

{
  "AlertDefinition": {
    "component_name": "NAMENODE",
    "description": "This service-level alert is triggered if the total number of volume failures across the cluster is greater than the configured critical threshold.",
    "enabled": true,
    "help_url": null,
    "ignore_host": false,
    "interval": 2,
    "label": "NameNode Volume Failures",
    "name": "namenode_volume_failures",
    "scope": "ANY",
    "service_name": "HDFS",
    "source": {
      "jmx": {
        "property_list": [
          "Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal"
        ],
        "value": "{0}"
      },
      "reporting": {
        "ok": {
          "text": "There are {0} volume failures"
        },
        "warning": {
          "text": "There are {0} volume failures",
          "value": 1
        },
        "critical": {
          "text": "There are {0} volume failures",
          "value": 1
        },
        "units": "Volume(s)"
      },
      "type": "METRIC",
      "uri": {
        "http": "{{hdfs-site/dfs.namenode.http-address}}",
        "https": "{{hdfs-site/dfs.namenode.https-address}}",
        "https_property": "{{hdfs-site/dfs.http.policy}}",
        "https_property_value": "HTTPS_ONLY",
        "kerberos_keytab": "{{hdfs-site/dfs.web.authentication.kerberos.keytab}}",
        "kerberos_principal": "{{hdfs-site/dfs.web.authentication.kerberos.principal}}",
        "default_port": 0,
        "connection_timeout": 5,
        "high_availability": {
          "nameservice": "{{hdfs-site/dfs.internal.nameservices}}",
          "alias_key": "{{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}",
          "http_pattern": "{{hdfs-site/dfs.namenode.http-address.{{ha-nameservice}}.{{alias}}}}",
          "https_pattern": "{{hdfs-site/dfs.namenode.https-address.{{ha-nameservice}}.{{alias}}}}"
        }
      }
    }
  }
}
This will create a new METRIC alert which runs every 2 minutes.
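A rough sketch (not Ambari's actual code) of how the reporting block in that definition might be evaluated once the metric value has been fetched from JMX; the evaluate function and threshold semantics are my assumptions for illustration:

```python
# Hypothetical evaluation of a METRIC alert's reporting block.
def evaluate(value, reporting):
    # Check critical before warning, since in the definition above both
    # thresholds are 1.
    if value >= reporting["critical"]["value"]:
        state = "CRITICAL"
    elif value >= reporting["warning"]["value"]:
        state = "WARNING"
    else:
        state = "OK"
    # The {0} placeholder in "text" is filled with the metric value.
    return state, reporting[state.lower()]["text"].format(value)

reporting = {
    "ok": {"text": "There are {0} volume failures"},
    "warning": {"text": "There are {0} volume failures", "value": 1},
    "critical": {"text": "There are {0} volume failures", "value": 1},
}
state, text = evaluate(2, reporting)  # ("CRITICAL", "There are 2 volume failures")
```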
02-16-2017 01:45 PM
2 Kudos
It depends on how you want to monitor the failed disks. You can always write your own script alert in Python to monitor the various disks. However, if the NameNode exposes a JMX metric for this, you can also create a much simpler metric alert. It seems that Hadoop:service=NameNode,name=NameNodeInfo/LiveNodes contains escaped JSON describing every DataNode; metric alerts can't parse that, but there is a simpler global failed-volume metric: Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal. You could use that metric to monitor failures. If either of these approaches sounds feasible, I can try to point you in the right direction for creating the alert.
02-14-2017 06:57 PM
Currently no, there is not. I believe there is a Jira open for changing how we send arguments to a script dispatcher so they are parameterized. Some fields, such as host name and component name, are not always present, so the current model simply omits them.
02-07-2017 02:03 PM
I think this goes back to the whole "dead is bad" theory. If I recall correctly, there was a metric Ambari once monitored on HBase for "Dead RegionServers". We incorrectly assumed that "dead" meant "bad". Because of this, alerts would trigger while decommissioning a RegionServer (and not go away for a long time). In the end, it was determined that this metric wasn't really something which needed alerting on. HDFS is a little different - I believe that a DataNode is marked as stale if it hasn't reported in within 30 seconds and as dead if it hasn't reported within 1 minute. The problem here is that the NameNode takes action in this case - it begins replicating blocks when it believes a DataNode is dead. So we alert on it, since it actively causes changes to the cluster data. The NameNode actually has metrics for differentiating "dead" from "decommissioning dead":

  "NumLiveDataNodes": 3,
  "NumDeadDataNodes": 1,
  "NumDecomLiveDataNodes": 0,
  "NumDecomDeadDataNodes": 1

In the above example, Ambari won't worry about dead nodes which are known to be decommissioning, but it will worry about those which are unexpected.
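The arithmetic behind "won't worry about decommissioning dead nodes" can be sketched like this (illustrative only; unexpected_dead is a hypothetical helper, not Ambari's actual check):

```python
# Dead nodes that are also known to be decommissioning are expected,
# so only the remainder would be worth alerting on.
def unexpected_dead(metrics):
    return metrics["NumDeadDataNodes"] - metrics["NumDecomDeadDataNodes"]

metrics = {
    "NumLiveDataNodes": 3,
    "NumDeadDataNodes": 1,
    "NumDecomLiveDataNodes": 0,
    "NumDecomDeadDataNodes": 1,
}
unexpected_dead(metrics)  # 0 -> nothing to alert on in this example
```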
02-07-2017 01:47 PM
1 Kudo
Can you specify which alert is being triggered? Most likely, it's an alert based on a master service's metrics. For example, if you decommission a DataNode and place that DataNode into Maintenance Mode, then Ambari won't file alerts for it. However, if the NameNode broadcasts a metric indicating there's a problem with the liveness of the DataNodes, then Ambari will display that alert. This is because the master service is running on a separate machine and doesn't care about the maintenance mode of the affected slave. Each service is different - some understand that a decommission means the node shouldn't be marked stale, and some still report the staleness metric for a short period of time.