Member since: 10-14-2015
Posts: 165
Kudos Received: 63
Solutions: 27
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1114 | 12-11-2018 03:42 PM
 | 820 | 04-13-2018 09:17 PM
 | 681 | 02-08-2018 06:34 PM
 | 1877 | 01-24-2018 02:18 PM
 | 3031 | 10-11-2017 07:27 PM
06-29-2017
12:22 PM
The ID and the definition name are required fields when dealing with alerts, and are always returned. They don't hurt anything if you're not using them and don't incur a cost when retrieving them, so it's not really an issue if you just ignore them. All dates in Ambari are returned in the Java epoch format, which is the number of milliseconds since January 1, 1970 UTC. There are many tools which can convert these for you in a variety of languages and environments.
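For example, a minimal Python sketch of the conversion (the timestamp value here is just an illustration):

```python
from datetime import datetime, timezone

def ambari_ts_to_datetime(millis):
    """Convert an Ambari timestamp (milliseconds since the Unix epoch) to an aware datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

# 1498738920000 ms corresponds to 2017-06-29 12:22:00 UTC
print(ambari_ts_to_datetime(1498738920000))
```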
06-27-2017
06:25 PM
If the DN is indeed going down, an alert should trigger as well. Can you post your DN log here in its entirety so we can see why it might be failing?
06-27-2017
12:51 PM
1 Kudo
These are usually caused by the alerts framework doing a port check on the DataNode - any unknown wire communication causes them to dump out an exception - they're harmless. What makes you think that the DataNode is actually going down? If it was, you'd see it shutting down in the logs.
06-21-2017
09:37 PM
2 Kudos
You can use a script dispatcher to essentially add all kinds of custom functionality to Ambari alerts. The idea is that instead of sending an Email or an SNMP trap, Ambari would invoke a script that you wrote with the parameters of the alert. Your script could then contact any 3rd party system and do things like create tickets. https://cwiki.apache.org/confluence/display/AMBARI/Creating+a+Script-based+Alert+Dispatcher
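As a sketch of what such a dispatcher might look like in Python (the argument order shown is an assumption; check the wiki page above for the exact contract in your Ambari version, and the ticketing call is left as a placeholder):

```python
#!/usr/bin/env python
# Hypothetical Ambari script-based alert dispatcher. Ambari passes alert
# details as command-line arguments; the (name, label, service, state, text)
# ordering below is an assumption -- see the wiki for the exact contract.
import sys

def format_ticket(definition_name, definition_label, service_name, alert_state, alert_text):
    """Build a ticket summary a 3rd-party ticketing system could consume."""
    return "[%s] %s (%s): %s" % (alert_state, definition_label, service_name, alert_text)

if __name__ == "__main__" and len(sys.argv) >= 6:
    summary = format_ticket(*sys.argv[1:6])
    # Here you would call your ticketing system's API; we just print it.
    print(summary)
```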
06-14-2017
08:56 PM
In that case, I don't think you'll be able to vary the sender and the reply-to addresses. Is there a reason that you need to, though? Can you simply use the existing mail.smtp.from property to specify your reply address?
06-14-2017
06:50 PM
Ambari uses JavaMail under the hood to send email notifications. I think what you're asking is for the Sender/From and Reply-To addresses to be different. Although JavaMail does support this, I don't believe it supports this via properties (at least it's not documented that you can). While you can set "mail.smtp.from", you cannot set a different reply address in a similar manner. The JavaMail code, however, does reference a "mail.reply.to" property which you could try.
06-09-2017
12:46 PM
Aggregation of this data is really a metrics-related concern; the alerts framework won't aggregate data for you. It sounds like you've defined a host-level alert which can get the CPU usage of each DataNode. That host alert can trigger if the CPU usage is above a set threshold - say, 80% for WARN and 90% for CRITICAL. On top of it, you can create an "AGGREGATE" alert type which essentially looks for a percentage of problems across the cluster: when X% of the host-level alerts are triggered, the aggregate alert triggers. If you set this to 20%, then in a cluster of 10 hosts, when 2 of them have CPU levels above the thresholds, it will fire. If you wanted a single alert against an aggregate value instead, you'd need to feed the CPU usage data into Ambari Metrics somehow. Once it's in there, you could query it with another custom alert.
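As a rough sketch, an AGGREGATE definition wrapping a hypothetical host-level alert named "datanode_cpu_usage" might look something like this (the names are made up, and the field layout follows the pattern of the other alert definitions on this page; check a stack's alerts.json for the exact structure in your version):

```json
{
  "AlertDefinition": {
    "name": "datanode_cpu_aggregate",
    "label": "Aggregate DataNode CPU",
    "service_name": "HDFS",
    "interval": 1,
    "scope": "SERVICE",
    "enabled": true,
    "source": {
      "type": "AGGREGATE",
      "alert_name": "datanode_cpu_usage",
      "reporting": {
        "ok": { "text": "affected: [{1}], total: [{0}]" },
        "warning": { "text": "affected: [{1}], total: [{0}]", "value": 10 },
        "critical": { "text": "affected: [{1}], total: [{0}]", "value": 20 },
        "units": "%"
      }
    }
  }
}
```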
06-08-2017
12:50 PM
1 Kudo
Can you provide more details about what you want to be alerting on? When you say "script data from your datanodes", what are you referring to? In general, the only data which can be passed to a custom script alert are the configurations of the cluster. The script alert would then take whatever configurations it needs and then check "something". If you need to pull data from all of your DataNodes, that could be quite a bit of work depending on the size of the cluster.
05-30-2017
02:37 PM
UNKNOWN alerts happen when data like metrics can't be retrieved. That's what is happening here. The fact that you have a CRITICAL alert for the DataNode Web UI indicates that the DataNode is down.
05-25-2017
01:40 PM
You'll need to re-generate certificates on the Ambari Server since they are expired: https://community.hortonworks.com/articles/68799/steps-to-fix-ambari-server-agent-expired-certs.html
05-25-2017
12:34 PM
Heartbeats can be lost if an exception occurs while Ambari Server is handling the heartbeat. It can also happen if there is an SSL certificate issue between server and agent. Can you please attach the ambari-server log and a log from the ambari-agent?
05-19-2017
01:31 PM
If you could paste the exact error message you're getting, that would help. Ambari uses JavaMail to send alerts via SMTP. When creating/editing a notification in the UI, there's a section all the way at the bottom where you can "Add Property" - it should be right below the "TLS" checkbox. Here, you can supply any JavaMail property you need. For example, you could add the property "mail.smtp.ssl.trust" set to the value of "*" (without quotes).
04-24-2017
11:35 PM
You're hitting an issue with the Ambari Server upgrade from 2.4.2 to 2.5.0.3 - as part of this upgrade, we need to drop and re-create the primary key on the hostcomponentdesiredstate table. The error you're getting indicates that the primary key already exists and thus can't be added again. In your logs, you might see a statement like:

Unable to determine the primary key constraint name for hostcomponentdesiredstate

I'd like to know why this might be happening (it could be an artifact of how your Oracle DB is installed). In any event, you should be able to correct this by hand and re-run the upgrade:

ALTER TABLE hostcomponentdesiredstate DROP CONSTRAINT PK_hostcomponentdesiredstate;
ALTER TABLE hostcomponentdesiredstate ADD CONSTRAINT PK_hostcomponentdesiredstate PRIMARY KEY (id);

Now you can retry "ambari-server upgrade".
04-12-2017
03:54 PM
To help you, we'd need some more information:
- Which version of HDP are you actually running currently? Is it 2.5.3.0-37?
- Can you post the entire output from the install command?
- What is the content of /usr/hdp on the host which is having trouble?
04-11-2017
01:58 PM
1 Kudo
It looks like there might be a problem with the repository you are using. This error suggests that hdp-select doesn't exist in your repo:

resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/bin/yum -d 0 -e 0 -y install hdp-select' returned 1. Error: Nothing to do

Can you verify that /etc/yum.repos.d/HDP.repo exists and has the correct repository listed? You can also try a "yum clean all" on that host.
03-27-2017
12:58 PM
Props go to @Nate.

POST api/v1/clusters/<YOUR-CLUSTER-NAME>/requests

{
"RequestInfo": {
"command": "RESTART",
"context": "Restart all ZK on the selected hosts",
"operation_level": {
"level": "HOST",
"cluster_name": "YOUR-CLUSTER-NAME"
}
},
"Requests/resource_filters": [
{
"service_name": "ZOOKEEPER",
"component_name": "ZOOKEEPER_CLIENT",
"hosts_predicate": "HostRoles/component_name=ZOOKEEPER_CLIENT"
}
]
}
03-09-2017
11:50 PM
Although this will technically work, there is a supported way of doing this. The Falcon alert definition can specify the parameter to monitor for determining whether to use HTTP or HTTPS:

{
"name": "falcon_server_webui",
"label": "Falcon Server Web UI",
"description": "This host-level alert is triggered if the Falcon Server Web UI is unreachable.",
"interval": 1,
"scope": "ANY",
"enabled": true,
"source": {
"type": "WEB",
"uri": {
"http": "{{falcon-env/falcon_port}}",
"https": "{{falcon-env/falcon_port}}",
"https_property": "{{hdfs-site/falcon.enableTLS}}",
"https_property_value": "true",
"default_port": 15000,
"kerberos_keytab": "{{falcon-startup.properties/*.falcon.http.authentication.kerberos.keytab}}",
"kerberos_principal": "{{falcon-startup.properties/*.falcon.http.authentication.kerberos.principal}}",
"connection_timeout": 5
},
"reporting": {
"ok": {
"text": "HTTP {0} response in {2:.3f}s"
},
"warning": {
"text": "HTTP {0} response from {1} in {2:.3f}s ({3})"
},
"critical": {
"text": "Connection failed to {1} ({3})"
}
}
}
}
Falcon should respect the port, regardless of plaintext vs encryption. However, this way, the alert framework will understand whether to use plaintext or TLS.
03-09-2017
04:38 PM
The logs indicate that the port on the host for MySQL isn't open. Your CLI tests indicate it is. One of them has to be wrong 🙂 Can you do a grep jdbc /etc/ambari-server/conf/ambari.properties and see if the DB properties look correct?
03-03-2017
01:40 PM
2 Kudos
Yes, I believe that you can. There is a folder which ships with Ambari Server in /var/lib/ambari-server/resources/custom_actions/scripts. You can have Ambari execute these scripts on the agents. For example, when you create a new cluster, Ambari "checks the hosts" for things like memory, OS, and known problems. This is the check_host.py script, and it's invoked like:

{
"RequestInfo": {
"action": "check_host",
"context": "Check host",
"parameters": {
"check_execute_list": "host_resolution_check",
"jdk_location": "http://192.168.64.1:8080/resources/",
"threshold": "20",
"hosts": "c6401.ambari.apache.org,c6402.ambari.apache.org,c6403.ambari.apache.org"
}
},
"Requests/resource_filters": [
{
"hosts": "c6401.ambari.apache.org,c6402.ambari.apache.org,c6403.ambari.apache.org"
}
]
}
Where "action" is the name of the script. The action is defined in /var/lib/ambari-server/resources/custom_action_definitions/system_action_definitions.xml like so:

<actionDefinition>
<actionName>check_host</actionName>
<actionType>SYSTEM</actionType>
<inputs/>
<targetService/>
<targetComponent/>
<defaultTimeout>60</defaultTimeout>
<description>General check for host</description>
<targetType>ANY</targetType>
<permissions>HOST.ADD_DELETE_HOSTS</permissions>
</actionDefinition>
02-28-2017
02:41 PM
I think that the cluster must already be installed for the cluster name to show up in those files. Once it's installed, any call to recommendations should place it in hosts.json.
02-24-2017
02:05 PM
1 Kudo
It looks like the cluster name is stored in the hosts.json file. You should be able to access it like this:

for host in hosts["items"]:
    cluster_name = host["Hosts"]["cluster_name"]
02-17-2017
12:56 PM
Yes, my example is correct. There is no way to query directly for a specific property; you can only query by bean name. However, for alerts, we use a slash as a delimiter. The metric alert will remove the "VolumeFailuresTotal" and retrieve the "Hadoop:service=NameNode,name=FSNamesystemState" bean. Then it will extract the "VolumeFailuresTotal" metric.
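That split-and-extract logic can be illustrated with a short Python sketch (the response dict below is a stubbed-down stand-in for what the NameNode's /jmx endpoint returns):

```python
def extract_metric(jmx_response, metric_path):
    """Split an Ambari-style metric path ("<bean name>/<property>") on the
    last slash and pull the property out of a parsed /jmx response."""
    bean_name, _, prop = metric_path.rpartition("/")
    for bean in jmx_response.get("beans", []):
        if bean.get("name") == bean_name:
            return bean.get(prop)
    return None

# Stubbed /jmx payload; a real response carries many more beans and fields.
response = {"beans": [
    {"name": "Hadoop:service=NameNode,name=FSNamesystemState",
     "VolumeFailuresTotal": 0},
]}
print(extract_metric(
    response,
    "Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal"))
```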
02-16-2017
10:05 PM
Sure, you'd need to execute a POST to create the new alert:

POST api/v1/clusters/<cluster-name>/alert_definitions

{
"AlertDefinition": {
"component_name": "NAMENODE",
"description": "This service-level alert is triggered if the total number of volume failures across the cluster is greater than the configured critical threshold.",
"enabled": true,
"help_url": null,
"ignore_host": false,
"interval": 2,
"label": "NameNode Volume Failures",
"name": "namenode_volume_failures",
"scope": "ANY",
"service_name": "HDFS",
"source": {
"jmx": {
"property_list": [
"Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal"
],
"value": "{0}"
},
"reporting": {
"ok": {
"text": "There are {0} volume failures"
},
"warning": {
"text": "There are {0} volume failures",
"value": 1
},
"critical": {
"text": "There are {0} volume failures",
"value": 1
},
"units": "Volume(s)"
},
"type": "METRIC",
"uri": {
"http": "{{hdfs-site/dfs.namenode.http-address}}",
"https": "{{hdfs-site/dfs.namenode.https-address}}",
"https_property": "{{hdfs-site/dfs.http.policy}}",
"https_property_value": "HTTPS_ONLY",
"kerberos_keytab": "{{hdfs-site/dfs.web.authentication.kerberos.keytab}}",
"kerberos_principal": "{{hdfs-site/dfs.web.authentication.kerberos.principal}}",
"default_port": 0,
"connection_timeout": 5,
"high_availability": {
"nameservice": "{{hdfs-site/dfs.internal.nameservices}}",
"alias_key": "{{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}",
"http_pattern": "{{hdfs-site/dfs.namenode.http-address.{{ha-nameservice}}.{{alias}}}}",
"https_pattern": "{{hdfs-site/dfs.namenode.https-address.{{ha-nameservice}}.{{alias}}}}"
}
}
}
}
}
This will create a new METRIC alert which runs every 2 minutes.
02-16-2017
01:45 PM
2 Kudos
It depends on how you want to monitor the failed disks. You can always write your own script alert in Python to monitor the various disks. However, if the NameNode exposes a JMX metric for this, you can create a much simpler metric alert. It seems like Hadoop:service=NameNode,name=NameNodeInfo/LiveNodes contains escaped JSON describing every DataNode; metric alerts can't parse that, but there is a simpler global failed-volume metric: Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal. You could use that metric to monitor failures. If either of these approaches sounds feasible, I can try to point you in the right direction for creating the alert.
02-14-2017
06:57 PM
Currently no, there is not. I believe there is a Jira open for changing how we send arguments to a script dispatcher so they are parameterized. Some fields, such as host name and component name, are not always present, so the current model simply omits them.
02-07-2017
02:03 PM
I think this goes back to the whole "dead is bad" theory. If I recall correctly, there was a metric Ambari once monitored on HBase - it was for "Dead RegionServers". We incorrectly assumed that "dead" was "bad". Because of this, while decommissioning a RegionServer, alerts would trigger (and not go away for a long time). In the end, it was determined that this metric wasn't really something which needed alerting on. HDFS is a little different - I believe that a DataNode is marked as stale if it hasn't reported in within 30 seconds and marked as dead if it hasn't reported within 1 minute. The problem here is that the NameNode takes action in this case - it will begin replicating blocks when it believes a DataNode is dead. So we alert on it, since it's something that is actively causing changes in the cluster data. The NameNode actually has metrics for differentiating "dead" vs "decommissioning dead":

"NumLiveDataNodes": 3,
"NumDeadDataNodes": 1,
"NumDecomLiveDataNodes": 0,
"NumDecomDeadDataNodes": 1

In the above example, Ambari won't worry about dead nodes which are known to be decommissioning, but we will worry about those which are unexpected.
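The distinction boils down to simple arithmetic, sketched here in Python (assuming NumDeadDataNodes includes decommissioning nodes, as the example above implies):

```python
def unexpected_dead_datanodes(metrics):
    """Dead DataNodes that are NOT part of a known decommission."""
    return metrics["NumDeadDataNodes"] - metrics["NumDecomDeadDataNodes"]

metrics = {"NumLiveDataNodes": 3, "NumDeadDataNodes": 1,
           "NumDecomLiveDataNodes": 0, "NumDecomDeadDataNodes": 1}
print(unexpected_dead_datanodes(metrics))  # -> 0: the one dead node is decommissioning
```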
02-07-2017
01:47 PM
1 Kudo
Can you specify which alert is being triggered? Most likely, it's an alert based on a master service's metrics. For example, if you decommission a DataNode and place that DataNode into Maintenance Mode, then Ambari won't fire alerts for it. However, if the NameNode broadcasts a metric indicating there's a problem with the liveliness of the DataNodes, then Ambari will display that alert. This is because the master service is running on a separate machine and doesn't care about the maintenance mode of the affected slave. Each service is different - some services understand that a decommission means the node shouldn't be considered stale, and some still report the staleness metric for a short period of time.
02-06-2017
01:19 PM
You can define an alert dispatcher (a python script) which Ambari will invoke when alerts fire: https://cwiki.apache.org/confluence/display/AMBARI/Creating+a+Script-based+Alert+Dispatcher
02-06-2017
01:19 PM
1 Kudo
If you're asking if you can take Ambari Alerts and publish them using PutSlack, I think what you're looking for is a script dispatcher: https://cwiki.apache.org/confluence/display/AMBARI/Creating+a+Script-based+Alert+Dispatcher When Ambari triggers an alert, it can invoke a custom python script. We used to use this for dispatching SNMP notifications before it was its own supported type. But you could use it for anything, including pushing data to a URL.
02-01-2017
12:58 PM
The Alert History endpoint can provide you what you need: https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/alerts.md#alert-history This allows you to query for alerts through the REST API by name, service, criticality, etc.
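For instance, a small Python helper to build such a query URL (the predicate field names follow the AlertHistory/* convention in the linked docs; the host and cluster names here are placeholders):

```python
def alert_history_url(base, cluster, definition_name=None, state=None):
    """Build an Alert History REST query, optionally filtered by
    alert definition name and/or state (OK, WARNING, CRITICAL, UNKNOWN)."""
    url = "%s/api/v1/clusters/%s/alert_history" % (base, cluster)
    predicates = []
    if definition_name:
        predicates.append("AlertHistory/definition_name=%s" % definition_name)
    if state:
        predicates.append("AlertHistory/state=%s" % state)
    if predicates:
        url += "?" + "&".join(predicates)
    return url

print(alert_history_url("http://ambari.example.com:8080", "MyCluster", state="CRITICAL"))
```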