Member since
08-20-2018
26
Posts
7
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
6488 | 11-27-2018 02:55 PM |
12-13-2021
03:06 AM
1 Kudo
When you install Cloudera Manager you can configure the mail server you will use with the Alert Publisher. However, if you need to change these settings, you can do so under the Alert Publisher section of the Management Services configuration tab. Through Alert Publisher we can get the alerts for a different type of service status like Bad, Warning, Good. Under the Alert Publisher role of the Cloudera Manager Management Service, you can configure email or SNMP delivery of alert notifications and you can also configure a custom script that runs in response to an alert. The information on how to configure the alert can be found in the official documentation of Cloudera. Please check the provided link [1]. However, we will discuss some of the common issues faced while configuring or receiving the alert delivery. Common Issues Over Alert Publisher-> 1- Sometimes we are not able to receive the alert to the SMTP/SNMP server end. While checking the alert-publisher logs, the below trace of ERROR can be seen. ERROR org.apache.camel.processor.DefaultErrorHandler: Failed delivery for (MessageId: ID-uxxxxxxxx on ExchangeId: ID-xxxxxxx). Exhausted after delivery attempt: 1 caught: javax.mail.MessagingException: Exception reading response;
nested exception is:
java.net.SocketTimeoutException: Read timed out To troubleshoot these issues, we need to first verify that the connectivity between the server and SMTP/SNMP host is good. -> telnet <hostname> port If connectivity is good, then we can either enable the debug logs and check for more loggings in alert publisher. How to enable DEBUG-> Cloudera Manager Service -Alert Publisher - Configurations
Under the advanced java configuration for alert publisher, please append the below configuration
"-Djavax.net.debug=all"
Then go to logs
Set the log level to debug in alert publisher We can also verify the tcp dump and check if the message are getting accepted from CM server. For Example-> tcpdump -i any -s 100000 -w ~/alertpub.out port <port> 2- The alert publisher fails to send out the alert due to the unavailability of the credentials information. ERROR org.apache.camel.processor.DefaultErrorHandler: Failed delivery for (MessageId: ID-xxxxxxxxx on ExchangeId: ID-xxxxxx). Exhausted after delivery attempt: 1 caught: javax.mail.AuthenticationFailedException: failed to connect, no password specified?
javax.mail.AuthenticationFailedException: failed to connect, no password specified?
at javax.mail.Service.connect(Service.java:325) Make sure credential information has been given over to the alert publisher configuration. CM > CMS > Configuration > Alerts: Mail Server Hostname, Alerts: Mail Server Username, Alerts: Mail Server Password [1] https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_ag_email.html#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--7d1d https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_ag_snmp.html#xd_583c10bfdbd326ba-3ca24a24-13d80143249--7f27 https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_ag_alert_script.html#concept_sfx_lkw_yt
... View more
Labels:
12-13-2021
12:46 AM
@Saraali You need to provide the credentials to get access to the repository. You can follow the documentation [1] for the whole process to install CM-server, agents. To access the binaries at the locations below, you must first have an active subscription agreement and obtain a license key file along with the required authentication credentials (username and password). [1] https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/poc_run_installer.html https://docs.cloudera.com/cdp-private-cloud-base/7.1.3/installation/topics/cdpdc-configure-repository.html Additonal Information-> https://www.cloudera.com/downloads/paywall-expansion.html
... View more
12-13-2021
12:39 AM
1 Kudo
@jh1688 Exit Error Code 126 indicate that there is an issue with the permission of the executable files. If you can check/provide the stderr.log snapshot, it might give you the hint to which file it is trying to execute. Note-> Please accept the answer, if resolve the query
... View more
01-22-2021
02:54 AM
8 Kudos
Ambari Metrics System (AMS) collects, aggregates, and serves Hadoop and system metrics in Ambari-managed clusters. Basically AMS has four components Metrics Monitors, Hadoop Sinks, Metrics Collector, and Grafana. Metrics Monitors on each host in the cluster collect system-level metrics and publish to the Metrics Collector. Hadoop Sinks plug in to Hadoop components to publish Hadoop metrics to the Metrics Collector. The Metrics Collector is a daemon that runs on a specific host in the cluster and receives data from the registered publishers, the Monitors and the Sinks. Grafana is a daemon that runs on a specific host in the cluster and serves pre-built dashboards for visualizing metrics collected in the Metrics Collector. In this article, we will be checking how we can troubleshoot AMS-related issues effectively. There are multiple issues which arises in AMS that leads to different types of discrepancy such as collector crashing, Metrics not available, Grafana startup failure and time-range metric issues in Grafana. We will be checking the step by step process for multiple issues in AMS environment and how we can troubleshoot them effectively. Issues Arises in AMS Collector Not coming up or crashing frequently Metrics Not available in Ambari UI Grafana Metric related issues Collector Not coming up or crashing frequently This is the most general problem with the AMC. There could be multiple reasons for the frequent or intermittent crash of AMC. Here is how we need to approach this step by step and debug to resolve the issue. First check for Binaries of AMS which should be identical as the current Ambari version. [ Verify on all the hosts where metrics monitor is running ] [root@c1236-node4 ~]# rpm -qa|grep ambari-metrics
ambari-metrics-monitor-2.7.5.0-72.x86_64
ambari-metrics-collector-2.7.5.0-72.x86_64
ambari-metrics-hadoop-sink-2.7.5.0-72.x86_64 If not, please upgrade the AMS accordingly. Follow Upgrading Ambari Metrics for the upgrade process. If Zookeeper-related issues being observed in AMC logs and getting below logging. 2018-03-21 15:18:14,996 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop.datalonga.com/10.XX.XX.XX:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-21 15:18:14,997 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) Check collector-gc.log and gc.log files. If heap space is getting occupied completely or almost full. [ Underlined part ]. If similar is getting observed, we need to increase the heap space for collector. 2020-09-28T06:27:47.846+0000: 503090.803: [GC (Allocation Failure) 2020-09- 28T06:27:47.846+0000: 503090.804: [ParNew: 145749K->1896K(157248K), 0.0099317 secs] 506788K->506400K(506816K), 0.0103397 secs] [Times: user=0.13 sys=0.03, real=0.01 secs] We can also clear up Zookeeper data and later restart AMS from Ambari. For more information, check Cleaning up Ambari Metrics System Data Sometimes in logs we see below error which tells that the default port 6188 is being already occupied. INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules 2017-08-04 15:30:36,965 INFO org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException java.net.BindException: Port in use: xxxxxx:6188at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:919) at In such scenario, Check which process is using that port before start of AMC. netstat -tnlpa | grep 6188 Go to Ambari UI > Ambari Metrics > Configs (tab) > Advanced (child tab) > navigate to Advanced ams-site and search for the following property: timeline.metrics.service.webapp.address = 0.0.0.0:6188 You can change that port to something else to avoid port conflict. then try restarting AMS. Most of the cases, we will observe the ASYNC_PROCESS logging in AMC log. In that case, check the hbase-ams-master-<hostname>.log [ EMBEDDED Mode ] and hbase-ams-regionserver-<hostname>.log file [ DISTRIBUTED Mode]. You will observe the following log lines frequently in these logs. WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 4058ms GC pool 'ParNew' had collection(s): count=1 time=2415ms
WARN [RpcServer.FifoWFPBQ.default.handler=32,queue=2,port=61320] ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)",
"starttimems":1609769073845,"responsesize":739217,"method":"Multi","processingtimems":10003,"client":"10.118.5.94:51114","queuetimems":0,"class":"HRegionServer"} WARN [7,queue=0,port=16020] regionserver.RSRpcServices - Large batch operation detected
(greater than 5000) (HBASE-18023). Requested Number of Rows: 7096 Client: xxx In such cases, Check if services are generating way more than metrics which AMC is not able to handle at the current configuration. To rectify these type of issues, we can use below pointers to stabilize the collector. Check the metadata output and identify which all services are generating (>15k-20k) metrics. http://<ams-host>:6188/ws/v1/timeline/metrics/metadata We can increase the size of heap space of region-server (In case of Distributed) and hbase- master (In case of Embedded). [ Check the gc.log and collector.log file to help understand the current GC utilization]. If the services are generating large number of metrics, we can limit them by implementing whitelisting or blacklisting and check if AMS is getting stabilized. Ambari Metrics - Whitelisting In some cases, we will be seeing the following logs in region-server logs, which indicate that there is some issue with the region-server opening. ERROR [RpcServer.FifoWFPBQ.priority.handler=19,queue=1,port=61320] regionserver.HRegionServer: Received CLOSE for a region which is not online, and we're not opening. To mitigate such issues, try the following steps and check the status of AMC. Connect with zookeeper CLI->
/usr/lib/ams-hbase/bin/hbase --config /etc/ams-hbase/conf zkcli
Remove all the znode->
rmr /ams-hbase-secure
Create a new znode in AMS configuration -> zookeeper.znode.parent and restart AMS Metrics not available in Ambari UI > If metrics are not showing on Ambari UI, then check if there is any present issues with the collector. I am providing some of the known issues for the latest version of Ambari where graphs are not available even though AMC is running fine. In latest version of Ambari-2.7.3 and above, we have observed the NIFI, Ambari_Metrics and Kafka does not show the metrics. Here are the workaround to mitigate the issue. a) NIFI->
1) vi /var/lib/ambari-server/resources/common-services/NIFI/1.0.0/metainfo.xml
Change <timelineAppid>NIFI</timelineAppid> to <timelineAppid>nifi</timelineAppid> at two places.
2) Replace "<timelineAppid>NIFI</timelineAppid>" with "<timelineAppid>nifi</timelineAppid>" in /var/lib/ambari-server/resources/mpacks/ hdf-ambari-mpack-3.5.1.*/common-services/NIFI/1.0.0/metainfo.xml file
b) Ambari_Metrics:->
Replace "<timelineAppid>AMS-HBASE</timelineAppid>" with "<timelineAppid>ams-hbase</timelineAppid>" in /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/stacks/HDF/3.2.b/services/AMBARI_METRICS/metainfo.xml file
c) Kafka:->
add "<timelineAppid>kafka_broker</timelineAppid>" after "<name>KAFKA_BROKER</name>" in /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/stacks/HDF/3.3/services/KAFKA/metainfo.xml file
After above Change restart ambari-server. Grafana Metric related issues > Sometimes we see multiple issues with the Grafana such as Metrics not available, Time-range metrics are not being shown, inaccurate information in graphs. To check the Grafana issues, perform the following sequence. Check if AMC is running fine and you could see the aggregation happening in AMC logs. 2021-01-22 09:22:23,051 INFO TimelineMetricHostAggregatorMinute: End aggregation cycle @ Fri Jan 22 09:22:23 UTC 2021
2021-01-22 09:23:20,780 INFO TimelineClusterAggregatorSecond: Started Timeline aggregator thread @ Fri Jan 22 09:23:20 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Last Checkpoint read : Fri Jan 22 09:20:00 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Rounded off checkpoint : Fri Jan 22 09:20:00 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Last check point time: 1611307200000, lagBy: 200 seconds.
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Start aggregation cycle @ Fri Jan 22 09:23:20 UTC 2021, startTime = Fri Jan 22 09:20:00 UTC 2021, endTime = Fri Jan 22 09:22:00 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Skipping aggregation for metric patterns : sdisk\_%,boottime
2021-01-22 09:23:23,129 INFO TimelineClusterAggregatorSecond: Saving 23764 metric aggregates.
2021-01-22 09:23:23,868 INFO TimelineClusterAggregatorSecond: End aggregation cycle @ Fri Jan 22 09:23:23 UTC 2021 2. If data is not present in all the graphs then check for the metadata output and look for the services metrics in meta output. 3. If data is not present in only few graphs then please check the meta-output for that particular service. Example: If data is not present over hive-server2 metrics: curl -v --insecure https://<ams-host>:6188/ws/v1/timeline/metrics/metadata?appId=hiveserver2 | python -m json.tool > hiveserver2.txt 4. Check if any whitelisting or blacklisting has been applied in the configuration, which might be stopping AMS to process those metrics. Also check for any configuration such as "Disable Minute host aggregator" is enabled in configs. 5. There are some known issues on the latest version of Ambari over grafana. I am listing few of them here. AMBARI-25570 AMBARI-25563 AMBARI-25383 AMBARI-25457 Happy Learning!!!!!!!
... View more
Labels:
02-04-2019
07:08 AM
2 Kudos
This article is related to creating SNMP alerts through the custom script and how to troubleshoot.
Install SNMP on sandbox or local environment. yum install net-snmp net-snmp-utils net-snmp-libs –y
Change the script /etc/snmp/snmptrapd.conf file and include "disableAuthorization yes". # Example configuration file for snmptrapd
#
# No traps are handled by default, you must edit this file!
#
# authCommunity log,execute,net public
# traphandle SNMPv2-MIB::coldStart /usr/bin/bin/my_great_script cold
disableAuthorization yes
To understand why this change required, Refer to the Access Control section of this link.
Copy APACHE-AMBARI-MIB.txt file to /usr/share/snmp/mibs folder.
cp /var/lib/ambari-server/resources/APACHE-AMBARI-MIB.txt /usr/share/snmp/mibs
Note: Ensure it has proper permission
Startup a simple SNMP trap daemon to log traps to the /tmp/traps.log file for testing purposes. nohup snmptrapd -m ALL -A -n -Lf /tmp/traps.log &
Invoke a test trap to ensure that the snmptrapd is logging appropriately to /tmp/traps.log and the Apache Ambari MIB is being respected. snmptrap -v 2c -c public localhost '' APACHE-AMBARI-MIB::apacheAmbariAlert alertDefinitionName s "definitionName" alertDefinitionHash s "definitionHash" alertName s "name" alertText s "text" alertState i 0 alertHost s "host" alertService s "service" alertComponent s "component"
You should be able to see the following traps in /tmp/traps.log. 2019-02-04 06:24:38 UDP: [127.0.0.1]:59238->[127.0.0.1]:162 [UDP: [127.0.0.1]:59238->[127.0.0.1]:162]:
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (224395199) 25 days, 23:19:11.99SNMPv2-MIB::snmpTrapOID.0 = OID: APACHE-AMBARI-MIB::apacheAmbariAlertAPACHE-AMBARI-MIB::alertDefinitionName = STRING: "definitionName"APACHE-AMBARI-MIB::alertDefinitionHash = STRING: "definitionHash"APACHE-AMBARI-MIB::alertName = STRING: "name"APACHE-AMBARI-MIB::alertText = STRING: "text"APACHE-AMBARI-MIB::alertState = INTEGER: ok(0)APACHE-AMBARI-MIB::alertHost = STRING: "host"APACHE-AMBARI-MIB::alertService = STRING: "service"APACHE-AMBARI-MIB::alertComponent = STRING: "component"
Now, we will be creating the script that will be used by Ambari for sending SNMP traps. Create a file that contains the script, named /tmp/snmp_mib_script.sh. in this example. It is recommended to create this file in a more permanent directory for actual use. Format of the Alert script: #!/bin/bash
HOST=localhost
COMMUNITY=public
STATE=0
if [[ $4 == "OK" ]]; then
STATE=0
elif [[ $4 == "UNKNOWN" ]]; then
STATE=1
elif [[ $4 == "WARNING" ]]; then
STATE=2
elif [[ $4 == "CRITICAL" ]]; then
STATE=3
fi
/usr/bin/snmptrap -v 2c \
-c $COMMUNITY $HOST '' APACHE-AMBARI-MIB::apacheAmbariAlert \
alertDefinitionId i 0 \
alertDefinitionName s "$1" \
alertDefinitionHash s "n/a" \
alertName s "$2" \
alertText s "$5" \
alertState i $STATE \
alertHost s `hostname` \
alertService s "$3"
Note: Ensure to change the host with the desired sandbox or host where you want to send the traps.
Reference: ambari-commits mailing list archives
Add the following line to the /etc/ambari-server/conf/ambari.properties file: org.apache.ambari.contrib.snmp.script=/tmp/snmp_mib_script.sh
Restart the Ambari-server
Now, we will use the following API call to add an alert target for the script. curl -u "admin_user":"admin_password" -H 'X-Requested-By: ambari' http://<Ambari_host>:<PORT>/api/v1/alert_targets -d '{ "AlertTarget" : { "name" : "SNMP_MIB", "description" : "SNMP MIB Target", "notification_type" : "ALERT_SCRIPT","global": true,"properties": {"ambari.dispatch-property.script": "org.apache.ambari.contrib.snmp.script"}}}]}' If you want to update the alert_states as critical only, you can add below lines in script. {
"AlertTarget": {
"alert_states": ["CRITICAL"]
}
}
You can check the alert notification that has been created in Ambari UI. Note: Check if any snmptrapd process is currently running. If so, cancel that process. Example: # ps -ef|grep snmptrapd
root 15286 15087 0 06:49 pts/0 00:00:00 grep --color=auto snmptrapd
root 21610 1 0 Jan24 ? 00:00:59 snmptrapd -m ALL -A -n -Lf /tmp/traps.log
#kill -9 21610
Troubleshooting
Whenever any alert will be triggered you will be able to see below lines in Ambari-server.log. INFO [AlertNoticeDispatchService RUNNING] AlertNoticeDispatchService:279 - There are xx pending alert notices about to be dispatched...
This means Ambari is successfully sending the SNMP traps to the mentioned host.
At the same timestamp, you can check the entry in ambari-alert.log file.
For any change in alert state, there will be an entry that will get recorded in the alert_notice table in the database. If there is no entry available, check the notification set in alert_target table.
Hope this article will help!!!!!
Reference: https://github.com/apache/ambari/tree/trunk/contrib/alert-snmp-mib
... View more
Labels:
12-10-2018
07:00 AM
3 Kudos
When an ambari agent starts, it bootstraps with the ambari server via registration. The server sends information to the agent about the components that have been enabled for auto start along with the other auto start properties in ambari.properties. The agent compares the current state of these components against the desired state, to determine if these components are to be installed, started, restarted or stopped. These are the values ambari-server will send to ambari-agent by default unless configured in cluster-env.xml file. "recovery_max_count": "6", "recovery_lifetime_max_count": "1024", "recovery_type": "AUTO_START", "recovery_window_in_minutes": "60", recovery_lifetime_max_count ---- The maximum number of recovery attempts of a failed component during the lifetime of an Ambari Agent instance. This is reset when the Ambari Agent is restarted. recovery_window_in_minutes -- The length of a recovery window, in minutes, in which recovery attempts can be retried. recovery_max_count --- The maximum number of recovery attempts of a failed component during a specified recovery window. recovery_type ---- The type of automatic recovery of failed services and components to use. The following are examples of valid values for recovery_type Attribute: recovery_type Commands State Transitions AUTO_START Start INSTALLED → STARTED FULL Install, Start, Restart, Stop INIT → INSTALLED, INIT → STARTED, INSTALLED → STARTED, STARTED → STARTED, STARTED → INSTALLED DEFAULT None Auto start feature disabled For Example:-- If you want your host component not to auto start whenever your VM crashes and reboots then you have to change the "recovery_type": "AUTO_START" to "recovery_type": "DEFAULT"(Auto start feature disabled). Similarly if you want to decrease the number of recovery attempt of any failed component, Then you have to change the value of recovery_max_count accordingly. Hope this article will help!!!!! Reference:-- https://cwiki.apache.org/confluence/display/AMBARI/Recovery%3A+auto+start+components
... View more
Labels:
11-29-2018
01:00 PM
@Krishna Venkata To access the NIFI through SSL you need to load the user certificate in your browser. Then you will be able to access the NIFI UI. To understand the procedure to do the same. Kindly follow the link below. https://www.batchiq.com/nifi-configuring-ssl-auth.html If it resolve your query. Kindly accept my answer. Hope it helps!!!!!!!!
... View more
11-28-2018
05:30 AM
@Kei Miyauchi, Great it worked!!!! Kindly accept my previous answer. Regarding your query, Which all components you are using to authenticate via knox.
... View more
11-27-2018
02:55 PM
@Kei Miyauchi, Do you see any error like this in your ambari-server logs and ambari-agent logs. 30 Oct 2018 17:12:14,908 WARN [ambari-client-thread-8243] JwtAuthenticationFilter:381 - JWT expiration date validation failed. 30 Oct 2018 17:12:14,910 WARN [ambari-client-thread-8243] JwtAuthenticationFilter:173 - JWT authentication failed - Invalid JWT token 30 Oct 2018 17:12:19,922 WARN [ambari-client-thread-8243] JwtAuthenticationFilter:381 - JWT expiration date validation failed. 30 Oct 2018 17:12:19,922 WARN [ambari-client-thread-8243] JwtAuthenticationFilter:173 - JWT authentication failed - Invalid JWT token. If this is the case then you should check the knosso.token.ttl property. This you can find in Ambari > Knox > Configs > Advanced knoxsso-topology. knosso.token.ttl should be 30 seconds by default. checkout the below kb article. https://community.hortonworks.com/content/supportkb/223278/errorjwt-authentication-failed-invalid-jwt-token-w.html and if this is not the issue then can you please upload the ambari-sever logs ambari-audit logs. Hope this helps!!!!!!!!
... View more
11-27-2018
05:52 AM
You can check the HDP version from Ambari UI. for 2.7 and Higher Version : Stack and Versions->Versions for 2.6 or Lower Version : Admin->Stack and Versions->Versions From Command Line try these commands to check the HDP version. hdp-select rpm -qa|grep hadoop
... View more