Member since 08-20-2018 | 26 Posts | 7 Kudos Received | 1 Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
6638 | 11-27-2018 02:55 PM |
12-16-2021
11:57 AM
@Saraali Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.
12-13-2021
03:06 AM
1 Kudo
When you install Cloudera Manager, you can configure the mail server that the Alert Publisher will use. If you need to change these settings later, you can do so under the Alert Publisher section of the Management Services configuration tab. Through the Alert Publisher we can receive alerts for different service health states such as Bad, Warning, and Good.

Under the Alert Publisher role of the Cloudera Manager Management Service, you can configure email or SNMP delivery of alert notifications, and you can also configure a custom script that runs in response to an alert. How to configure alerts is covered in the official Cloudera documentation; please check the links under [1]. Here we will discuss some of the common issues faced while configuring or receiving alert delivery.

Common Issues with the Alert Publisher

1- Sometimes alerts are not received at the SMTP/SNMP server end. In the alert-publisher logs, the following ERROR trace can be seen:

ERROR org.apache.camel.processor.DefaultErrorHandler: Failed delivery for (MessageId: ID-uxxxxxxxx on ExchangeId: ID-xxxxxxx). Exhausted after delivery attempt: 1 caught: javax.mail.MessagingException: Exception reading response;
nested exception is:
java.net.SocketTimeoutException: Read timed out

To troubleshoot this, first verify that connectivity between the server and the SMTP/SNMP host is good:

telnet <hostname> <port>

If connectivity is good, enable debug logging and check the Alert Publisher logs for more detail.

How to enable DEBUG:
Go to Cloudera Management Service > Alert Publisher > Configuration.
Under the advanced Java configuration options for the Alert Publisher, append the following:
"-Djavax.net.debug=all"
Then go to Logs and set the log level to DEBUG for the Alert Publisher.

We can also take a tcpdump capture and check whether the messages from the CM server are being accepted. For example:

tcpdump -i any -s 100000 -w ~/alertpub.out port <port>

2- The Alert Publisher fails to send alerts because the credentials are missing.

ERROR org.apache.camel.processor.DefaultErrorHandler: Failed delivery for (MessageId: ID-xxxxxxxxx on ExchangeId: ID-xxxxxx). Exhausted after delivery attempt: 1 caught: javax.mail.AuthenticationFailedException: failed to connect, no password specified?
javax.mail.AuthenticationFailedException: failed to connect, no password specified?
at javax.mail.Service.connect(Service.java:325)

Make sure the credentials have been provided in the Alert Publisher configuration:
CM > CMS > Configuration > Alerts: Mail Server Hostname, Alerts: Mail Server Username, Alerts: Mail Server Password

[1]
https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_ag_email.html#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--7d1d
https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_ag_snmp.html#xd_583c10bfdbd326ba-3ca24a24-13d80143249--7f27
https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_ag_alert_script.html#concept_sfx_lkw_yt
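As a scripted alternative to the telnet check in issue 1, here is a minimal Python sketch that opens a TCP connection to the mail server and waits for its first reply line. It is only an illustration and not part of Cloudera Manager: the function name read_smtp_banner is made up for this example, and it applies to plain SMTP only (not SNMP, and not ports that expect TLS from the first byte).

```python
import socket
from typing import Optional


def read_smtp_banner(host: str, port: int, timeout: float = 10.0) -> Optional[str]:
    """Connect to host:port and return the first line the server sends.

    Returns None if the connection cannot be opened or if nothing
    arrives before the timeout (the "Read timed out" situation).
    """
    try:
        # create_connection applies the timeout to connect and to recv.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            data = sock.recv(1024)
    except OSError:
        # Covers refused connections, DNS failures and socket timeouts.
        return None
    if not data:
        # The server closed the connection without sending anything.
        return None
    return data.decode("ascii", errors="replace").splitlines()[0]
```

Call it with your own mail server host and port, for example read_smtp_banner("<hostname>", 25). A line starting with 220 is the normal SMTP greeting. A None result means the connection failed or the server did not answer within the timeout, which is the point to continue with the debug logging and tcpdump steps.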
12-13-2021
12:39 AM
1 Kudo
@jh1688 Exit code 126 indicates that there is an issue with the permissions of the file being executed. If you can check or share a snapshot of stderr.log, it should give a hint as to which file it is trying to execute. Note: please accept the answer if it resolves the query.
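To illustrate the point about exit code 126, here is a small Python sketch that prints a file's permission bits and shows the status a POSIX shell returns when asked to run it. The helper names describe_permissions and exit_code_via_shell are made up for this example, and the path you pass in would be whichever file stderr.log points to.

```python
import os
import shlex
import stat
import subprocess


def describe_permissions(path: str) -> str:
    """Return the mode string and whether the current user may execute path."""
    mode = stat.filemode(os.stat(path).st_mode)  # e.g. -rw-r--r--
    executable = os.access(path, os.X_OK)
    return f"{path}: {mode} executable_by_current_user={executable}"


def exit_code_via_shell(path: str) -> int:
    """Ask sh to run path (must contain a slash) and return the exit status.

    A POSIX shell reports 126 when the file exists but cannot be executed.
    """
    result = subprocess.run(
        ["sh", "-c", shlex.quote(path)],
        stderr=subprocess.DEVNULL,
    )
    return result.returncode
```

If describe_permissions shows no x bit for the user the service runs as, a chmod on that file (or fixing its ownership) is the usual fix.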
08-05-2021
07:18 AM
Hi @gyadav, I have configured Knox SSO for the Ranger, HDFS, and YARN UIs, but I am getting a "username and password is incorrect" error. I have checked the knox-audit log and the Ambari logs but have not been able to find the root cause. The HDP environment is 3.0.1. Thanks in advance.
01-22-2021
02:54 AM
8 Kudos
Ambari Metrics System (AMS) collects, aggregates, and serves Hadoop and system metrics in Ambari-managed clusters. AMS has four components: Metrics Monitors, Hadoop Sinks, the Metrics Collector, and Grafana.

Metrics Monitors on each host in the cluster collect system-level metrics and publish them to the Metrics Collector.
Hadoop Sinks plug in to Hadoop components to publish Hadoop metrics to the Metrics Collector.
The Metrics Collector is a daemon that runs on a specific host in the cluster and receives data from the registered publishers, the Monitors and the Sinks.
Grafana is a daemon that runs on a specific host in the cluster and serves pre-built dashboards for visualizing the metrics collected in the Metrics Collector.

In this article, we will look at how to troubleshoot AMS-related issues effectively. Several kinds of problems can arise in AMS, such as the collector crashing, metrics not being available, Grafana startup failures, and time-range metric issues in Grafana. We will go step by step through each of these and how to troubleshoot it.

Issues that arise in AMS
Collector not coming up or crashing frequently
Metrics not available in Ambari UI
Grafana metric-related issues

Collector not coming up or crashing frequently

This is the most common problem with the Ambari Metrics Collector (AMC). There can be multiple reasons for frequent or intermittent AMC crashes. Here is how to approach and debug the issue step by step.

First, check that the AMS binaries match the current Ambari version. [ Verify on all the hosts where the Metrics Monitor is running ]

[root@c1236-node4 ~]# rpm -qa|grep ambari-metrics
ambari-metrics-monitor-2.7.5.0-72.x86_64
ambari-metrics-collector-2.7.5.0-72.x86_64
ambari-metrics-hadoop-sink-2.7.5.0-72.x86_64

If the versions do not match, please upgrade AMS accordingly. Follow "Upgrading Ambari Metrics" for the upgrade process.

You may see ZooKeeper-related errors like the following in the AMC logs:

2018-03-21 15:18:14,996 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop.datalonga.com/10.XX.XX.XX:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-21 15:18:14,997 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)

In that case, check the collector-gc.log and gc.log files to see whether the heap is completely or almost full. In the sample line below, the heap still holds 506400K of its 506816K capacity after the collection. If you observe something similar, increase the heap space for the collector.

2020-09-28T06:27:47.846+0000: 503090.803: [GC (Allocation Failure) 2020-09-28T06:27:47.846+0000: 503090.804: [ParNew: 145749K->1896K(157248K), 0.0099317 secs] 506788K->506400K(506816K), 0.0103397 secs] [Times: user=0.13 sys=0.03, real=0.01 secs]

We can also clear the ZooKeeper data and then restart AMS from Ambari. For more information, see "Cleaning up Ambari Metrics System Data".

Sometimes the logs show the error below, which means that the default port 6188 is already in use.

INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
2017-08-04 15:30:36,965 INFO org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException
java.net.BindException: Port in use: xxxxxx:6188
at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:919)

In that scenario, check which process is using the port before starting AMC:

netstat -tnlpa | grep 6188

Go to Ambari UI > Ambari Metrics > Configs (tab) > Advanced (child tab), navigate to Advanced ams-site, and search for the following property:

timeline.metrics.service.webapp.address = 0.0.0.0:6188

You can change the port to another value to avoid the conflict, then try restarting AMS.

In most cases, we will observe ASYNC_PROCESS logging in the AMC log. In that case, check hbase-ams-master-<hostname>.log [ EMBEDDED mode ] or hbase-ams-regionserver-<hostname>.log [ DISTRIBUTED mode ]. You will see the following lines frequently in these logs.

WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 4058ms
GC pool 'ParNew' had collection(s): count=1 time=2415ms
WARN [RpcServer.FifoWFPBQ.default.handler=32,queue=2,port=61320] ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)",
"starttimems":1609769073845,"responsesize":739217,"method":"Multi","processingtimems":10003,"client":"10.118.5.94:51114","queuetimems":0,"class":"HRegionServer"}
WARN [7,queue=0,port=16020] regionserver.RSRpcServices - Large batch operation detected (greater than 5000) (HBASE-18023). Requested Number of Rows: 7096 Client: xxx

In such cases, check whether the services are generating more metrics than AMC can handle with its current configuration. To fix this type of issue, use the pointers below to stabilize the collector.

Check the metadata output and identify which services are generating a large number of metrics (>15k-20k):

http://<ams-host>:6188/ws/v1/timeline/metrics/metadata

Increase the heap size of the region server (in Distributed mode) or the HBase master (in Embedded mode). [ Check the gc.log and collector.log files to understand the current GC utilization. ]

If services are generating a large number of metrics, limit them by implementing whitelisting or blacklisting and check whether AMS stabilizes. See "Ambari Metrics - Whitelisting".

In some cases, the region server logs show the following message, which indicates an issue with region opening on the region server.

ERROR [RpcServer.FifoWFPBQ.priority.handler=19,queue=1,port=61320] regionserver.HRegionServer: Received CLOSE for a region which is not online, and we're not opening.

To mitigate such issues, try the following steps and then check the status of AMC.

Connect with the ZooKeeper CLI->
/usr/lib/ams-hbase/bin/hbase --config /etc/ams-hbase/conf zkcli
Remove the AMS znode->
rmr /ams-hbase-secure
Set a new znode name in the AMS configuration property zookeeper.znode.parent and restart AMS.

Metrics not available in Ambari UI

If metrics are not showing in the Ambari UI, first check whether there is any issue with the collector. Below are some known issues in recent Ambari versions where graphs are not available even though AMC is running fine.

In Ambari 2.7.3 and above, we have observed that NIFI, Ambari_Metrics, and Kafka do not show metrics. Here are the workarounds to mitigate the issue.

a) NIFI->
1) vi /var/lib/ambari-server/resources/common-services/NIFI/1.0.0/metainfo.xml
Change <timelineAppid>NIFI</timelineAppid> to <timelineAppid>nifi</timelineAppid> in two places.
2) Replace "<timelineAppid>NIFI</timelineAppid>" with "<timelineAppid>nifi</timelineAppid>" in the file /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/common-services/NIFI/1.0.0/metainfo.xml
b) Ambari_Metrics->
Replace "<timelineAppid>AMS-HBASE</timelineAppid>" with "<timelineAppid>ams-hbase</timelineAppid>" in the file /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/stacks/HDF/3.2.b/services/AMBARI_METRICS/metainfo.xml
c) Kafka->
Add "<timelineAppid>kafka_broker</timelineAppid>" after "<name>KAFKA_BROKER</name>" in the file /var/lib/ambari-server/resources/mpacks/hdf-ambari-mpack-3.5.1.*/stacks/HDF/3.3/services/KAFKA/metainfo.xml

After the above changes, restart ambari-server.

Grafana metric-related issues

Sometimes we see issues with Grafana such as metrics not being available, time-range metrics not being shown, or inaccurate information in graphs. To check Grafana issues, go through the following sequence.

1. Check that AMC is running fine and that you can see aggregation happening in the AMC logs.

2021-01-22 09:22:23,051 INFO TimelineMetricHostAggregatorMinute: End aggregation cycle @ Fri Jan 22 09:22:23 UTC 2021
2021-01-22 09:23:20,780 INFO TimelineClusterAggregatorSecond: Started Timeline aggregator thread @ Fri Jan 22 09:23:20 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Last Checkpoint read : Fri Jan 22 09:20:00 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Rounded off checkpoint : Fri Jan 22 09:20:00 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Last check point time: 1611307200000, lagBy: 200 seconds.
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Start aggregation cycle @ Fri Jan 22 09:23:20 UTC 2021, startTime = Fri Jan 22 09:20:00 UTC 2021, endTime = Fri Jan 22 09:22:00 UTC 2021
2021-01-22 09:23:20,784 INFO TimelineClusterAggregatorSecond: Skipping aggregation for metric patterns : sdisk\_%,boottime
2021-01-22 09:23:23,129 INFO TimelineClusterAggregatorSecond: Saving 23764 metric aggregates.
2021-01-22 09:23:23,868 INFO TimelineClusterAggregatorSecond: End aggregation cycle @ Fri Jan 22 09:23:23 UTC 2021

2. If data is missing from all the graphs, check the metadata output and look for the service's metrics in it.

3. If data is missing from only a few graphs, check the metadata output for that particular service. Example, if data is missing for the HiveServer2 metrics:

curl -v --insecure "https://<ams-host>:6188/ws/v1/timeline/metrics/metadata?appId=hiveserver2" | python -m json.tool > hiveserver2.txt

4. Check whether any whitelisting or blacklisting has been applied in the configuration that might prevent AMS from processing those metrics. Also check whether a setting such as "Disable Minute host aggregator" is enabled in the configs.

5. There are some known Grafana issues in recent versions of Ambari. A few of them are listed here:
AMBARI-25570
AMBARI-25563
AMBARI-25383
AMBARI-25457

Happy Learning!
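For the collector-gc.log check described in the collector section, here is a small Python sketch that pulls the heap figures out of a GC log line and reports how full the heap is after the collection. It assumes the ParNew-style "before->after(capacity)" format of the sample line in this article; the function name heap_usage_after_gc is made up for this example, and other GC log formats would need a different pattern.

```python
import re
from typing import Optional

# Matches "usedBefore->usedAfter(capacity)" groups, all in kilobytes.
# On a ParNew line the last such group describes the whole heap.
_HEAP_RE = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")


def heap_usage_after_gc(line: str) -> Optional[float]:
    """Return heap occupancy after GC as a fraction of capacity, or None."""
    matches = _HEAP_RE.findall(line)
    if not matches:
        return None
    _before, after, capacity = (int(value) for value in matches[-1])
    return after / capacity


# The sample GC line from this article.
sample = (
    "2020-09-28T06:27:47.846+0000: 503090.803: [GC (Allocation Failure) "
    "2020-09-28T06:27:47.846+0000: 503090.804: "
    "[ParNew: 145749K->1896K(157248K), 0.0099317 secs] "
    "506788K->506400K(506816K), 0.0103397 secs] "
    "[Times: user=0.13 sys=0.03, real=0.01 secs]"
)
usage = heap_usage_after_gc(sample)
print(f"heap after GC: {usage:.1%}")  # prints "heap after GC: 99.9%"
```

A value that stays close to 100% across many consecutive lines is the "heap almost full" situation where increasing the collector heap is worth trying.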
09-16-2019
07:00 AM
Has anyone implemented the same on Cloudera rather than Ambari?
12-10-2018
07:00 AM
3 Kudos
When an Ambari agent starts, it bootstraps with the Ambari server via registration. The server sends the agent information about the components that have been enabled for auto start, along with the other auto start properties in ambari.properties. The agent compares the current state of these components against the desired state to determine whether they should be installed, started, restarted, or stopped.

These are the values ambari-server sends to ambari-agent by default unless they are configured in the cluster-env.xml file:

"recovery_max_count": "6",
"recovery_lifetime_max_count": "1024",
"recovery_type": "AUTO_START",
"recovery_window_in_minutes": "60",

recovery_lifetime_max_count: The maximum number of recovery attempts of a failed component during the lifetime of an Ambari Agent instance. This is reset when the Ambari Agent is restarted.
recovery_window_in_minutes: The length of a recovery window, in minutes, in which recovery attempts can be retried.
recovery_max_count: The maximum number of recovery attempts of a failed component during a specified recovery window.
recovery_type: The type of automatic recovery of failed services and components to use.

The following are examples of valid values for recovery_type:

Attribute: recovery_type | Commands | State Transitions |
---|---|---|
AUTO_START | Start | INSTALLED → STARTED |
FULL | Install, Start, Restart, Stop | INIT → INSTALLED, INIT → STARTED, INSTALLED → STARTED, STARTED → STARTED, STARTED → INSTALLED |
DEFAULT | None | Auto start feature disabled |

For example: if you do not want your host components to auto start whenever your VM crashes and reboots, change "recovery_type": "AUTO_START" to "recovery_type": "DEFAULT" (auto start feature disabled). Similarly, if you want to decrease the number of recovery attempts for a failed component, change the value of recovery_max_count accordingly.

Hope this article helps!

Reference: https://cwiki.apache.org/confluence/display/AMBARI/Recovery%3A+auto+start+components
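To make the interaction between recovery_max_count, recovery_window_in_minutes, and recovery_lifetime_max_count more concrete, here is a toy Python model of the limits as they are described above. It is not the Ambari agent's code: the class name RecoveryBudget is made up for this example, and treating the window as a sliding window is an assumption on my side (the agent may count windows differently).

```python
from collections import deque
from typing import Deque


class RecoveryBudget:
    """Toy model of the auto start attempt limits (not Ambari's implementation)."""

    def __init__(self, max_count: int = 6, window_minutes: int = 60,
                 lifetime_max_count: int = 1024) -> None:
        self.max_count = max_count                    # recovery_max_count
        self.window_seconds = window_minutes * 60     # recovery_window_in_minutes
        self.lifetime_max_count = lifetime_max_count  # recovery_lifetime_max_count
        self.lifetime_attempts = 0                    # reset only on agent restart
        self._recent: Deque[float] = deque()          # attempt times in the window

    def try_attempt(self, now: float) -> bool:
        """Record an attempt at time `now` (in seconds) if the limits allow it."""
        # Drop attempts that have fallen out of the current window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if self.lifetime_attempts >= self.lifetime_max_count:
            return False
        if len(self._recent) >= self.max_count:
            return False
        self._recent.append(now)
        self.lifetime_attempts += 1
        return True
```

With the defaults, a component that keeps failing gets at most 6 start attempts in any 60 minute window in this model, and lowering recovery_max_count reduces that number in the same way.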