Ambari is the heart of any HDP cluster. It provides the features for provisioning, managing, monitoring and securing Hadoop/HDP clusters. Ambari Server is a Java program that runs on an embedded Jetty server and interacts with a database to read the cluster details. Many times we find issues with Ambari Server performance.

Ambari UI operations sometimes respond slowly, or the startup might take a long time, if the server is not properly tuned. So in order to troubleshoot Ambari Server performance related issues we should look at some data/stats and tuning parameters to make the Ambari Server perform better. In this article we will talk about some very basic tuning parameters and performance related troubleshooting.

What information is needed?

When we notice that the Ambari Server is responding slowly, we should look at the following details first:

1). The number of hosts added to the Ambari cluster, so that we can tune the Ambari agent thread pools accordingly.

2). The number of concurrent users (or view users) who access the Ambari Server at a time, so that we can tune the Ambari thread pools accordingly.

3). The age of the Ambari cluster. If the Ambari Server is very old, then some of the operational logs and alert histories may be consuming a large amount of the database, which might be causing Ambari DB queries to respond slowly.

4). The Ambari database health and its geographic location relative to the Ambari Server, to isolate any network delays.

5). The Ambari Server memory related tuning parameters, to see if the Ambari heap is set correctly.

6). For Ambari UI slowness, whether there are any network proxies added between the client and the Ambari Server machine, or any general network slowness.

7). Whether the Ambari users are synced with AD or an external LDAP, and whether the communication between the server and the AD/LDAP is healthy.

8). The resource availability on the Ambari host, such as the available free memory, and whether any other service/component running on the Ambari Server host is consuming excessive memory/CPU/IO.

.

How to Troubleshoot?

Usually we start by checking the Ambari Server memory settings, host level resource availability (like memory/CPU/IO) and the thread dumps, to see where threads are stuck or taking a long time to execute certain API/database calls.

.

Check-1). We will check the ambari-server log to see if there are any repeated warning or error messages.
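
For example, a quick scan like the following (assuming the default log location "/var/log/ambari-server/ambari-server.log") can surface repeated errors or warnings:

# grep -E 'ERROR|WARN' /var/log/ambari-server/ambari-server.log | tail -50
# tail -f /var/log/ambari-server/ambari-server.log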

.

Check-2). First we should check whether the ambari-server host has enough free memory and CPU available. Also check the list of open files (to see if there are any leaks) and the netstat output to find out if there are many CLOSE_WAIT or TIME_WAIT sockets. We can check this by running the following commands on the Ambari Server host.

Example:

# free -m 
# top
# lsof -p $AMBARI_PID
# netstat -tnpa | grep $AMBARI_PID
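
To quickly spot a CLOSE_WAIT / TIME_WAIT buildup, the TCP connection states on the host can also be summarized; a small sketch:

# netstat -tna | awk 'NR>2 {print $6}' | sort | uniq -c | sort -rn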

.

Check-3). If we see that enough free memory and CPU cycles are available, then we can check whether the thread dump shows any stuck/blocked threads or whether the thread activity looks normal.

In order to do that we can collect ambari-server thread dumps. We can refer to the following article to learn how to collect the Ambari Server thread dumps. We can use JVM utilities like "$JAVA_HOME/bin/jcmd" or "$JAVA_HOME/bin/jstack" to do so.

https://community.hortonworks.com/articles/72319/how-to-collect-threaddump-using-jcmd-and-analyse-i....

It is always recommended to collect at least 5-6 thread dumps with an interval of around 10 seconds between dumps. This gives us a detailed idea about the thread activity over a period of time. The thread dumps should be collected while we see the slow response from the Ambari Server, otherwise the thread dumps will show normal behavior.
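
A minimal sketch for collecting those dumps (it assumes the default PID file location and that JAVA_HOME points to the JDK used by Ambari; run it as the user that owns the Ambari Server process):

AMBARI_PID=$(cat /var/run/ambari-server/ambari-server.pid)
for i in $(seq 1 6); do
    $JAVA_HOME/bin/jstack -l $AMBARI_PID > /tmp/ambari-threaddump-$(date +%Y%m%d%H%M%S).txt
    sleep 10
done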

.

Check-4). Sometimes we may encounter an OutOfMemoryError in the ambari-server log like the following, which indicates that the Ambari Server heap size is not tuned properly or needs to be increased a bit more:

    Exception in thread "qtp-ambari-agent-91" java.lang.OutOfMemoryError: Java heap space

There are recommendations for Ambari Server heap tuning based on the cluster size in the following doc: https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.0.0/bk_ambari-administration/content/ch_tuning_...

.

We should also check the current memory utilization statistics of the Ambari Server. We can use the JVM utility "jmap" for this.

Example:

/usr/jdk64/jdk1.8.0_112/bin/jmap -heap $AMBARI_SERVER_PID


Output:

# /usr/jdk64/jdk1.8.0_112/bin/jmap -heap `cat /var/run/ambari-server/ambari-server.pid`
Attaching to process ID 673, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.112-b15
using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC
Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 2147483648 (2048.0MB)
   NewSize                  = 134217728 (128.0MB)
   MaxNewSize               = 536870912 (512.0MB)
   OldSize                  = 402653184 (384.0MB)
   NewRatio                 = 3
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 120848384 (115.25MB)
   used     = 78420056 (74.78719329833984MB)
   free     = 42428328 (40.462806701660156MB)
   64.89127401157471% used
Eden Space:
   capacity = 107479040 (102.5MB)
   used     = 72431960 (69.07649993896484MB)
   free     = 35047080 (33.423500061035156MB)
   67.39170725752668% used
From Space:
   capacity = 13369344 (12.75MB)
   used     = 5988096 (5.710693359375MB)
   free     = 7381248 (7.039306640625MB)
   44.7897518382353% used
To Space:
   capacity = 13369344 (12.75MB)
   used     = 0 (0.0MB)
   free     = 13369344 (12.75MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 402653184 (384.0MB)
   used     = 87617376 (83.55844116210938MB)
   free     = 315035808 (300.4415588378906MB)
   21.760010719299316% used
37359 interned Strings occupying 3641736 bytes.

.

If the used heap is high and reaching the max heap, then we can try increasing the ambari-server memory by editing the "/var/lib/ambari-server/ambari-env.sh" file and increasing the heap memory (for example -Xmx4g) inside the "AMBARI_JVM_ARGS" property, as follows:

# grep 'AMBARI_JVM_ARGS' /var/lib/ambari-server/ambari-env.sh
export AMBARI_JVM_ARGS=$AMBARI_JVM_ARGS' -Xms4g -Xmx4g -XX:MaxPermSize=128m -Djava.security.auth.login.config=$ROOT/etc/ambari-server/conf/krb5JAASLogin.conf -Djava.security.krb5.conf=/etc/krb5.conf -Djavax.security.auth.useSubjectCredsOnly=false'
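
After changing the heap settings, the Ambari Server must be restarted for the new JVM arguments to take effect:

# ambari-server restart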

.

Check-5). If we want to monitor heap and garbage collection details over a period of time, then we can also enable Garbage Collection logging for the Ambari Server by adding the GC log options in the "ambari-env.sh" file as follows:

# grep 'AMBARI_JVM_ARGS' /var/lib/ambari-server/ambari-env.sh
export AMBARI_JVM_ARGS=$AMBARI_JVM_ARGS' -Xms512m -Xmx2048m -XX:MaxPermSize=128m -Djava.security.auth.login.config=$ROOT/etc/ambari-server/conf/krb5JAASLogin.conf -Djava.security.krb5.conf=/etc/krb5.conf -Djavax.security.auth.useSubjectCredsOnly=false  -Xloggc:/var/log/ambari-server/ambari-server_gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps'
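
Once GC logging is enabled (and the Ambari Server restarted), a quick check like the following shows whether long or frequent Full GC pauses are occurring; the log file name below matches the -Xloggc pattern used above:

# grep 'Full GC' /var/log/ambari-server/ambari-server_gc.log-*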

.

.

Ambari JVM/Database Monitoring using Grafana

Check-6). From Ambari 2.5 onward, we can also check the Ambari performance statistics related to the Ambari JVM and database. For more information on this please refer to: https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.1.0/bk_ambari-operations/content/grafana_ambari...

http://$GRAFANA_HOST:3000/dashboard/db/ambari-server-jvm
http://$GRAFANA_HOST:3000/dashboard/db/ambari-server-database

.

[Image: Ambari Server JVM metrics dashboard in Grafana]

If the Ambari Server metrics are not enabled, then we can enable them. To enable Ambari Server metrics, make sure the following config file exists during Ambari Server start/restart: "/etc/ambari-server/conf/metrics.properties".

Currently, only 2 metric sources have been implemented: the JVM Metric Source and the Database Metric Source.
To add or remove a metric source to be tracked, the following config needs to be modified in the metrics.properties file.

metric.sources=jvm,database

Example:

# grep 'metric.sources' /etc/ambari-server/conf/metrics.properties
metric.sources=jvm,database

.

NOTE: Please do not forget to add the following line inside the "ambari.properties" file.

# grep 'profiler' /etc/ambari-server/conf/ambari.properties
server.persistence.properties.eclipselink.profiler=org.apache.ambari.server.metrics.system.impl.AmbariPerformanceMonitor

[Image: Ambari Server database metrics dashboard in Grafana]

.

.

 

Ambari Thread Pool Tuning

Check-7). If the cluster size is large then we should also tune the "agent.threadpool.size.max" property inside the "/etc/ambari-server/conf/ambari.properties" file.

"agent.threadpool.size.max" : property sets max number of threads used to process heartbeats from ambari agents. The default value for this property is "25". This basically indicates the size of the Jetty connection pool used for handling incoming Ambari Agent requests.

# grep 'agent.threadpool.size.max' /etc/ambari-server/conf/ambari.properties
agent.threadpool.size.max=50

.

.

Check-8). If our Ambari Server hosts some views (like the Hive/File View etc.) which are accessed by many concurrent users, or if many users access the Ambari UI concurrently or make Ambari REST API calls, then we should also increase the value of the "client.threadpool.size.max" property (default value is 25) inside "/etc/ambari-server/conf/ambari.properties".

"client.threadpool.size.max" : The size of the Jetty connection pool used for handling incoming REST API requests. This should be large enough to handle requests from both web browsers and embedded Views.

# grep 'client.threadpool.size.max' /etc/ambari-server/conf/ambari.properties
client.threadpool.size.max=100

If the client thread pool size is not set properly, then while accessing the Ambari UI or making Ambari API calls we might see the following kind of response:

    {
      status: 503,
      message: "There are no available threads to handle view requests"
    }

.

.

 

Ambari Connection Pool Tuning

Check-9). We can also add the following properties to adjust the JDBC connection pool settings for large clusters (above roughly 100 nodes) or based on need:

server.jdbc.connection-pool.acquisition-size=5
server.jdbc.connection-pool.max-age=0
server.jdbc.connection-pool.max-idle-time=14400
server.jdbc.connection-pool.max-idle-time-excess=0
server.jdbc.connection-pool.idle-test-interval=7200

- If using MySQL as the Ambari database, in your MySQL configuration increase the wait_timeout and interactive_timeout to 8 hours (28800 seconds) and max_connections from 32 to 128.


- It is critical that the Ambari settings "server.jdbc.connection-pool.max-idle-time" and "server.jdbc.connection-pool.idle-test-interval" stay lower than the MySQL "wait_timeout" and "interactive_timeout". If you choose to decrease those MySQL timeout values, adjust "server.jdbc.connection-pool.max-idle-time" and "server.jdbc.connection-pool.idle-test-interval" down accordingly in the Ambari configuration so that they remain less than "wait_timeout" and "interactive_timeout".
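
As a rough sketch, the corresponding MySQL side settings could look like the following in my.cnf (the exact config file location depends on your MySQL installation; restart MySQL after changing it):

[mysqld]
# Keep these higher than Ambari's connection-pool idle settings above
wait_timeout=28800
interactive_timeout=28800
max_connections=128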

.

.

 

Ambari Cache Tuning

Check-10). If the cluster size is more than 200 nodes, then tuning the cache sometimes helps. For that we calculate the new, larger cache size using the following relationship, where <cluster_size> is the number of nodes in the cluster.

The approximate value is calculated as:

ecCacheSizeValue=60*<cluster_size>

To apply that property:

On the Ambari Server host, in /etc/ambari-server/conf/ambari.properties, add the following property and value. For example, if the cluster has 500 nodes then we can set it to:

server.ecCacheSize=30000

.

.

 

Ambari Alert Related Tuning

Check-11). Setting "alerts.cache.enabled": if the value for this property is set to "true", then alerts processed by the "AlertReceivedListener" will not write alert data to the database on every event. Instead, data like timestamps and text will be kept in a cache and flushed to the database periodically. The default value is "false". Alert caching was experimental around the Ambari 2.2.2 version.

We can enable the alerts cache and then monitor it for a few days to see its effect. We will need to add this parameter to "/etc/ambari-server/conf/ambari.properties". Some other properties related to alert caching and the alert execution scheduler are as follows.

 

Example:

alerts.cache.enabled=true
alerts.cache.size=100000
alerts.execution.scheduler.threadpool.size.core=4
alerts.execution.scheduler.threadpool.size.max=8


The "alerts.cache.size" defines the size of the alert cache which is by default set to "50000" when the alerts.cache.enabled.

"alerts.execution.scheduler.threadpool.size.core" defines the core number of threads used to process incoming alert events. The value should be increased as the size of the cluster increases.

"alerts.execution.scheduler.threadpool.size.max" defines the maximum number of threads which will handle published alert events. Default value is "2".

.

.

Ambari API Response Time Check

Check-12). During Ambari slowness we can try running the following curl calls (which fetch the cluster details) to see how much time it takes to get the cluster details. This gives us some idea whether the cluster JSON response is taking a long time or is too large.

# time curl -i -u admin:admin -H 'X-Requested-By: ambari' -X GET http://amb25101.example.com:8080/api/v1/clusters/plain_cluster
real    0m20.234s
user    0m0.009s
sys     0m0.017s
# time curl -i -u admin:admin -H 'X-Requested-By: ambari' -X GET  http://amb25101.example.com:8080/api/v1/clusters/plain_cluster?fields=Clusters/desired_configs
# time curl -i -u admin:admin -H 'X-Requested-By: ambari' -X GET  http://amb25101.example.com:8080/api/v1/clusters/plain_cluster?fields=Clusters/health_report,Cluster...

"user" means userspace, so the number of CPU seconds spent doing work in the JVM code. User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.


"sys" means kernel-space, so the number of cpu-seconds spent doing work in the kernel. Sys is the amount of CPU time spent in the kernel within the process. This means executing CPU time spent in system calls within the kernel, as opposed to library code, which is still running in user-space. Like 'user', this is only CPU time used by the process.


"real" means "wall lock" time. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).


For example, ["user=3.00 sys=0.05 real=1.00"] means there was

  >>> 50ms of kernel work, 
  >>> 3s of jvm work and 
  >>> overall it took 1 second
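
Optionally, curl's built-in timing variables can break the response time down further (a sketch; adjust the URL and credentials for your cluster):

# curl -o /dev/null -s -u admin:admin -H 'X-Requested-By: ambari' -w 'connect: %{time_connect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n' http://amb25101.example.com:8080/api/v1/clusters/plain_cluster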

.

.

 

Ambari Database Query Logging

Check-13). In some cases it is useful to enable Database Query Logging to find out how the queries are getting executed and how often each query is executed.

We can enable the "server.jdbc.properties.loglevel=2" property inside the "/etc/ambari-server/conf/ambari.properties" file and restart the ambari server which will start writing the JDBC queries to the "/var/log/ambari-server/ambari-server.out" file.

# grep 'server.jdbc.properties.loglevel' /etc/ambari-server/conf/ambari.properties
server.jdbc.properties.loglevel=2

.

Example output of logged queries from ambari-server.out

# grep 'SELECT alert_' ambari-server.out 
16:17:19.432 (3)  FE=> Parse(stmt=null,query="SELECT alert_id, alert_definition_id, alert_instance, alert_label, alert_state, alert_text, alert_timestamp, cluster_id, component_name, host_name, service_name FROM alert_history WHERE (alert_id = $1)",oids={20})
16:17:19.439 (6)  FE=> Parse(stmt=null,query="SELECT alert_id, alert_definition_id, alert_instance, alert_label, alert_state, alert_text, alert_timestamp, cluster_id, component_name, host_name, service_name FROM alert_history WHERE (alert_id = $1)",oids={20})
16:26:38.424 (3)  FE=> Parse(stmt=null,query="SELECT t1.alert_id AS a1, t1.definition_id AS a2, t1.firmness AS a3, t1.history_id AS a4, t1.latest_text AS a5, t1.latest_timestamp AS a6, t1.maintenance_state AS a7, t1.occurrences AS a8, t1.original_timestamp AS a9 FROM alert_history t0, alert_definition t2, alert_current t1 WHERE ((((t0.cluster_id = $1) AND (t2.definition_name = $2)) AND (t0.host_name = $3)) AND ((t0.alert_id = t1.history_id) AND (t2.definition_id = t0.alert_definition_id))) LIMIT $4 OFFSET $5",oids={20,1043,1043,23,23})
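
To get a rough idea of which statements are executed most often, the logged "Parse" lines can be summarized; a small sketch based on the output format shown above:

# grep 'FE=> Parse' /var/log/ambari-server/ambari-server.out | sed 's/.*query="//;s/",oids.*//' | sort | uniq -c | sort -rn | head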

.

.

Ambari Database Query/Performance Monitor

Check-14). In some cases it is also useful to enable the "QueryMonitor" and "PerformanceMonitor" statistics. The "QueryMonitor" is used to measure query executions and cache hits, which can be useful for performance analysis in a complex system. The "server.persistence.properties.eclipselink.jdbc.batch-writing.size" property controls the number of statements to batch (default: 100).

Instead of "QueryMonitor" We also use native EclipseLink "PerformanceMonitor" to count how many queries are actually hitting the DB. The performance monitor and query monitor can be enabled in ambari through "/etc/ambari-server/conf/ambari.properties" using the below property:

Example:

# To enable the PerformanceMonitor:
server.persistence.properties.eclipselink.profiler=PerformanceMonitor
server.persistence.properties.eclipselink.jdbc.batch-writing.size=25

# Or, to enable the QueryMonitor instead (only one profiler setting is effective at a time):
server.persistence.properties.eclipselink.profiler=QueryMonitor

In order to know more about how to use them properly, we can refer to the following article: https://community.hortonworks.com/articles/73269/how-to-analyze-the-ambari-servers-db-activity-perf....

.

.

Ambari Database Cleanup / Purge

Check-15). In some old clusters we see lots of old "alert_history" or alert notification entries present in the database, and these entries keep growing over time. The DB dump size grows as well, and the DB queries can respond slowly as a result. We can use the following command to perform some DB cleanup.

# ambari-server db-cleanup -d 2016-09-30 --cluster-name=MyCluster
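
Before and after the cleanup, a rough check of how much the alert tables have grown can help confirm the effect (a sketch for a PostgreSQL-backed Ambari database, assuming the default "ambari" database and user):

# psql -U ambari -d ambari -c "SELECT count(*) FROM alert_history;"
# psql -U ambari -d ambari -c "SELECT count(*) FROM alert_notice;"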

For more details on this refer to: https://community.hortonworks.com/articles/134958/ambari-database-cleanup-speed-up.html

https://issues.apache.org/jira/browse/AMBARI-20687

The db-cleanup works well from Ambari 2.5.0/2.5.1 onward (in Ambari 2.4 there were some reported issues).

.

From Ambari 2.5.2 onwards: the name of this operation is changed to "db-purge-history", and apart from the alert related tables it also considers other tables like host_role_command and execution_commands, among others.

 

# ambari-server db-purge-history --cluster-name Prod --from-date 2017-08-01

See: https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.2.0/bk_ambari-administration/content/purging-am...

The "db-purge-history" command will analyze the following tables in the Ambari Server database and remove those rows that can be deleted that have a create date after the --from-date specified when the command is run.

.

AlertCurrent
AlertNotice
ExecutionCommand
HostRoleCommand
Request
RequestOperationLevel
RequestResourceFilter
RoleSuccessCriteria
Stage
TopologyHostRequest
TopologyHostTask
TopologyLogicalTask

.

.
