Hadoop can be deployed on a variety of scales. The requirements at each of these will be different. Hadoop has a large number of tunable parameters that can be used to influence its operation. Furthermore, there are a number of other technologies which can be deployed with Hadoop for additional capabilities.
Multiple tools exist to monitor large clusters for performance and troubleshooting. This section briefly highlights two such tools.
Ganglia is a performance monitoring framework for distributed systems. Ganglia provides a distributed service which collects metrics on individual machines and forwards them to an aggregator which can report back to an administrator on the global state of a cluster.
Ganglia is designed to be integrated into other applications to collect statistics about their operation. Hadoop includes a performance monitoring framework which can use Ganglia as its backend. Instructions are available on the Hadoop wiki as to how to enable Ganglia metrics in Hadoop. Instructions are also included below.
After installing and configuring Ganglia on your cluster, to direct Hadoop to output its metric reports to Ganglia, create a file named hadoop-metrics.properties in the $HADOOP_HOME/conf directory. The file should have the following contents:
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContextdfs.period=10dfs.servers=localhost:8649mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContextmapred.period=10mapred.servers=localhost:8649
This assumes that gmond is running on each machine in the cluster. Instructions on the Hadoop wiki note that (in the experience of the wiki article author) this may result in all nodes reporting their results as "localhost" instead of with their individual hostnames. If this problem affects your cluster, an alternate configuration is proposed, in which all Hadoop instances speak directly with gmetad:
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContextdfs.period=10dfs.servers=@GMETAD@:8650mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContextmapred.period=10mapred.servers=@GMETAD@:8650
Where @GMETAD@ is the hostname of the server on which the gmetad service is running. If deploying Ganglia and Hadoop on a very large number of machines, the impact of this configuration (vs. the standard Ganglia configuration where individual services talk to gmond on localhost) should be evaluated.
Nagios is a machine and service monitoring system designed for large clusters. Nagios will provide useful diagnostic information for tuning your cluster, including network, disk, and CPU utilization across machines.
The following are a few additional pieces of small advice:
Created 02-22-2017 11:52 AM
@kiran gutha - You might want to post this as an article. You posted this as a question.
Please convert this into an article.
Created 02-22-2017 11:52 AM
This article does not bring anything new to HDP. HDP uses Ambari Metrics, Grafana etc.
Created 02-26-2018 09:24 AM
https://community.hortonworks.com/questions/118525/nagios-and-ganglia-integation-into-ambari.html
Ganglia & Nagios have been deprecated & outside of HDP Support since HDP 2.3. Currently, AMS (Ambari Metrics System) is recommended for HDP monitoring. Above link details with the same.