Hadoop can be deployed on a variety of scales. The requirements at each of these will be different. Hadoop has a large number of tunable parameters that can be used to influence its operation. Furthermore, there are a number of other technologies which can be deployed with Hadoop for additional capabilities.
Multiple tools exist to monitor large clusters for performance and troubleshooting. This section briefly highlights two such tools.
Ganglia is a performance monitoring framework for distributed systems. Ganglia provides a distributed service which collects metrics on individual machines and forwards them to an aggregator which can report back to an administrator on the global state of a cluster.
Ganglia is designed to be integrated into other applications to collect statistics about their operation. Hadoop includes a performance monitoring framework which can use Ganglia as its backend. Instructions are available on the Hadoop wiki as to how to enable Ganglia metrics in Hadoop. Instructions are also included below.
After installing and configuring Ganglia on your cluster, to direct Hadoop to output its metric reports to Ganglia, create a file named hadoop-metrics.properties in the $HADOOP_HOME/conf directory. The file should have the following contents:
This assumes that gmond is running on each machine in the cluster. Instructions on the Hadoop wiki note that (in the experience of the wiki article author) this may result in all nodes reporting their results as "localhost" instead of with their individual hostnames. If this problem affects your cluster, an alternate configuration is proposed, in which all Hadoop instances speak directly with gmetad:
Where @GMETAD@ is the hostname of the server on which the gmetad service is running. If deploying Ganglia and Hadoop on a very large number of machines, the impact of this configuration (vs. the standard Ganglia configuration where individual services talk to gmond on localhost) should be evaluated.
Nagios is a machine and service monitoring system designed for large clusters. Nagios will provide useful diagnostic information for tuning your cluster, including network, disk, and CPU utilization across machines.
The following are a few additional pieces of small advice:
Ganglia & Nagios have been deprecated & outside of HDP Support since HDP 2.3. Currently, AMS (Ambari Metrics System) is recommended for HDP monitoring. Above link details with the same.