Created on 09-27-2017 12:25 PM - edited 09-16-2022 05:18 AM
I have a graphite server, to which I want to send Hadoop metrics2.
On paper it's easy. Just add log4j.logger.org.apache.hadoop.metrics2=DEBUG to the log4j template and update the hadoop-metrics2.properties template with:
*.sink.graphite.class=org.apache.hadoop.metrics2.sink.GraphiteSink
*.sink.graphite.server_host=10.x.x.x
*.sink.graphite.server_port=2003
datanode.sink.graphite.metrics_prefix=datanode
namenode.sink.graphite.metrics_prefix=namenode
resourcemanager.sink.graphite.metrics_prefix=resourcemanager
nodemanager.sink.graphite.metrics_prefix=nodemanager
jobhistoryserver.sink.graphite.metrics_prefix=jobhistoryserver
journalnode.sink.graphite.metrics_prefix=journalnode
maptask.sink.graphite.metrics_prefix=maptask
reducetask.sink.graphite.metrics_prefix=reducetask
applicationhistoryserver.sink.graphite.metrics_prefix=applicationhistoryserver
It works very well with one service (e.g. datanode). If I configure more than one, only 2 services show up in Graphite, and I cannot confirm that all metrics for those services are present.
Not knowing what metrics to expect and wanting to experiment, I do not want to filter on specific metrics to limit their number.
On the collectd side I can see one metric dropped (invalid), but one metric only. It does not account for all the rest. Furthermore, setting CollectInternalStats to true shows me that no metrics are dropped.
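To rule out the receiving side, it can help to push a single hand-made metric to the same host and port and check that it shows up. The following is a minimal sketch in Python, assuming the receiver accepts the Graphite plaintext protocol (one `<path> <value> <timestamp>` line per metric over TCP). The function names and the metric path `test.hadoop.probe` are made up for illustration.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one Graphite plaintext line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())  # Graphite expects epoch seconds
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(host, port, path, value, timestamp=None, timeout=5.0):
    """Open a TCP connection, send a single metric line, and close."""
    payload = format_metric(path, value, timestamp).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
    return payload


# Example (hypothetical host; replace with the real server_host/server_port):
# send_metric("10.x.x.x", 2003, "test.hadoop.probe", 1)
```

If the probe metric arrives but the Hadoop ones do not, the problem is more likely on the sending side than in the receiver.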
On the Hadoop side... Well, I could not find anything telling me whether metrics are actually sent or not, or whether sending succeeds or fails... Nothing is logged anywhere.
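One way to see what the daemons actually send, independently of Hadoop's logging, is to temporarily point server_host/server_port at a small debug listener and count incoming lines per prefix. This is only a sketch: it assumes the sink emits Graphite plaintext lines and that the configured metrics_prefix is the first dot-separated component of each path. All names below are illustrative.

```python
import socketserver
from collections import Counter


def parse_line(line):
    """Parse '<path> <value> <timestamp>'; return None if malformed."""
    parts = line.strip().split()
    if len(parts) != 3:
        return None
    path, value, timestamp = parts
    try:
        return path, float(value), int(float(timestamp))
    except ValueError:
        return None


def count_by_prefix(lines):
    """Count valid metric lines per first path component (assumed prefix)."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            counts[parsed[0].split(".", 1)[0]] += 1
    return counts


class PrefixLoggingHandler(socketserver.StreamRequestHandler):
    """Print each new prefix the first time a connection reports it."""

    def handle(self):
        seen = set()
        for raw in self.rfile:  # one metric per line, until the peer closes
            parsed = parse_line(raw.decode("utf-8", "replace"))
            if parsed is None:
                print("unparseable line:", raw[:80])
                continue
            prefix = parsed[0].split(".", 1)[0]
            if prefix not in seen:
                seen.add(prefix)
                print(f"{self.client_address[0]} reports prefix {prefix!r}")


def serve(host="0.0.0.0", port=2003):
    """Run the debug listener until interrupted (blocks)."""
    socketserver.ThreadingTCPServer.allow_reuse_address = True
    with socketserver.ThreadingTCPServer((host, port), PrefixLoggingHandler) as srv:
        srv.serve_forever()
```

Calling `serve()` on a spare host shows which prefixes each sender reports, and `count_by_prefix` can also be run over a capture of the traffic saved to a file.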
So my 2 questions are:
Context: hdp2.6 on AWS.
Created 09-27-2017 06:32 PM
You should be able to find DEBUG level messages in the individual Hadoop service logs; messages starting with org.apache.hadoop.metrics2.*
One config missing is:
# default sampling period
*.period=10
Created 09-28-2017 11:14 AM
Fair enough about the *.period. As I did get metrics there is probably a smart default, but nice to have.
I indeed found some messages in the service logs, and all looks good. To be honest, it all worked today.
I then happily applied the settings to prod, and lo and behold, I only have 2 metrics there.
Thinking it over, I understood that in hadoop-metrics2.properties I declare that I want, for instance, NodeManager metrics, but I then actually need to restart the NodeManagers to see those metrics. Indeed, the cluster I worked on yesterday had been rebooted (dev cluster, switched off at night).
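Since each daemon only picks up the new sink after a restart, it can be handy to list which daemon prefixes the properties file names. The sketch below is a hypothetical helper, not part of Hadoop. It assumes the file uses the `<daemon>.sink.<name>.<option>=<value>` layout shown in the question and ignores wildcard (`*`) entries.

```python
def daemons_with_sink(properties_text, sink="graphite"):
    """Return daemon prefixes having at least one '<daemon>.sink.<sink>.*' key."""
    marker = f".sink.{sink}."
    daemons = []
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without a key=value pair.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if marker not in key:
            continue
        prefix = key.split(marker, 1)[0]
        # '*' applies to every daemon, so it names no specific one.
        if prefix != "*" and prefix not in daemons:
            daemons.append(prefix)
    return daemons


sample = """
# default sampling period
*.period=10
*.sink.graphite.class=org.apache.hadoop.metrics2.sink.GraphiteSink
datanode.sink.graphite.metrics_prefix=datanode
nodemanager.sink.graphite.metrics_prefix=nodemanager
"""
print(daemons_with_sink(sample))  # daemons to restart after the change
```

Every daemon in that list needs a restart before its metrics can appear in Graphite.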
Now all works as expected.
Thanks!