DATA_NODE_WEB_METRIC_COLLECTION has become bad

szemlyanoy — Fri, 16 Sep 2022 09:16:25 GMT

Dear all,

Version: Cloudera Express 5.0.2
3 master nodes
15 workers

Problem:
"The health test result for DATA_NODE_WEB_METRIC_COLLECTION has become bad: The Cloudera Manager Agent is not able to communicate with this role's web server."

When the alert fires, records like the following appear in the DataNode logs:
"INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3121ms"

The alerts come from a specific group of DataNodes, not from all of them.

What can be the problem here?

Thanks in advance

Sergey

Re: DATA_NODE_WEB_METRIC_COLLECTION has become bad

GautamG — Wed, 24 Dec 2014 11:23:53 GMT

It is possible that the DataNode is handling more blocks, or dealing
with more traffic, than its heap allows. That would lead to frequent
full garbage collections, and the resulting JVM pauses can cause such events.

How many blocks do these datanodes have? What is the heap setting?
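
To check whether the alerts line up with long pauses, the JvmPauseMonitor records can be pulled out of a DataNode log. The following is a minimal sketch that assumes the message format quoted in the original post; the function names and the 1000 ms default threshold are illustrative choices, not anything defined by Hadoop or Cloudera Manager.

```python
import re

# Matches the message text quoted in the original post. Real log lines
# carry a timestamp prefix, which re.search simply skips over.
PAUSE_RE = re.compile(
    r"JvmPauseMonitor: Detected pause .*?pause of approximately (\d+)ms"
)


def pause_millis(line):
    """Return the reported pause in milliseconds, or None if the line is not a pause record."""
    match = PAUSE_RE.search(line)
    return int(match.group(1)) if match else None


def long_pauses(lines, threshold_ms=1000):
    """Collect pauses at or above threshold_ms (the default is an arbitrary, hypothetical cut-off)."""
    pauses = (pause_millis(line) for line in lines)
    return [p for p in pauses if p is not None and p >= threshold_ms]


sample = (
    "INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM "
    "or host machine (eg GC): pause of approximately 3121ms"
)
print(long_pauses([sample]))
```

Feeding it the lines of a log file (for example `long_pauses(open(path))`, where `path` is wherever your DataNode log lives) gives the list of pauses to compare against the alert timestamps.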

Re: DATA_NODE_WEB_METRIC_COLLECTION has become bad

szemlyanoy — Wed, 24 Dec 2014 12:20:18 GMT

Yes, one of my ideas is that data usage is skewed across the DataNodes.

I looked at the data usage per node and noticed that the workers that trigger alerts hold more blocks.

Below is a comparison of the sane nodes with the alerting ones.

sane group

Capacity   Used      Non DFS Used   Remaining   Blocks    Block pool used
14.21 TB   1.64 TB   664.86 GB      11.92 TB    127220    1.64 TB (11.55%)
14.21 TB   6.14 TB   666.38 GB      7.42 TB     639918    6.14 TB (43.23%)
14.21 TB   4.99 TB   665.79 GB      8.57 TB     465164    4.99 TB (35.11%)
14.21 TB   7.06 TB   666.4 GB       6.49 TB     795556    7.06 TB (49.71%)
14.21 TB   4.74 TB   665.74 GB      8.82 TB     445655    4.74 TB (33.35%)
14.21 TB   7.95 TB   666.13 GB      5.61 TB     907730    7.95 TB (55.96%)
14.21 TB   6.13 TB   666.08 GB      7.43 TB     640631    6.13 TB (43.12%)

group with issues

Capacity   Used      Non DFS Used   Remaining   Blocks    Block pool used
10.65 TB   8.96 TB   500.07 GB      1.2 TB      1175053   8.96 TB (84.13%)
10.65 TB   8.57 TB   499.76 GB      1.59 TB     1136687   8.57 TB (80.51%)
14.21 TB   8.94 TB   666.97 GB      4.62 TB     1209608   8.94 TB (62.89%)
10.65 TB   8.65 TB   500.16 GB      1.5 TB      1133144   8.65 TB (81.28%)
14.21 TB   8.98 TB   665.07 GB      4.58 TB     1225707   8.98 TB (63.19%)
10.65 TB   8.62 TB   499.82 GB      1.54 TB     1168257   8.62 TB (80.98%)
10.65 TB   8.94 TB   499.75 GB      1.22 TB     1172198   8.94 TB (83.98%)

Notably, the alerting nodes hold more blocks in the block pool.

Heap size for the DataNode Default Group: 1 GB
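
To put a number on that difference, here is a small sketch that compares the block counts of the two groups. The counts are copied from the figures in this post; the variable names are illustrative only.

```python
from statistics import mean

# Block counts per DataNode, copied from the two groups listed in this post.
healthy = [127220, 639918, 465164, 795556, 445655, 907730, 640631]
alerting = [1175053, 1136687, 1209608, 1133144, 1225707, 1168257, 1172198]

healthy_avg = mean(healthy)
alerting_avg = mean(alerting)
ratio = alerting_avg / healthy_avg

print(f"healthy avg blocks:  {healthy_avg:,.0f}")
print(f"alerting avg blocks: {alerting_avg:,.0f}")
print(f"alerting / healthy:  {ratio:.2f}x")

# True when every alerting node holds more blocks than every healthy node.
print("groups do not overlap:", min(alerting) > max(healthy))
```

On these figures the alerting nodes hold roughly twice as many blocks on average as the sane ones, and the two groups do not overlap, all under the same 1 GB heap setting.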

Re: DATA_NODE_WEB_METRIC_COLLECTION has become bad

GautamG — Wed, 24 Dec 2014 12:42:51 GMT

It might be best to run the HDFS Balancer on a regular basis to remedy this. If you're running CDH 5.0.x or CDH 5.1.[0-3], then consider upgrading to CDH 5.1.4 or CDH 5.2.0 for the fix to HDFS-6621.
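
As a rough illustration of what the Balancer would try to move, the sketch below applies its balancing rule to the figures posted above: a DataNode counts as balanced when its utilization (DFS used divided by capacity) is within the threshold of the cluster-wide utilization, with a default threshold of 10 percent. This is only an approximation, because the thread lists 14 of the 15 workers, and the names used here are illustrative.

```python
# (capacity_tb, dfs_used_tb) per DataNode, copied from the tables in the thread.
healthy = [(14.21, 1.64), (14.21, 6.14), (14.21, 4.99), (14.21, 7.06),
           (14.21, 4.74), (14.21, 7.95), (14.21, 6.13)]
alerting = [(10.65, 8.96), (10.65, 8.57), (14.21, 8.94), (10.65, 8.65),
            (14.21, 8.98), (10.65, 8.62), (10.65, 8.94)]

THRESHOLD_PCT = 10.0  # the Balancer's default threshold

# Cluster utilization, approximated from the 14 nodes listed in the thread.
nodes = healthy + alerting
cluster_pct = 100 * sum(used for _, used in nodes) / sum(cap for cap, _ in nodes)


def classify(capacity_tb, used_tb):
    """Label a node relative to the cluster average, using the Balancer's threshold rule."""
    node_pct = 100 * used_tb / capacity_tb
    if node_pct > cluster_pct + THRESHOLD_PCT:
        return "over"
    if node_pct < cluster_pct - THRESHOLD_PCT:
        return "under"
    return "balanced"


print(f"cluster utilization: {cluster_pct:.1f}%")
for label, group in (("healthy", healthy), ("alerting", alerting)):
    print(label, [classify(cap, used) for cap, used in group])
```

Note that the rule looks at disk utilization, not block count, so a node with many small blocks can count as balanced while still carrying a heavy block load.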

Re: DATA_NODE_WEB_METRIC_COLLECTION has become bad

szemlyanoy — Wed, 24 Dec 2014 13:15:08 GMT

Hi Gautam,

Yes, we run the balancer on a regular basis, but it seems we are hitting this bug. We plan to upgrade the CM stack, but is the current issue related to the balancer bug?

Is there some relation between skewed balancing and the web metrics alerts?

Thanks

Sergey

Re: DATA_NODE_WEB_METRIC_COLLECTION has become bad

GautamG — Fri, 26 Dec 2014 13:13:14 GMT

From what I understand so far, the issue only appears on DataNodes that hold a large number of blocks, far more than the healthy ones. This can be remedied by running the HDFS Balancer.

In CDH 5.x, bug HDFS-6621 affects balancer performance. It is fixed in the GA releases 5.1.4 and 5.2.0 (and in later versions such as 5.3.0). It is not fixed in any 5.0.x version, so please consider upgrading to one of the above releases for the fix.