Created 12-24-2014 01:12 AM
Dear all,
Version: Cloudera Express 5.0.2
3 master nodes
15 workers
Problem:
"The health test result for DATA_NODE_WEB_METRIC_COLLECTION has become bad: The Cloudera Manager Agent is not able to communicate with this role's web server."
When the above alert pops up, records like the following are seen in the datanode logs:
"INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3121ms"
The alerts are thrown by a specific group of datanodes, not by all of them.
What could be the problem here?
Thanks in advance
Sergey
Created 12-24-2014 04:20 AM
Yes, one of my ideas is skewed data usage across the datanodes.
I explored the data usage of the nodes and noticed that the workers which trigger alerts have higher block usage.
Below is a comparison of the healthy nodes with the alerting ones.
sane group
Capacity | Used | Non DFS Used | Remaining | Blocks | Block pool used |
14.21 TB | 1.64 TB | 664.86 GB | 11.92 TB | 127220 | 1.64 TB (11.55%) |
14.21 TB | 6.14 TB | 666.38 GB | 7.42 TB | 639918 | 6.14 TB (43.23%) |
14.21 TB | 4.99 TB | 665.79 GB | 8.57 TB | 465164 | 4.99 TB (35.11%) |
14.21 TB | 7.06 TB | 666.4 GB | 6.49 TB | 795556 | 7.06 TB (49.71%) |
14.21 TB | 4.74 TB | 665.74 GB | 8.82 TB | 445655 | 4.74 TB (33.35%) |
14.21 TB | 7.95 TB | 666.13 GB | 5.61 TB | 907730 | 7.95 TB (55.96%) |
14.21 TB | 6.13 TB | 666.08 GB | 7.43 TB | 640631 | 6.13 TB (43.12%) |
group with issues
Capacity | Used | Non DFS Used | Remaining | Blocks | Block pool used |
10.65 TB | 8.96 TB | 500.07 GB | 1.2 TB | 1175053 | 8.96 TB (84.13%) |
10.65 TB | 8.57 TB | 499.76 GB | 1.59 TB | 1136687 | 8.57 TB (80.51%) |
14.21 TB | 8.94 TB | 666.97 GB | 4.62 TB | 1209608 | 8.94 TB (62.89%) |
10.65 TB | 8.65 TB | 500.16 GB | 1.5 TB | 1133144 | 8.65 TB (81.28%) |
14.21 TB | 8.98 TB | 665.07 GB | 4.58 TB | 1225707 | 8.98 TB (63.19%) |
10.65 TB | 8.62 TB | 499.82 GB | 1.54 TB | 1168257 | 8.62 TB (80.98%) |
10.65 TB | 8.94 TB | 499.75 GB | 1.22 TB | 1172198 | 8.94 TB (83.98%) |
Notably, the ill ones have more blocks in the block pool.
The heap size for the DataNode Default Group is 1 GB.
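(With over a million blocks per DataNode, a 1 GB heap is likely too small and would explain the JvmPauseMonitor GC pauses. Below is a rough sketch of raising the heap and enabling GC logging, assuming a plain hadoop-env.sh setup; the 4 GB value is an illustrative assumption, and in Cloudera Manager the heap is normally changed through the DataNode Java heap size setting instead.)

    # hadoop-env.sh (sketch; size the heap to the node's block count)
    # Rough rule of thumb: about 1 GB of DataNode heap per million blocks, plus headroom.
    export HADOOP_DATANODE_OPTS="-Xmx4g \
      -XX:+UseConcMarkSweepGC \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      -Xloggc:/var/log/hadoop-hdfs/datanode-gc.log \
      $HADOOP_DATANODE_OPTS"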
Created 12-24-2014 05:15 AM
Hi Guatam,
Yes, we run the balancer on a regular basis, but it seems we are hitting this bug. We have plans to upgrade the CM stack, but is the current issue related to the balancer bugs?
Is there some relation between skewed block distribution and the web metric alerts?
Thanks
Sergey
Created 12-26-2014 05:13 AM
From what I understand so far, the issue only appears on datanodes that hold a large number of blocks, far more than the healthy ones hold. This can be remedied by running the HDFS Balancer.
In CDH 5.x, bug HDFS-6621 affects balancer performance. It is fixed in the GA releases 5.1.4 and 5.2.0 (and later versions such as 5.3.0), but not in any 5.0.x version, so please consider upgrading to one of those releases for the fix.
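For reference, a minimal sketch of running the balancer from the command line; the threshold and bandwidth values are illustrative assumptions rather than recommendations:

    # Let each DataNode spend up to ~100 MB/s on balancing traffic (value is an assumption)
    sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth 104857600
    # Move blocks until every DataNode is within 10% of the cluster's average utilization
    sudo -u hdfs hdfs balancer -threshold 10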