Created 12-24-2014 01:12 AM
Dear all,
Version: Cloudera Express 5.0.2
3 master nodes
15 workers
Problem:
"The health test result for DATA_NODE_WEB_METRIC_COLLECTION has become bad: The Cloudera Manager Agent is not able to communicate with this role's web server."
When the above alert pops up, records like the following are seen in the datanode logs:
"INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3121ms"
The alerts are thrown by a specific group of datanodes, not by all of them.
What could be the problem here?
Thanks in advance
Sergey
Created 12-24-2014 04:20 AM
Yes, one of my ideas is skewed data usage across the datanodes.
I explored the data usage of the nodes and noticed that the workers which trigger alerts have higher block usage.
Below is a comparison of the healthy nodes with the alerting ones.
sane group
Capacity | Used | Non DFS Used | Remaining | Blocks | Block pool used |
14.21 TB | 1.64 TB | 664.86 GB | 11.92 TB | 127220 | 1.64 TB (11.55%) |
14.21 TB | 6.14 TB | 666.38 GB | 7.42 TB | 639918 | 6.14 TB (43.23%) |
14.21 TB | 4.99 TB | 665.79 GB | 8.57 TB | 465164 | 4.99 TB (35.11%) |
14.21 TB | 7.06 TB | 666.4 GB | 6.49 TB | 795556 | 7.06 TB (49.71%) |
14.21 TB | 4.74 TB | 665.74 GB | 8.82 TB | 445655 | 4.74 TB (33.35%) |
14.21 TB | 7.95 TB | 666.13 GB | 5.61 TB | 907730 | 7.95 TB (55.96%) |
14.21 TB | 6.13 TB | 666.08 GB | 7.43 TB | 640631 | 6.13 TB (43.12%) |
group with issues
Capacity | Used | Non DFS Used | Remaining | Blocks | Block pool used |
10.65 TB | 8.96 TB | 500.07 GB | 1.2 TB | 1175053 | 8.96 TB (84.13%) |
10.65 TB | 8.57 TB | 499.76 GB | 1.59 TB | 1136687 | 8.57 TB (80.51%) |
14.21 TB | 8.94 TB | 666.97 GB | 4.62 TB | 1209608 | 8.94 TB (62.89%) |
10.65 TB | 8.65 TB | 500.16 GB | 1.5 TB | 1133144 | 8.65 TB (81.28%) |
14.21 TB | 8.98 TB | 665.07 GB | 4.58 TB | 1225707 | 8.98 TB (63.19%) |
10.65 TB | 8.62 TB | 499.82 GB | 1.54 TB | 1168257 | 8.62 TB (80.98%) |
10.65 TB | 8.94 TB | 499.75 GB | 1.22 TB | 1172198 | 8.94 TB (83.98%) |
Notably, the ill ones have more blocks in the block pool.
The heap size for the DataNode Default Group is 1 GB.
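(With over a million blocks per DataNode, a 1 GB heap is likely too small and would explain the JvmPauseMonitor GC pauses. Below is a rough sketch of raising the heap and enabling GC logging, assuming a plain hadoop-env.sh setup; the 4 GB value is an illustrative assumption, and in Cloudera Manager the heap is normally changed through the DataNode Java heap size setting instead.)

    # hadoop-env.sh (sketch; size the heap to the node's block count)
    # Rough rule of thumb: about 1 GB of DataNode heap per million blocks, plus headroom.
    export HADOOP_DATANODE_OPTS="-Xmx4g \
      -XX:+UseConcMarkSweepGC \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      -Xloggc:/var/log/hadoop-hdfs/datanode-gc.log \
      $HADOOP_DATANODE_OPTS"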
Created 12-24-2014 05:15 AM
Hi Guatam,
Yes, we run the balancer on a regular basis, but it seems we are hitting this bug. We have plans to upgrade the CM stack, but is the current issue related to the balancer bugs?
Is there some relation between skewed block distribution and the web metric alerts?
Thanks
Sergey
Created 12-26-2014 05:13 AM
From what I understand so far, the issue only appears on datanodes that hold a large number of blocks, far more than the healthy ones hold. This can be remedied by running the HDFS Balancer.
In CDH 5.x, bug HDFS-6621 affects balancer performance. It is fixed in the GA releases 5.1.4 and 5.2.0 (and later versions such as 5.3.0), but not in any 5.0.x version, so please consider upgrading to one of those releases for the fix.
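For reference, a minimal sketch of running the balancer from the command line; the threshold and bandwidth values are illustrative assumptions rather than recommendations:

    # Let each DataNode spend up to ~100 MB/s on balancing traffic (value is an assumption)
    sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth 104857600
    # Move blocks until every DataNode is within 10% of the cluster's average utilization
    sudo -u hdfs hdfs balancer -threshold 10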