Explorer
Posts: 19
Registered: ‎08-30-2013

Bad health due to com.cloudera.cmon.agent.DnsTest timeout

Problems:

More and more data nodes are going into bad health in Cloudera Manager.

 

Clue 1:

There is no task or job running; this is just an idle data node:

#top

-bash-4.1$ top
top - 18:27:22 up  4:59,  3 users,  load average: 4.55, 3.52, 3.18
Tasks: 139 total,   1 running, 137 sleeping,   1 stopped,   0 zombie
Cpu(s): 14.8%us, 85.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   7932720k total,  1243372k used,  6689348k free,    52244k buffers
Swap:  6160376k total,        0k used,  6160376k free,   267228k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                 
13766 root      20   0 2664m  21m 7048 S 85.4  0.3 190:34.75 java                    
17688 root      20   0 2664m  19m 7048 S 75.5  0.3   1:05.97 java                    
12765 root      20   0 2859m  21m 7140 S 36.9  0.3 133:25.46 java                    
 2909 mapred    20   0 1894m 113m  14m S  1.0  1.5   2:55.26 java                    
 1850 root      20   0 1469m  62m 4436 S  0.7  0.8   2:54.53 python                  
 1332 root      20   0 50000 3000 2424 S  0.3  0.0   0:12.04 vmtoolsd                
 2683 hbase     20   0 1927m 152m  18m S  0.3  2.0   0:36.64 java    

 

Clue 2:

-bash-4.1$ ps -ef|grep 13766
root     13766  1850 99 16:01 ?        03:12:54 java -classpath /usr/share/cmf/lib/agent-4.6.3.jar com.cloudera.cmon.agent.DnsTest

 

Clue 3:

In cloudera-scm-agent.log:

[30/Aug/2013 16:01:58 +0000] 1850 Monitor-HostMonitor throttling_logger ERROR    Timeout with args ['java', '-classpath', '/usr/share/cmf/lib/agent-4.6.3.jar', 'com.cloudera.cmon.agent.DnsTest']
None
[30/Aug/2013 16:01:58 +0000] 1850 Monitor-HostMonitor throttling_logger ERROR    Failed to collect java-based DNS names
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/dns_names.py", line 53, in collect
    result, stdout, stderr = self._subprocess_with_timeout(args, self._poll_timeout)
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/dns_names.py", line 42, in _subprocess_with_timeout
    return SubprocessTimeout().subprocess_with_timeout(args, timeout)
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/subprocess_timeout.py", line 70, in subprocess_with_timeout
    raise Exception("timeout with args %s" % args)
Exception: timeout with args ['java', '-classpath', '/usr/share/cmf/lib/agent-4.6.3.jar', 'com.cloudera.cmon.agent.DnsTest']
"cloudera-scm-agent.log" line 30357 of 30357 --100%-- col 1

 

Background:

1. If I restart all nodes, everything is OK, but after half an hour or more, the nodes go into bad health one by one.

2. Version: Cloudera Standard 4.6.3 (#192 built by jenkins on 20130812-1221 git: fa61cf8559fbefeb5af7f223fd02164d1a0adfdb)

3. I added all nodes to /etc/hosts.

4. The installed CDH version is 4.3.1.

5. In fact, these nodes are VMs with fixed IP addresses.

 

Any suggestions?

 

BTW, where can I download the source code of com.cloudera.cmon.agent.DnsTest?
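I could not find it; my assumption is that it simply resolves the local hostname forward and backward and times it. In Python terms it would be roughly the following (purely my guess, not the actual DnsTest source):

# My guess at what a DNS self-check like DnsTest does: resolve the local
# hostname forward and backward and time it. Purely illustrative; not the
# actual com.cloudera.cmon.agent.DnsTest code.
import socket
import time

def dns_self_check():
    start = time.time()
    hostname = socket.getfqdn()                    # canonical host name
    address = socket.gethostbyname(hostname)       # forward lookup
    reverse = socket.gethostbyaddr(address)[0]     # reverse lookup
    return hostname, address, reverse, time.time() - start

if __name__ == '__main__':
    print(dns_self_check())

If any of these lookups hangs (for example a slow resolver, despite the /etc/hosts entries), the agent's timeout above would fire.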

 

 

Thanks.

 

Explorer
Posts: 19
Registered: ‎08-30-2013

Re: Bad health due to com.cloudera.cmon.agent.DnsTest timeout

Some updates:

 

It seems to be a bug: some of the Java helper processes spawned from the Python agent randomly never complete.

These java_version.sh and DnsTest invocations should finish quickly, so why are they still running and consuming so much CPU and memory?

 

See the info below:

#top

17688 root      20   0 2664m  21m 7048 S 88.8  0.3  20:04.49 java
13766 root      20   0 2664m  21m 7048 S 78.1  0.3 209:26.49 java
12765 root      20   0 2859m  21m 7140 S 29.9  0.3 142:54.04 java
 1850 root      20   0 1469m  62m 4436 S  1.3  0.8   3:06.87 python
 2909 mapred    20   0 1894m 115m  14m S  1.0  1.5   3:08.19 java
 2683 hbase     20   0 1927m 152m  18m S  0.7  2.0   0:38.83 java
 2518 hdfs      20   0 1883m 147m  14m S  0.3  1.9   0:43.64 java

-bash-4.1# ps -ef|grep 1850
root      1850     1  0 13:28 ?        00:03:11 /usr/lib64/cmf/agent/build/env/bin/python /usr/lib64/cmf/agent/src/cmf/agent.py --package_dir /usr/lib64/cmf/service --agent_dir /var/run/cloudera-scm-agent --logfile /var/log/cloudera-scm-agent/cloudera-scm-agent.log
root     12762  1850  0 15:23 ?        00:00:00 /bin/bash /usr/lib64/cmf/service/mgmt/java_version.sh
root     13766  1850 99 16:01 ?        03:36:15 java -classpath /usr/share/cmf/lib/agent-4.6.3.jar com.cloudera.cmon.agent.DnsTest
root     17688  1850 79 18:25 ?        00:26:56 java -classpath /usr/share/cmf/lib/agent-4.6.3.jar com.cloudera.cmon.agent.DnsTest
root     18584 15768  0 18:59 pts/2    00:00:00 grep 1850
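For reference, a quick way to list the agent's children and how long each has been running, using only ps and the standard library (a rough sketch; 1850 is the agent PID from the output above, so adjust it for your node):

# List the cloudera-scm-agent's child processes and their elapsed run times,
# to spot stuck DnsTest / java_version.sh invocations. Sketch only.
import subprocess

def agent_children(agent_pid):
    """Return (pid, elapsed, command) for each child of the given PID."""
    proc = subprocess.Popen(
        ['ps', '--ppid', str(agent_pid), '-o', 'pid=,etime=,args='],
        stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    rows = []
    for line in out.decode().splitlines():
        pid, elapsed, command = line.split(None, 2)
        rows.append((int(pid), elapsed, command))
    return rows

if __name__ == '__main__':
    for pid, elapsed, command in agent_children(1850):
        print(pid, elapsed, command)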

Is it a bug?

 

 

 

Cloudera Employee
Posts: 79
Registered: ‎08-29-2013

Re: Bad health due to com.cloudera.cmon.agent.DnsTest timeout

Hello,

I do believe this was a recently identified bug, yes. Allow me to find the
details and remediation and post back to you.

Cloudera Employee
Posts: 79
Registered: ‎08-29-2013

Re: Bad health due to com.cloudera.cmon.agent.DnsTest timeout

Hi qwert,

 

This was indeed recently found to be a bug condition and will be addressed in the next version of CM.

 

For the interim, please disable the two items shown in this screenshot and you'll avoid hitting the timeout error until the formal fix is available:

 

[Screenshot: disable_these_checks]

Image Detail:

===========

Hosts > Configuration > Search "resol"

- Host DNS Resolution Duration Thresholds: Set "Warning" and "Critical" to never.

- Hostname and Canonical Name Health Check: uncheck the box

- Save Changes

 

Regards,

--

Explorer
Posts: 19
Registered: ‎08-30-2013

Re: Bad health due to com.cloudera.cmon.agent.DnsTest timeout

@smark,


Thanks for your information.

I will try your temporary fix.

 

But I have some questions:

Q1:

If I change these configurations in the web console, does that stop the hung processes on the slave nodes, or does it only suppress the alert in the web console?

If it only suppresses the alert while the hung processes remain on the slave nodes, the problem is still there, since CPU and memory are consumed by these runaway processes.

 

Q2:

I also found other hung processes, shown below. Can I change some configuration for this item in the web console as well?

And I don't know whether there are any other hung processes that need to be fixed.

 

-bash-4.1# ps -ef|grep 24602
root 24602 24599 22 Aug30 ? 14:36:30 /usr/java/jdk1.6.0_31/bin/java -version

 

 

Here, I have opened a bug for this issue:

https://issues.cloudera.org/browse/CM-52?focusedCommentId=18391#comment-18391

 

 

Explorer
Posts: 19
Registered: ‎08-30-2013

Re: Bad health due to com.cloudera.cmon.agent.DnsTest timeout

Current status:

No more problems with DnsTest after applying Smark's suggestion,

but "java -version" still hangs with high CPU and memory usage, so we need to fix that issue ASAP.

Cloudera Employee
Posts: 11
Registered: ‎07-29-2013

Re: Bad health due to com.cloudera.cmon.agent.DnsTest timeout

I've never seen "java -version" hang before. Does it hang when you run it manually? Attaching strace would be useful to diagnose what's going on in your system. There's no easy way to turn that logic off at the moment.

Explorer
Posts: 19
Registered: ‎08-30-2013

Re: Bad health due to com.cloudera.cmon.agent.DnsTest timeout

It does not hang when I run java -version manually.

 

I have 10 datanodes; they all hung after several days.

 

-------------------------------------------

Another snapshot:

-bash-4.1$ top
top - 11:00:57 up 6 days, 22:04,  1 user,  load average: 1.05, 1.00, 1.00
Tasks: 129 total,   1 running, 128 sleeping,   0 stopped,   0 zombie
Cpu(s):  9.9%us, 41.2%sy,  0.0%ni, 48.5%id,  0.2%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   7932720k total,  7769284k used,   163436k free,   214848k buffers
Swap:  6160376k total,        0k used,  6160376k free,  1220472k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                 
 9259 root      20   0 2664m  19m 7048 S 99.4  0.3   6377:31 java                    
 1847 root      20   0 1469m  64m 3244 S  1.0  0.8 127:41.29 python                  
 8765 mapred    20   0 1922m 248m  14m S  1.0  3.2  88:45.19 java                    
 1329 root      20   0 50000 2144 1572 S  0.3  0.0   9:14.89 vmtoolsd                
 8404 hdfs      20   0 1895m 156m  13m S  0.3  2.0  24:02.40 java                    
 8995 impala    20   0 2804m 141m  10m S  0.3  1.8  19:59.23 impalad    

 

-------------------------------------------

root      9256  1847  0 Sep05 ?        00:00:00 /bin/bash /usr/lib64/cmf/service/mgmt/java_version.sh
root      9259  9256 99 Sep05 ?        4-10:17:38 /usr/java/jdk1.6.0_31/bin/java -version

 

 

-------------------------------------------

-bash-4.1# strace -p 9259
Process 9259 attached - interrupt to quit
futex(0x7f4f77ff19e0, FUTEX_WAIT, 9260, NULL

 

-------------------------------------------

-bash-4.1# /usr/java/default/bin/jstack -F 9259
Attaching to process ID 9259, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.6-b01
Deadlock Detection:

No deadlocks found.

Thread 9266: (state = IN_VM)


Thread 9265: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=118 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove() @bci=2, line=134 (Interpreted frame)
 - java.lang.ref.Finalizer$FinalizerThread.run() @bci=3, line=159 (Interpreted frame)


Thread 9264: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Object.wait() @bci=2, line=485 (Interpreted frame)
 - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=116 (Interpreted frame)


Thread 9260: (state = IN_VM)

 

 

 
