Installing the Cloudera Management Services causes high disk IO

New Contributor

Hi,

I am installing CDH 4.4 on CentOS 6.3 (virtual machine, 2 vCPU / 2GB MEM / 20GB) using Cloudera Manager via installation Path A. I have two problems that have troubled me for a long time:

1. The first run of the "create a temporary directory" step always fails, with no exception logged in /var/run/cloudera-scm-agent/process/17-hdfs-NAMENODE-createtmp/logs/stderr.log. Retrying later usually succeeds.

I followed the advice in https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/jlgfjQy56So and added 127.0.1.1 to my /etc/hosts, but the step still fails the first time.

2. When I reach the 'starting Cloudera Management Services' step, my virtual machine shows sustained high IOPS. This causes the virtual machine to stop responding, and the rest of the installation cannot complete.
  I ran "vmstat 2" and found more than 30 processes waiting for CPU scheduling. The following errors also appeared in /var/log/cloudera-scm-agent/cloudera-scm-agent.log:
 
[03/Oct/2013 21:07:23 +0000] 32181 Monitor-HostMonitor throttling_logger ERROR    (315 skipped) Failed to collect java-based DNS names
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/dns_names.py", line 53, in collect
    result, stdout, stderr = self._subprocess_with_timeout(args, self._poll_timeout)
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/dns_names.py", line 42, in _subprocess_with_timeout
    return subprocess_with_timeout(args, timeout)
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/subprocess_timeout.py", line 40, in subprocess_with_timeout
    close_fds=True)
  File "/usr/lib64/python2.6/subprocess.py", line 639, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
[03/Oct/2013 21:07:50 +0000] 32181 Monitor-DataNodeMonitor abstract_monitor ERROR    Error fetching metrics at 'http://cdh1.jsnewland.com:50075/jmx'
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/abstract_monitor.py", line 252, in collect_metrics_from_url
    openedUrl = self.urlopen(url, username=username, password=password)
  File "/usr/lib64/cmf/agent/src/cmf/monitor/abstract_monitor.py", line 234, in urlopen
    password=password)
  File "/usr/lib64/cmf/agent/src/cmf/url_util.py", line 39, in urlopen_with_timeout
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 1190, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.6/urllib2.py", line 1165, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
[03/Oct/2013 21:07:50 +0000] 32181 Monitor-NameNodeMonitor abstract_monitor ERROR    Error fetching metrics at 'http://cdh1.jsnewland.com:50070/jmx'
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/abstract_monitor.py", line 252, in collect_metrics_from_url
    openedUrl = self.urlopen(url, username=username, password=password)
  File "/usr/lib64/cmf/agent/src/cmf/monitor/abstract_monitor.py", line 234, in urlopen
    password=password)
  File "/usr/lib64/cmf/agent/src/cmf/url_util.py", line 39, in urlopen_with_timeout
    return opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 1190, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.6/urllib2.py", line 1165, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
[03/Oct/2013 21:07:56 +0000] 32181 MainThread agent        ERROR    Heartbeating to 192.168.125.135:7182 failed.
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/agent.py", line 747, in send_heartbeat
    response = self.requestor.request('heartbeat', dict(request=heartbeat))
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 145, in request
    return self.issue_request(call_request, message_name, request_datum)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 256, in issue_request
    call_response = self.transceiver.transceive(call_request)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 485, in transceive
    result = self.read_framed_message()
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 489, in read_framed_message
    response = self.conn.getresponse()
  File "/usr/lib64/python2.6/httplib.py", line 990, in getresponse
    response.begin()
  File "/usr/lib64/python2.6/httplib.py", line 391, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.6/httplib.py", line 349, in _read_status
    line = self.fp.readline()
  File "/usr/lib64/python2.6/socket.py", line 433, in readline
    data = recv(1)
timeout: timed out


   Although the log shows that the connection timed out, I can still access the URL http://cdh1.jsnewland.com:50075/jmx through Mozilla Firefox on the host cdh1; the browser returns some JSON-formatted information.

 

Thanks.

1 ACCEPTED SOLUTION

Master Collaborator

In your hosts file, do not comment out the loopback entry (127.0.0.1); leave it at its normal values. You can leave the IPv6 entry (::1) in place as well. There is no need to comment out either of them.
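
As a rough illustration of the point above, here is a small Python sketch that scans hosts-file text and reports loopback entries that are commented out or missing. The helper names (`parse_hosts`, `loopback_problems`) are made up for this example, and the sample text is the /etc/hosts content posted later in this thread. It only looks at the file text; it does not perform any name resolution.

```python
def parse_hosts(text):
    """Split hosts-file text into (active, commented) lists of (ip, names)."""
    active, commented = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        is_comment = line.startswith("#")
        # Drop the leading '#' (if any) and any trailing comment.
        fields = line.lstrip("#").split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        entry = (fields[0], fields[1:])
        (commented if is_comment else active).append(entry)
    return active, commented


def loopback_problems(text):
    """Report loopback addresses that have no active entry."""
    active, commented = parse_hosts(text)
    active_ips = {ip for ip, _ in active}
    commented_ips = {ip for ip, _ in commented}
    problems = []
    for ip in ("127.0.0.1", "::1"):
        if ip not in active_ips:
            state = "commented out" if ip in commented_ips else "missing"
            problems.append(f"{ip} is {state}")
    return problems


# The /etc/hosts content posted in this thread.
HOSTS = """\
#127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.125.135 cdh1.jsnewland.com cdh1
192.168.125.136 cdh2.jsnewland.com cdh2
192.168.125.137 cdh3.jsnewland.com cdh3
"""

for problem in loopback_problems(HOSTS):
    print(problem)
```

To check a live system, the same function could be fed the contents of /etc/hosts instead of the pasted sample.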

From a shell, run "getent hosts cdh1.jsnewland.com" and "getent hosts 192.168.125.135" to verify that name resolution is doing what you want.  If either returns unexpected values, verify in your VMs that /etc/nsswitch.conf has "hosts: files dns" in that order, rather than "hosts: dns files".
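
The nsswitch.conf ordering check can also be scripted. The following is only a sketch: the function names are invented for this example, and it works on text passed in rather than reading the file itself, so it can be tried without touching a real system.

```python
def hosts_lookup_order(nsswitch_text):
    """Return the list of sources on the 'hosts:' line, in lookup order."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if line.startswith("hosts:"):
            sources = line[len("hosts:"):].split()
            # Skip action items such as [NOTFOUND=return].
            return [s for s in sources if not s.startswith("[")]
    return []


def files_before_dns(order):
    """True when 'files' is consulted, and consulted before 'dns'."""
    if "files" not in order:
        return False
    return "dns" not in order or order.index("files") < order.index("dns")


good = "passwd: files\nhosts:      files dns\n"
bad = "hosts: dns files\n"

print(hosts_lookup_order(good), files_before_dns(hosts_lookup_order(good)))
print(hosts_lookup_order(bad), files_before_dns(hosts_lookup_order(bad)))
```

On a real VM you would pass in the contents of /etc/nsswitch.conf, for example `open("/etc/nsswitch.conf").read()`.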

What host OS are you using to run the VMs? If you had an 8GB system, you would be much better off running a single 3 to 4 GB VM.  Keep in mind that the parent OS (especially if a GUI desktop is in use) needs memory of its own, plus the overhead of running the actual VM instances.

At this scale of physical system (6GB RAM), attempting to emulate a cluster of 3 x 2GB nodes is going to get in the way of your attempt to use Hadoop.  Take a look at our example VM that is available for download; it is set up to run in a laptop/desktop configuration.  The sample VM uses 4GB as its base memory configuration.
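
To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python. The function name is made up, and the calculation deliberately ignores hypervisor overhead, which would only make the picture worse.

```python
def memory_left_for_host(physical_gb, vm_count, vm_gb):
    """GB of physical RAM not promised to guests (hypervisor overhead ignored)."""
    return physical_gb - vm_count * vm_gb


# Three 2GB guests on a 6GB machine: nothing left for the parent OS.
print(memory_left_for_host(6, 3, 2))

# One 4GB guest (the sample VM's base size) on the same machine.
print(memory_left_for_host(6, 1, 4))
```

With all physical RAM promised to the guests, the parent OS and the guests end up competing for memory, and that is where the paging comes from.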

For the vmstat information you provided, here is the breakdown of what it is telling you:

The attached file vmstat-test.txt is a test of making a path 12 times on a VM with 12GB RAM and 6GB swap configured, on a physical host with 128GB RAM. Note the differences from your output.

In the explanation of the vmstat column titles below, my "<---" tags mark what you should focus on when evaluating vmstat output.  Compare your vmstat to the test I did in the attached file.

You are swapping heavily.  It's not a question of being out of swap (it would crash at that point); it's the volume of paging back and forth that is choking the VM.  Below is your vmstat re-pasted:

# vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2 21 913364  52192    640  20384  212  411   673   441  266  521  8  3 67 22  0
 1 23 912360  60864    496  19840 1398  572  1912   576  356  578  8  3  0 90  0
 0 17 911292  57136    504  19892 2200  304  2256   304  340  621  4  3  0 93  0
 1 17 909904  50180    536  22032 1530   18  2600    18  376  630  7  5  0 88  0
 1 15 908268  49372    536  23460 1906   22  2614    22  341  643  3  3  0 95  0
 2 19 906812  49084    544  25304 1838    0  3014     0  328  778  3  5  0 92  0
 0 15 906036  49152    532  26032 1582  220  2500   220  297  540  3  5  0 92  0
 3 16 908180  62844    536  23092 1460 2220  2286  2220  477  591 18  8  0 74  0
 2 12 906608  58860    536  25644 1830    0  3120     0  440  603 10 11  0 79  0
 3 21 904808  53412    536  26244 2370    0  2668     0  578  767  5  9  0 86  0

Here is how to read the vmstat output:

   Procs
       r: The number of processes waiting for run time.
       b: The number of processes in uninterruptible sleep. <<---

   Memory
       swpd: the amount of virtual memory used. <<----
       free: the amount of idle memory.
       buff: the amount of memory used as buffers.
       cache: the amount of memory used as cache.
       inact: the amount of inactive memory. (-a option)
       active: the amount of active memory. (-a option)

   Swap
       si: Amount of memory swapped in from disk (/s).  <<<----
       so: Amount of memory swapped to disk (/s).         <<<----

   IO
       bi: Blocks received from a block device (blocks/s).
       bo: Blocks sent to a block device (blocks/s).

   System
       in: The number of interrupts per second, including the clock.  <----
       cs: The number of context switches per second.    <----

   CPU
       These are percentages of total CPU time.
       us: Time spent running non-kernel code. (user time, including nice time)
       sy: Time spent running kernel code. (system time)
       id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
       wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle.  <<<----
       st: Time stolen from a virtual machine. Prior to Linux 2.6.11, unknown.
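
If you want to check these columns without reading them by eye, a short Python sketch along the following lines could summarize a captured "vmstat 2" run. The function names and the thresholds (`si_limit`, `wa_limit`) are assumptions picked for illustration, not official guidance; the sample is the output pasted above.

```python
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2 21 913364  52192    640  20384  212  411   673   441  266  521  8  3 67 22  0
 1 23 912360  60864    496  19840 1398  572  1912   576  356  578  8  3  0 90  0
 0 17 911292  57136    504  19892 2200  304  2256   304  340  621  4  3  0 93  0
 1 17 909904  50180    536  22032 1530   18  2600    18  376  630  7  5  0 88  0
 1 15 908268  49372    536  23460 1906   22  2614    22  341  643  3  3  0 95  0
 2 19 906812  49084    544  25304 1838    0  3014     0  328  778  3  5  0 92  0
 0 15 906036  49152    532  26032 1582  220  2500   220  297  540  3  5  0 92  0
 3 16 908180  62844    536  23092 1460 2220  2286  2220  477  591 18  8  0 74  0
 2 12 906608  58860    536  25644 1830    0  3120     0  440  603 10 11  0 79  0
 3 21 904808  53412    536  26244 2370    0  2668     0  578  767  5  9  0 86  0
"""


def parse_vmstat(text):
    """Turn vmstat text into a list of {column: int} rows."""
    rows, header = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "procs":
            continue  # blank line or the banner line
        if fields[0] == "r":
            header = fields  # column-name line
            continue
        if header and len(fields) >= len(header) and fields[0].isdigit():
            # zip() stops at the header length, so a trailing timestamp is ignored.
            rows.append({name: int(value) for name, value in zip(header, fields)})
    return rows


def summarize(rows):
    """Average the columns flagged with arrows in the explanation above."""
    n = len(rows)
    return {
        "avg_blocked": sum(r["b"] for r in rows) / n,
        "avg_swap_in": sum(r["si"] for r in rows) / n,
        "avg_io_wait": sum(r["wa"] for r in rows) / n,
    }


def looks_like_thrashing(rows, si_limit=100, wa_limit=50):
    """Heuristic only: sustained swap-in together with high IO wait."""
    s = summarize(rows)
    return s["avg_swap_in"] > si_limit and s["avg_io_wait"] > wa_limit


rows = parse_vmstat(SAMPLE)
print(summarize(rows))
print("thrashing:", looks_like_thrashing(rows))
```

Running the same functions over the output in vmstat-test.txt should give zero swap-in and zero IO wait, which is the contrast to look for.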

REPLIES

Master Collaborator

(pasted from mail thread discussion)

 

With 2GB it is going to be tough to prevent the VM from swapping back and forth between disk and RAM. How much physical RAM is available on the machine you are running the VM on? We run with 4GB in the demo VM (that might be worth downloading and using to check things out).

Also, what do the following commands show in your VM?

# hostname

and

# ifconfig -a

and

# cat /etc/hosts 

Master Collaborator
(pasted from mail thread discussion)
 
    Thanks for your reply.
 
When the high disk IO fault occurs, there is still some free memory on my VM. Here is the output of "vmstat 2" while the fault is occurring:
 
# vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2 21 913364  52192    640  20384  212  411   673   441  266  521  8  3 67 22  0
 1 23 912360  60864    496  19840 1398  572  1912   576  356  578  8  3  0 90  0
 0 17 911292  57136    504  19892 2200  304  2256   304  340  621  4  3  0 93  0
 1 17 909904  50180    536  22032 1530   18  2600    18  376  630  7  5  0 88  0
 1 15 908268  49372    536  23460 1906   22  2614    22  341  643  3  3  0 95  0
 2 19 906812  49084    544  25304 1838    0  3014     0  328  778  3  5  0 92  0
 0 15 906036  49152    532  26032 1582  220  2500   220  297  540  3  5  0 92  0
 3 16 908180  62844    536  23092 1460 2220  2286  2220  477  591 18  8  0 74  0
 2 12 906608  58860    536  25644 1830    0  3120     0  440  603 10 11  0 79  0
 3 21 904808  53412    536  26244 2370    0  2668     0  578  767  5  9  0 86  0
 
 
The physical machine has 6GB of memory. My CDH cluster has three hosts, all running as VMs on that physical machine. Cloudera Manager is installed on cdh1.
 
$ hostname
cdh1
 
$ ifconfig -a
eth0      Link encap:Ethernet  
          inet addr:192.168.125.135  Bcast:192.168.125.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fec3:dda3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:97 errors:0 dropped:0 overruns:0 frame:0
          TX packets:144 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:67253 (65.6 KiB)  TX bytes:13328 (13.0 KiB)
 
lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:15359 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15359 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:17372886 (16.5 MiB)  TX bytes:17372886 (16.5 MiB)
 
$ cat /etc/hosts
#127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
 
192.168.125.135 cdh1.jsnewland.com cdh1
192.168.125.136 cdh2.jsnewland.com cdh2
192.168.125.137 cdh3.jsnewland.com cdh3

Master Collaborator

(text from the attached file "vmstat-test.txt" - from the mail thread) 

 

[root@cehd3 ~]# for i in {1..12}; do date ; echo "sudo -u hdfs hadoop fs -mkdir /foo$i";sudo -u hdfs hadoop fs -mkdir /foo$i; done
Fri Oct 4 09:57:24 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo1
Fri Oct 4 09:57:26 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo2
Fri Oct 4 09:57:27 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo3
Fri Oct 4 09:57:29 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo4
Fri Oct 4 09:57:31 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo5
Fri Oct 4 09:57:33 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo6
Fri Oct 4 09:57:35 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo7
Fri Oct 4 09:57:36 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo8
Fri Oct 4 09:57:38 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo9
Fri Oct 4 09:57:40 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo10
Fri Oct 4 09:57:42 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo11
Fri Oct 4 09:57:44 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo12


procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ ---timestamp---
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 10875508 261608 560260    0    0     0     5   17    9  0  0 99  0  0 2013-10-04 09:57:22 MDT
 1  0      0 10857096 261608 560260    0    0     0    42  299  519  8  1 91  0  0 2013-10-04 09:57:24 MDT
 2  0      0 10842448 261608 560292    0    0     0     0 1150  845 94  5  1  0  0 2013-10-04 09:57:26 MDT
 1  0      0 10835436 261608 560292    0    0     0     0 1051  838 96  4  0  0  0 2013-10-04 09:57:28 MDT
 3  0      0 10826628 261608 560292    0    0     0    44 1105  874 92  7  1  0  0 2013-10-04 09:57:30 MDT
 3  0      0 10820764 261608 560296    0    0     0     0 1058  858 96  4  0  0  0 2013-10-04 09:57:32 MDT
 2  0      0 10815676 261608 560332    0    0     0     2 1118  899 94  6  1  0  0 2013-10-04 09:57:34 MDT
 3  0      0 10794528 261608 560300    0    0     0    34 1118  838 95  5  0  0  0 2013-10-04 09:57:36 MDT
 1  0      0 10777340 261608 560300    0    0     0    22 1086  823 95  4  1  0  0 2013-10-04 09:57:38 MDT
 4  0      0 10864748 261608 560300    0    0     0    20 1123  964 93  7  1  0  0 2013-10-04 09:57:40 MDT
 1  0      0 10849984 261608 560300    0    0     0    28 1027  791 95  5  0  0  0 2013-10-04 09:57:42 MDT
 3  0      0 10829008 261608 560300    0    0     0     0 1037  946 93  6  1  0  0 2013-10-04 09:57:44 MDT

New Contributor
Thanks to Todd for the help, and thanks to everyone!
 
After I increased my virtual machine's memory to 4GB, the installation succeeded and the heavy swapping did not recur.
 
To summarize the installation experience: when a step fails during installation, in most cases retrying and waiting lets the installation complete.
For example, the HDFS "create /tmp" failure generally succeeds on retry; there is no need to add 127.0.1.1 to /etc/hosts.
 
Thanks again.