Support Questions

CDH User · ‎10-03-2013

hi

I am installing CDH 4.4 on CentOS 6.3 (virtual machine, 2 vCPU/2GB MEM/20GB) using Cloudera Manager through installation path A. I have two problems that have troubled me for a long time:

1. The first run of the "create a temporary directory" step always fails, with no exception logged in /var/run/cloudera-scm-agent/process/17-hdfs-NAMENODE-createtmp/logs/stderr.log. Retrying later usually succeeds.

I followed the relevant information in https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/jlgfjQy56So and added 127.0.1.1 to my /etc/hosts, but the first attempt still fails.

2. When I reach the 'starting Cloudera Management Services' step, my virtual machine sustains high IOPS. This causes the virtual machine to stop responding, so the rest of the installation cannot complete.
I ran "vmstat 2" and found more than 30 processes waiting for CPU scheduling, and the following errors in /var/log/cloudera-scm-agent/cloudera-scm-agent.log:

[03/Oct/2013 21:07:23 +0000] 32181 Monitor-HostMonitor throttling_logger ERROR    (315 skipped) Failed to collect java-based DNS names
Traceback (most recent call last):
File "/usr/lib64/cmf/agent/src/cmf/monitor/host/dns_names.py", line 53, in collect
    result, stdout, stderr = self._subprocess_with_timeout(args, self._poll_timeout)
File "/usr/lib64/cmf/agent/src/cmf/monitor/host/dns_names.py", line 42, in _subprocess_with_timeout
    return subprocess_with_timeout(args, timeout)
File "/usr/lib64/cmf/agent/src/cmf/monitor/host/subprocess_timeout.py", line 40, in subprocess_with_timeout
    close_fds=True)
File "/usr/lib64/python2.6/subprocess.py", line 639, in __init__
    errread, errwrite)
File "/usr/lib64/python2.6/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
[03/Oct/2013 21:07:50 +0000] 32181 Monitor-DataNodeMonitor abstract_monitor ERROR    Error fetching metrics at 'http://cdh1.jsnewland.com:50075/jmx'
Traceback (most recent call last):
File "/usr/lib64/cmf/agent/src/cmf/monitor/abstract_monitor.py", line 252, in collect_metrics_from_url
    openedUrl = self.urlopen(url, username=username, password=password)
File "/usr/lib64/cmf/agent/src/cmf/monitor/abstract_monitor.py", line 234, in urlopen
    password=password)
File "/usr/lib64/cmf/agent/src/cmf/url_util.py", line 39, in urlopen_with_timeout
    return opener.open(url, data, timeout)
File "/usr/lib64/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
File "/usr/lib64/python2.6/urllib2.py", line 409, in _open
    '_open', req)
File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
File "/usr/lib64/python2.6/urllib2.py", line 1190, in http_open
    return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.6/urllib2.py", line 1165, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
[03/Oct/2013 21:07:50 +0000] 32181 Monitor-NameNodeMonitor abstract_monitor ERROR    Error fetching metrics at 'http://cdh1.jsnewland.com:50070/jmx'
Traceback (most recent call last):
File "/usr/lib64/cmf/agent/src/cmf/monitor/abstract_monitor.py", line 252, in collect_metrics_from_url
    openedUrl = self.urlopen(url, username=username, password=password)
File "/usr/lib64/cmf/agent/src/cmf/monitor/abstract_monitor.py", line 234, in urlopen
    password=password)
File "/usr/lib64/cmf/agent/src/cmf/url_util.py", line 39, in urlopen_with_timeout
    return opener.open(url, data, timeout)
File "/usr/lib64/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
File "/usr/lib64/python2.6/urllib2.py", line 409, in _open
    '_open', req)
File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
File "/usr/lib64/python2.6/urllib2.py", line 1190, in http_open
    return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.6/urllib2.py", line 1165, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
[03/Oct/2013 21:07:56 +0000] 32181 MainThread agent        ERROR    Heartbeating to 192.168.125.135:7182 failed.
Traceback (most recent call last):
File "/usr/lib64/cmf/agent/src/cmf/agent.py", line 747, in send_heartbeat
    response = self.requestor.request('heartbeat', dict(request=heartbeat))
File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 145, in request
    return self.issue_request(call_request, message_name, request_datum)
File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 256, in issue_request
    call_response = self.transceiver.transceive(call_request)
File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 485, in transceive
    result = self.read_framed_message()
File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 489, in read_framed_message
    response = self.conn.getresponse()
File "/usr/lib64/python2.6/httplib.py", line 990, in getresponse
    response.begin()
File "/usr/lib64/python2.6/httplib.py", line 391, in begin
    version, status, reason = self._read_status()
File "/usr/lib64/python2.6/httplib.py", line 349, in _read_status
    line = self.fp.readline()
File "/usr/lib64/python2.6/socket.py", line 433, in readline
    data = recv(1)
timeout: timed out

Although the log shows the connection timing out, I can still access the URL http://cdh1.jsnewland.com:50075/jmx through Mozilla Firefox on the host cdh1; the browser returns some JSON-formatted information.

Thanks.

Grizzly · ‎10-04-2013

(pasted from mail thread discussion)

2GB is going to make it tough to prevent the VM from swapping back and forth between disk and RAM. How much physical RAM is available on the machine you are running the VM on? We run with 4GB in the demo VM (that might be worth downloading and using to check things out).

Also, what do the following commands show in your VM?

# hostname

and

# ifconfig -a

and

# cat /etc/hosts

Grizzly · ‎10-04-2013

(pasted from mail thread discussion)

Thanks for your reply.

When the high disk IO fault occurs, there is still some free memory on my VM. This is what the "vmstat 2" command returns while the fault is occurring:

# vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 21 913364 52192 640 20384 212 411 673 441 266 521 8 3 67 22 0
1 23 912360 60864 496 19840 1398 572 1912 576 356 578 8 3 0 90 0
0 17 911292 57136 504 19892 2200 304 2256 304 340 621 4 3 0 93 0
1 17 909904 50180 536 22032 1530 18 2600 18 376 630 7 5 0 88 0
1 15 908268 49372 536 23460 1906 22 2614 22 341 643 3 3 0 95 0
2 19 906812 49084 544 25304 1838 0 3014 0 328 778 3 5 0 92 0
0 15 906036 49152 532 26032 1582 220 2500 220 297 540 3 5 0 92 0
3 16 908180 62844 536 23092 1460 2220 2286 2220 477 591 18 8 0 74 0
2 12 906608 58860 536 25644 1830 0 3120 0 440 603 10 11 0 79 0
3 21 904808 53412 536 26244 2370 0 2668 0 578 767 5 9 0 86 0

The physical machine has 6GB of memory. My CDH cluster has three hosts, all running on that physical machine. Cloudera Manager is installed on cdh1.

$ hostname
cdh1

$ ifconfig -a
eth0 Link encap:Ethernet
inet addr:192.168.125.135 Bcast:192.168.125.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fec3:dda3/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:97 errors:0 dropped:0 overruns:0 frame:0
TX packets:144 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:67253 (65.6 KiB) TX bytes:13328 (13.0 KiB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:15359 errors:0 dropped:0 overruns:0 frame:0
TX packets:15359 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:17372886 (16.5 MiB) TX bytes:17372886 (16.5 MiB)

$ cat /etc/hosts
#127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.125.135 cdh1.jsnewland.com cdh1
192.168.125.136 cdh2.jsnewland.com cdh2
192.168.125.137 cdh3.jsnewland.com cdh3

Grizzly · ‎10-04-2013

In your hosts file, do not comment out the loopback interface (127.0.0.1); leave it at its normal values. You can leave the IPv6 entry set as well. There is no need to comment out either one.
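As a rough way to check this, the sketch below parses hosts-file text and reports whether 127.0.0.1 still has an active (uncommented) localhost entry. The function names are made up for illustration, and the sample text is the /etc/hosts content posted earlier in this thread.

```python
def active_host_entries(hosts_text):
    """Map each IP address to its names, ignoring comments and blank lines."""
    entries = {}
    for line in hosts_text.splitlines():
        # Everything after '#' is a comment, so a line starting with '#' is inactive.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        entries.setdefault(ip, []).extend(names)
    return entries


def loopback_is_active(hosts_text):
    """True when 127.0.0.1 has an uncommented entry that names localhost."""
    return "localhost" in active_host_entries(hosts_text).get("127.0.0.1", [])


# The hosts file as posted in this thread (loopback lines commented out).
POSTED_HOSTS = """\
#127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.125.135 cdh1.jsnewland.com cdh1
192.168.125.136 cdh2.jsnewland.com cdh2
192.168.125.137 cdh3.jsnewland.com cdh3
"""

if __name__ == "__main__":
    print(loopback_is_active(POSTED_HOSTS))
```

To check a live system, pass the contents of /etc/hosts instead of the sample string.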

From the shell, run "getent hosts cdh1.jsnewland.com" and "getent hosts 192.168.125.135" to verify that name resolution is doing what you want. If it comes back with unexpected values, verify in your VMs that /etc/nsswitch.conf is set to "hosts: files dns" in that order, rather than "hosts: dns files".
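The lookup-order check can also be scripted. This is a minimal sketch, assuming the usual nsswitch.conf layout in which the line begins with "hosts:" followed by the sources in order; the function names are hypothetical.

```python
import os


def hosts_lookup_order(nsswitch_text):
    """Return the sources on the 'hosts:' line of nsswitch.conf, in order."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.startswith("hosts:"):
            tokens = line[len("hosts:"):].split()
            # Drop action items such as [NOTFOUND=return]; keep source names only.
            return [tok for tok in tokens if not tok.startswith("[")]
    return []


def files_before_dns(nsswitch_text):
    """True when 'files' is consulted before 'dns' (or 'dns' is not used)."""
    order = hosts_lookup_order(nsswitch_text)
    if "files" not in order:
        return False
    if "dns" not in order:
        return True
    return order.index("files") < order.index("dns")


if __name__ == "__main__":
    path = "/etc/nsswitch.conf"
    if os.path.exists(path):
        with open(path) as f:
            print(files_before_dns(f.read()))
```

This only inspects the configured order; "getent hosts" remains the way to see what a lookup actually returns.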

What host OS are you using for the VM? If you had an 8GB system, you would be much better off running a single 3 to 4GB VM. Keep in mind that the parent OS (especially if a GUI desktop is in use) needs memory of its own, including the overhead of running the actual VM servers and instances.

At this scale of physical system (6GB RAM), attempting to emulate a cluster of 3 x 2GB nodes is going to get in the way of your attempt to use Hadoop. Take a look at our example VM that is available for download; it is set up to run in a laptop/desktop configuration. The sample VM uses 4GB as its base memory configuration.
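The arithmetic behind that advice is simple enough to sketch. The helper below is hypothetical and deliberately naive: it assumes every VM is running at once with its full configured memory, and it ignores per-VM hypervisor overhead, so the real headroom is smaller than what it reports.

```python
def host_headroom_gb(physical_gb, vm_sizes_gb):
    """RAM left for the host OS and hypervisor after the VMs' configured memory."""
    return physical_gb - sum(vm_sizes_gb)


if __name__ == "__main__":
    # Three 2GB guests on a 6GB host leave nothing for the host itself.
    print(host_headroom_gb(6, [2, 2, 2]))
    # A single 4GB guest on the same host leaves some room.
    print(host_headroom_gb(6, [4]))
```

A result at or near zero means the host has to page to keep the guests running, on top of any swapping inside the guests themselves.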

For the vmstat information you provided, here is the breakdown of what it is telling you:

The attached file vmstat-test.txt is a test of making a path 12 times on a VM with 12GB RAM and 6GB swap configured, on a physical host with 128GB RAM. Note the differences from your output.

In the explanation of the vmstat column titles below, my "<---" tags indicate what you should focus on when evaluating vmstat output. Compare your vmstat to the test I did in the attached file.

You are swapping heavily. It's not a question of being out of swap (it would crash at that point); it's the volume of paging back and forth that is choking the VM. Below is your vmstat re-pasted:

# vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b   swpd   free   buff cache   si   so    bi    bo   in   cs us sy id wa st
2 21 913364 52192    640 20384 212 411   673   441 266 521 8 3 67 22 0
1 23 912360 60864    496 19840 1398 572 1912   576 356 578 8 3 0 90 0
0 17 911292 57136    504 19892 2200 304 2256   304 340 621 4 3 0 93 0
1 17 909904 50180    536 22032 1530   18 2600    18 376 630 7 5 0 88 0
1 15 908268 49372    536 23460 1906   22 2614    22 341 643 3 3 0 95 0
2 19 906812 49084    544 25304 1838    0 3014     0 328 778 3 5 0 92 0
0 15 906036 49152    532 26032 1582 220 2500   220 297 540 3 5 0 92 0
3 16 908180 62844    536 23092 1460 2220 2286 2220 477 591 18 8 0 74 0
2 12 906608 58860    536 25644 1830    0 3120     0 440 603 10 11 0 79 0
3 21 904808 53412    536 26244 2370    0 2668     0 578 767 5 9 0 86 0

Now to understand how to read the vmstat output.

Procs
       r: The number of processes waiting for run time.
       b: The number of processes in uninterruptible sleep. <<---

   Memory
       swpd: the amount of virtual memory used. <<----
       free: the amount of idle memory.
       buff: the amount of memory used as buffers.
       cache: the amount of memory used as cache.
       inact: the amount of inactive memory. (-a option)
       active: the amount of active memory. (-a option)

   Swap
       si: Amount of memory swapped in from disk (/s). <<<----
       so: Amount of memory swapped to disk (/s).         <<<----

   IO
       bi: Blocks received from a block device (blocks/s).
       bo: Blocks sent to a block device (blocks/s).

   System
       in: The number of interrupts per second, including the clock. <----
       cs: The number of context switches per second.    <----

   CPU
       These are percentages of total CPU time.
       us: Time spent running non-kernel code. (user time, including nice time)
       sy: Time spent running kernel code. (system time)
       id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
       wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle. <<<----
       st: Time stolen from a virtual machine. Prior to Linux 2.6.11, unknown.
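The column reading described above can be sketched as a small parser that averages the flagged columns (b, si, so, wa) and applies a rough test for swap thrashing. The function names and both thresholds are assumptions chosen for illustration, not values recommended by vmstat or Cloudera. The first data row is skipped because vmstat's first report gives averages since boot rather than a current sample.

```python
# The vmstat output posted in this thread, header plus ten samples.
SWAPPING_SAMPLE = """\
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 21 913364 52192 640 20384 212 411 673 441 266 521 8 3 67 22 0
1 23 912360 60864 496 19840 1398 572 1912 576 356 578 8 3 0 90 0
0 17 911292 57136 504 19892 2200 304 2256 304 340 621 4 3 0 93 0
1 17 909904 50180 536 22032 1530 18 2600 18 376 630 7 5 0 88 0
1 15 908268 49372 536 23460 1906 22 2614 22 341 643 3 3 0 95 0
2 19 906812 49084 544 25304 1838 0 3014 0 328 778 3 5 0 92 0
0 15 906036 49152 532 26032 1582 220 2500 220 297 540 3 5 0 92 0
3 16 908180 62844 536 23092 1460 2220 2286 2220 477 591 18 8 0 74 0
2 12 906608 58860 536 25644 1830 0 3120 0 440 603 10 11 0 79 0
3 21 904808 53412 536 26244 2370 0 2668 0 578 767 5 9 0 86 0
"""


def parse_vmstat(text):
    """Turn vmstat output into a list of {column: int} dicts, one per sample."""
    header, rows = None, []
    for line in text.splitlines():
        fields = line.split()
        if fields[:2] == ["r", "b"]:
            header = fields
        elif header and len(fields) >= len(header):
            # Ignore trailing columns such as a timestamp.
            values = fields[:len(header)]
            if all(v.isdigit() for v in values):
                rows.append(dict(zip(header, map(int, values))))
    return rows


def summarize(rows, columns=("b", "si", "so", "wa")):
    """Average the given columns, skipping the since-boot first row if possible."""
    samples = rows[1:] if len(rows) > 1 else rows
    if not samples:
        raise ValueError("no vmstat samples found")
    return {c: sum(r[c] for r in samples) / len(samples) for c in columns}


def looks_like_swap_thrashing(summary, min_swap_rate=100, min_wa=50):
    """Heuristic with assumed thresholds: sustained swap traffic plus high IO wait."""
    return (summary["si"] + summary["so"] >= min_swap_rate
            and summary["wa"] >= min_wa)


if __name__ == "__main__":
    summary = summarize(parse_vmstat(SWAPPING_SAMPLE))
    print(summary, looks_like_swap_thrashing(summary))
```

Feeding it the output from the attached vmstat-test.txt, where si and so stay at 0, should give the opposite verdict.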

Grizzly · ‎10-04-2013

(text from the attached file "vmstat-test.txt" - from the mail thread)

[root@cehd3 ~]# for i in {1..12}; do date ; echo "sudo -u hdfs hadoop fs -mkdir /foo$i";sudo -u hdfs hadoop fs -mkdir /foo$i; done
Fri Oct 4 09:57:24 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo1
Fri Oct 4 09:57:26 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo2
Fri Oct 4 09:57:27 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo3
Fri Oct 4 09:57:29 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo4
Fri Oct 4 09:57:31 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo5
Fri Oct 4 09:57:33 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo6
Fri Oct 4 09:57:35 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo7
Fri Oct 4 09:57:36 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo8
Fri Oct 4 09:57:38 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo9
Fri Oct 4 09:57:40 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo10
Fri Oct 4 09:57:42 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo11
Fri Oct 4 09:57:44 MDT 2013
sudo -u hdfs hadoop fs -mkdir /foo12

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ ---timestamp---
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 10875508 261608 560260 0 0 0 5 17 9 0 0 99 0 0 2013-10-04 09:57:22 MDT
1 0 0 10857096 261608 560260 0 0 0 42 299 519 8 1 91 0 0 2013-10-04 09:57:24 MDT
2 0 0 10842448 261608 560292 0 0 0 0 1150 845 94 5 1 0 0 2013-10-04 09:57:26 MDT
1 0 0 10835436 261608 560292 0 0 0 0 1051 838 96 4 0 0 0 2013-10-04 09:57:28 MDT
3 0 0 10826628 261608 560292 0 0 0 44 1105 874 92 7 1 0 0 2013-10-04 09:57:30 MDT
3 0 0 10820764 261608 560296 0 0 0 0 1058 858 96 4 0 0 0 2013-10-04 09:57:32 MDT
2 0 0 10815676 261608 560332 0 0 0 2 1118 899 94 6 1 0 0 2013-10-04 09:57:34 MDT
3 0 0 10794528 261608 560300 0 0 0 34 1118 838 95 5 0 0 0 2013-10-04 09:57:36 MDT
1 0 0 10777340 261608 560300 0 0 0 22 1086 823 95 4 1 0 0 2013-10-04 09:57:38 MDT
4 0 0 10864748 261608 560300 0 0 0 20 1123 964 93 7 1 0 0 2013-10-04 09:57:40 MDT
1 0 0 10849984 261608 560300 0 0 0 28 1027 791 95 5 0 0 0 2013-10-04 09:57:42 MDT
3 0 0 10829008 261608 560300 0 0 0 0 1037 946 93 6 1 0 0 2013-10-04 09:57:44 MDT

CDH User · ‎10-11-2013

Thanks to Todd for the help, and thanks to everyone!

After I increased my virtual machine's memory to 4GB, the installation succeeded and the heavy swapping to disk did not occur again.

To summarize the installation process: when a step fails during installation, in most cases retrying and waiting is enough for the installation to complete.

For example, the HDFS create /tmp failure generally succeeds on retry; there is no need to add 127.0.1.1 to /etc/hosts.

Thanks again.


Installing the Cloudera Management Services causes high disk IO