New Contributor
Posts: 9
Registered: ‎01-22-2016

/usr/lib64/cmf/agent/build/env/bin/flood using a lot of CPU on some of my cluster nodes

Hello,

 

On 2 nodes in a cluster of 13 nodes, the Cloudera program /usr/lib64/cmf/agent/build/env/bin/flood is using quite a lot of CPU and doing more than 200,000 context switches per second. Is that normal?

Resource usage is very small on all my other nodes.
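
(For reference, one way to watch the per-process context-switch counters, in case anyone wants to compare nodes; <flood-pid> is a placeholder for the flood process's PID:)

# sample voluntary/involuntary context switches every 5 seconds
pidstat -w -p <flood-pid> 5

# or read the raw counters from /proc
grep ctxt_switches /proc/<flood-pid>/status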

Expert Contributor
Posts: 71
Registered: ‎02-23-2018

Re: /usr/lib64/cmf/agent/build/env/bin/flood using a lot of CPU on some of my cluster nodes

Hi @ydastous,

 

Do you have high availability configured for your nodes/services?

 

Regards,

Manu. 

New Contributor
Posts: 9
Registered: ‎01-22-2016

Re: /usr/lib64/cmf/agent/build/env/bin/flood using a lot of CPU on some of my cluster nodes

Hello,

I do not think so. I am running Cloudera Express 5.14.4 on this cluster,
and these nodes have 4 roles:

Kafka Broker
Kafka MIrrorMaker
YARN
Spark2

On other nodes I have an 'HDFS Failover Controller' role; that is about as
much 'HA' as it gets, I think.

Yves
Expert Contributor
Posts: 71
Registered: ‎02-23-2018

Re: /usr/lib64/cmf/agent/build/env/bin/flood using a lot of CPU on some of my cluster nodes

Hi, @ydastous

 

If you are using the Fair Scheduler, try the configuration described here:

https://www.cloudera.com/documentation/enterprise/5-14-x/topics/admin_fair_scheduler.html

 

 

Regards,

Manu.

New Contributor
Posts: 9
Registered: ‎01-22-2016

Re: /usr/lib64/cmf/agent/build/env/bin/flood using a lot of CPU on some of my cluster nodes

Hello,

Do you have a specific property in mind?

At the moment I have this in yarn-site.xml:
`org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler`

and in fair-scheduler.xml (the XML markup was stripped when I posted; only the scheduling-policy values survived, all three set to `drf`).

Posts: 957
Topics: 1
Kudos: 228
Solutions: 121
Registered: ‎04-22-2014

Re: /usr/lib64/cmf/agent/build/env/bin/flood using a lot of CPU on some of my cluster nodes

@ydastous,

 

I'm not sure how YARN could relate to Cloudera Manager Agents' flood application.

 

"flood" is a service run by each agent that will serve up or download parcels.  It is not associated with YARN or even CDH itself for that matter.

 

If its CPU usage is high, it must be up to something, so we need to figure out what that is.  Let's start with the logs:

 

/var/log/cloudera-scm-agent/cloudera-flood.log

/var/run/cloudera-scm-agent/flood/stderr.log

 

The "cloudera-flood.log" is likely to have the most relevant information.

I would recommend comparing the logs from an agent where flood's CPU use is low with the logs from the busy agents, to see if there are any obvious differences.
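
(For example, something along these lines, run on a quiet node and on a busy one, gives a rough feel for how chatty flood is; it assumes each log line starts with the usual bracketed timestamp:)

wc -l /var/log/cloudera-scm-agent/cloudera-flood.log
awk '{print $1, $2}' /var/log/cloudera-scm-agent/cloudera-flood.log | uniq -c | tail -20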

 

If the logs don't seem to indicate anything special, we may need to turn to other tools like strace or pstack.

New Contributor
Posts: 9
Registered: ‎01-22-2016

Re: /usr/lib64/cmf/agent/build/env/bin/flood using a lot of CPU on some of my cluster nodes

So, no activity in the logs since I restarted the agent yesterday to see if
that would help.

On a 'normal' node, running:
timeout 60 strace -Cfp 23665


% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0      1317           select
  0.00    0.000000           0        12           sendmsg
  0.00    0.000000           0        24        12 recvmsg
  0.00    0.000000           0      1203       600 futex
  0.00    0.000000           0         1         1 restart_syscall
  0.00    0.000000           0       204           epoll_wait
  0.00    0.000000           0        12           epoll_ctl
  0.00    0.000000           0       251           timerfd_settime
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                  3024       613 total


On one of my crazy nodes:
timeout 60 strace -Cfp 9932
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 65.67    3.100996        4113       754           select
 32.74    1.545832        7887       196           epoll_wait
  1.58    0.074386           0    303971    151941 futex
  0.01    0.000421           2       225           timerfd_settime
  0.00    0.000083           2        48        24 recvmsg
  0.00    0.000046           2        24           sendmsg
  0.00    0.000014           1        24           epoll_ctl
------ ----------- ----------- --------- --------- ----------------

On both nodes, the PID I straced belongs to the flood process:
`root 9932 9461 65 Sep27 ? 15:18:08 python2.7 /usr/lib64/cmf/agent/build/env/bin/flood`


Yves



Posts: 957
Topics: 1
Kudos: 228
Solutions: 121
Registered: ‎04-22-2014

Re: /usr/lib64/cmf/agent/build/env/bin/flood using a lot of CPU on some of my cluster nodes

@ydastous,

 

I might suggest getting a few consecutive pstacks to get some specifics about what flood is doing.

The stats alone from the strace don't tell me much other than that selects are happening more often on the crazy host.  The time spent in "select" suggests that looking at which file descriptors those selects are operating on could be interesting.

 

Perhaps you can check the strace output to find out what is being selected.
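
(Something along these lines would do it, as a rough sketch; 9932 is just the example PID from your earlier strace:)

# grab a few consecutive stacks, a couple of seconds apart
for i in 1 2 3 4 5; do pstack 9932 > /tmp/flood-pstack.$i; sleep 2; done

# capture the actual select() calls rather than the -C summary, then map the fds to files/sockets
timeout 30 strace -f -e trace=select -p 9932 -o /tmp/flood-select.txt
ls -l /proc/9932/fd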

 

 

New Contributor
Posts: 9
Registered: ‎01-22-2016

Re: /usr/lib64/cmf/agent/build/env/bin/flood using a lot of CPU on some of my cluster nodes

I restarted the flood agent on the crazy node and did an strace at the same
time. One of the threads under flood does:

some basic initialization
some writes to the stdout log
and then goes futex-crazy

Here is the strace log:

set_robust_list(0x7f268a6d09e0, 24) = 0
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f2674000000
munmap(0x7f2678000000, 67108864) = 0
mprotect(0x7f2674000000, 135168, PROT_READ|PROT_WRITE) = 0
futex(0x10fd980, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xc774f0, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xc774f0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3477, ...}) = 0
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 1
fstat(3, {st_mode=S_IFREG|0644, st_size=299598, ...}) = 0
lseek(3, 299598, SEEK_SET) = 299598
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3477, ...}) = 0
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xc774f0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
write(3, "[02/Oct/2018 11:49:24 +0000] 223"..., 134) = 134
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xc774f0, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3477, ...}) = 0
futex(0xc774f0, FUTEX_WAKE_PRIVATE, 1) = 1
fstat(3, {st_mode=S_IFREG|0644, st_size=299732, ...}) = 0
lseek(3, 299732, SEEK_SET) = 299732
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3477, ...}) = 0
write(3, "[02/Oct/2018 11:49:24 +0000] 223"..., 138) = 138
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3477, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=299870, ...}) = 0
lseek(3, 299870, SEEK_SET) = 299870
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3477, ...}) = 0
write(3, "[02/Oct/2018 11:49:24 +0000] 223"..., 131) = 131
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3477, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=300001, ...}) = 0
lseek(3, 300001, SEEK_SET) = 300001
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3477, ...}) = 0
write(3, "[02/Oct/2018 11:49:24 +0000] 223"..., 135) = 135
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3477, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=300136, ...}) = 0
lseek(3, 300136, SEEK_SET) = 300136
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3477, ...}) = 0
write(3, "[02/Oct/2018 11:49:24 +0000] 223"..., 134) = 134
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3477, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=300270, ...}) = 0
lseek(3, 300270, SEEK_SET) = 300270
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3477, ...}) = 0
write(3, "[02/Oct/2018 11:49:24 +0000] 223"..., 109) = 109
futex(0x122849c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1538495364, 309278000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x1228470, FUTEX_WAKE_PRIVATE, 1) = 0

The last 2 lines repeat themselves for the rest of the trace.
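
(If it is useful, the file behind fd 3 in those write(3, ...) calls can be confirmed with something like the following; <flood-pid> is a placeholder:)

ls -l /proc/<flood-pid>/fd/3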

Yves