Member since 07-28-2015 | 11 Posts | 0 Kudos Received | 1 Solution
09-02-2015 01:20 PM
I was installing it locally on Mac OS X for testing purposes, so I needed the tarball. Eventually I found a random blog post that pointed me here: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_vd_cdh_package_tarball.html So I got what I needed. Thanks anyway! 🙂
09-01-2015 01:30 PM
Am I blind? I see no link or button to download CDH 5.4.5 at http://www.cloudera.com/content/cloudera/en/downloads/cdh/cdh-5-4-5.html
08-05-2015 12:56 PM
I'm pretty confused. I appreciate the help, and I am willing to read documentation if you can point me to any, rather than having to learn through this forum.

You said I could put it in the background like any other Linux job. So I tried, by appending an & to the end of my command:

$ spark-submit --master yarn-cluster --class MyMain my.jar myArgs &

It gave me a PID, as I expected, but then immediately took over my tty again and began spitting out log messages:

[3] 7812
hdfs@ip-10-183-0-135:/home/ubuntu$ 15/08/05 19:15:47 INFO RMProxy: Connecting to ResourceManager at ip-10-183-0-48/10.183.0.48:8032
15/08/05 19:15:47 INFO Client: Requesting a new application from cluster with 6 NodeManagers
15/08/05 19:15:47 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/08/05 19:15:47 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/08/05 19:15:47 INFO Client: Setting up container launch context for our AM
15/08/05 19:15:47 INFO Client: Preparing resources for our AM container
...etc...

I also tried redirecting this output by appending > /dev/null, but that didn't seem to help. It still writes logs to my tty. The only thing I could find that kind of works is submitting it without the &, and then doing a ctrl-z:

15/08/05 19:23:07 INFO Client: Application report for application_1438787889069_0015 (state: RUNNING)
15/08/05 19:23:08 INFO Client: Application report for application_1438787889069_0015 (state: RUNNING)
^Z[4] Exit 1 spark-submit --master yarn-cluster --class MyMain my.jar myArgs
[5]+ Stopped spark-submit --master yarn-cluster --class MyMain myjar myArgs

However, if I put this suspended job into the background, it starts spitting out its log messages again:

hdfs@ip-10-183-0-135:/home/ubuntu$ bg
[5]+ spark-submit --master yarn-cluster --class MyMain my.jar MyArgs
hdfs@ip-10-183-0-135:/home/ubuntu$ 15/08/05 19:46:18 INFO Client: Application report for application_1438787889069_0017 (state: RUNNING)
15/08/05 19:46:19 INFO Client: Application report for application_1438787889069_0017 (state: RUNNING)
15/08/05 19:46:20 INFO Client: Application report for application_1438787889069_0017 (state: RUNNING)
15/08/05 19:46:21 INFO Client: Application report for application_1438787889069_0017 (state: RUNNING)
...

At this point, ctrl-z has no effect. Even though I don't understand this, I am happy to see that if I just suspend the spark-submit process with ctrl-z (but never put it in the background), my job is still running:

hdfs@ip-10-183-0-135:/home/ubuntu$ yarn application -list
...
Application-Id Application-Name
application_1438787889069_0015 MyMain
...

But something else I am seeing doesn't jibe with what I understand you to be saying. You said that if I wanted to stop it I couldn't use yarn application -kill, since it would just get rescheduled; rather, I should kill the spark-submit process. But in fact yarn application -kill does do the job:

hdfs@ip-10-183-0-135:/home/ubuntu$ yarn application -kill application_1438787889069_0015
Killing application application_1438787889069_0015
15/08/05 19:36:09 INFO impl.YarnClientImpl: Killed application application_1438787889069_0015
hdfs@ip-10-183-0-135:/home/ubuntu$ yarn application -list
15/08/05 19:36:17 INFO client.RMProxy: Connecting to ResourceManager at ip-10-183-0-48/10.183.0.48:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
Application-Id Application-Name

So based on what you told me versus what I'm seeing, I'm pretty confused about what is going on and what the right way to do this is.

To recap, I just want to be able to issue a command to submit the job and, later, issue a command to stop the job. As I said, I am happy to read docs if you can point me in the right direction; I have just been unable to find a step-by-step explanation of exactly how to do this.
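
A likely reason the `> /dev/null` redirect did not help: Spark's default log4j configuration sends console logging to stderr, and `>` on its own redirects only stdout. The following is a minimal sketch under that assumption, not a definitive recipe. `submit_detached` and the log file path are hypothetical names for illustration; only the `spark-submit` arguments come from the post.

```shell
#!/usr/bin/env bash

# submit_detached LOGFILE [SPARK_SUBMIT_ARGS...]   (hypothetical helper)
# Starts spark-submit in the background with BOTH stdout and stderr sent
# to LOGFILE, then prints the PID of the local client process.
submit_detached() {
  local logfile=$1
  shift
  # "> file" redirects stdout only; "2>&1" then points stderr at the same file.
  # nohup keeps the local client alive if the terminal is closed.
  nohup spark-submit "$@" > "$logfile" 2>&1 &
  echo "$!"
}

# Usage (arguments as in the post; the log path is an assumption):
#   submit_detached /tmp/spark-submit.log --master yarn-cluster --class MyMain my.jar myArgs
```

In yarn-cluster mode the driver runs on the cluster, so the local process started here only reports status. Stopping the job itself is done with `yarn application -kill` and the application ID, as the post observes.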
07-28-2015 02:45 PM
Thank you for this information. I am sure I will need it once I get the application problem resolved. For now I am just running in yarn-client mode so I can see the logs in stdout.

By the way, I noticed that AFTER I kill the process, the logs DO become available using the "yarn logs" command. Any idea why that would be? Does it buffer the stdout somewhere and only copy it over when the process is done?

P.S. Unfortunately I can't go through the GUI to look at the logs because of a whole can of worms that is not my department. Namely, we are running CDH on AWS and our corporate firewall won't let us access any external port other than 80, 443, and 22. The YARN logs UI runs on 8042, so we get blocked. It is a stupid setup, I know.
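
A likely explanation for the behavior described above: when YARN log aggregation is enabled, container logs are copied to HDFS once the application reaches a terminal state, and `yarn logs` reads that aggregated copy. The sketch below waits for a terminal state before fetching logs. `app_state` and `logs_when_done` are hypothetical helper names, and the parsing assumes that `yarn application -status` prints a line of the form `State : RUNNING`.

```shell
#!/usr/bin/env bash

# app_state APP_ID   (hypothetical helper)
# Prints the YARN state of the application, e.g. RUNNING or KILLED.
# Assumes the status report contains a line like "State : RUNNING".
app_state() {
  yarn application -status "$1" 2>/dev/null |
    awk -F' : ' '$1 ~ /^[[:space:]]*State$/ { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}

# logs_when_done APP_ID [POLL_SECONDS]   (hypothetical helper)
# Polls until the application has finished, then prints its aggregated logs.
logs_when_done() {
  local app_id=$1 interval=${2:-10} state
  while :; do
    state=$(app_state "$app_id")
    case $state in
      FINISHED|FAILED|KILLED) break ;;
      '') echo "could not read state of $app_id" >&2; return 1 ;;
    esac
    sleep "$interval"
  done
  yarn logs -applicationId "$app_id"
}

# Usage: logs_when_done application_1438092860895_0012
```

For a streaming job that never finishes on its own, this only returns after the application is killed or fails, which matches what the post reports.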
07-28-2015 11:22 AM
I have been experimenting and googling for many hours, with no luck. I have a Spark Streaming app that runs fine in a local Spark cluster. Now I need to deploy it on Cloudera 5.4.4. I need to be able to start it, have it run in the background continually, and be able to stop it.

I tried this:

$ spark-submit --master yarn-cluster --class MyMain my.jar myArgs

But it just prints these lines endlessly:

15/07/28 17:58:18 INFO Client: Application report for application_1438092860895_0012 (state: RUNNING)
15/07/28 17:58:19 INFO Client: Application report for application_1438092860895_0012 (state: RUNNING)

Question number 1: since it is a streaming app, it needs to run continuously. So how do I run it in a "background" mode? All the examples I can find of submitting Spark jobs on YARN seem to assume that the application will do some work and terminate, and therefore that you would want to run it in the foreground. But that is not the case for streaming.

Next up: at this point the app does not seem to be functioning. I figure it could be a bug or misconfiguration on my part, so I tried to look in the logs to see what's happening:

$ yarn logs -applicationId application_1438092860895_0012

But it tells me:

/tmp/logs/hdfs/logs/application_1438092860895_0012does not have any log files.

So question number 2: if the application is RUNNING, why does it have no log files?

Eventually I just had to kill it:

$ yarn application -kill application_1438092860895_0012

That brings up question number 3: assuming I can eventually get the app launched and running in the background, is "yarn application -kill" the preferred way of stopping it?
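
Because the application ID has to be retyped for both `yarn logs` and `yarn application -kill`, one option is to look it up by name instead. This is a minimal sketch that assumes the application name contains no spaces and that `yarn application -list` prints the ID and the name as its first two columns. `app_id_by_name` and `stop_by_name` are hypothetical helper names.

```shell
#!/usr/bin/env bash

# app_id_by_name NAME   (hypothetical helper)
# Prints the IDs of listed applications whose name is exactly NAME.
app_id_by_name() {
  yarn application -list 2>/dev/null |
    awk -v name="$1" '$1 ~ /^application_/ && $2 == name { print $1 }'
}

# stop_by_name NAME   (hypothetical helper)
# Kills every listed application called NAME.
stop_by_name() {
  local id
  for id in $(app_id_by_name "$1"); do
    yarn application -kill "$id"
  done
}

# Usage: stop_by_name MyMain
```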
Labels: Apache Spark, Apache YARN, HDFS