07-28-2015 11:22 AM
I have been experimenting and googling for many hours, with no luck.
I have a spark streaming app that runs fine in a local spark cluster. Now I need to deploy it on cloudera 5.4.4. I need to be able to start it, have it run in the background continually, and be able to stop it.
I tried this:
$ spark-submit --master yarn-cluster --class MyMain my.jar myArgs
But it just prints these lines endlessly.
15/07/28 17:58:18 INFO Client: Application report for application_1438092860895_0012 (state: RUNNING)
15/07/28 17:58:19 INFO Client: Application report for application_1438092860895_0012 (state: RUNNING)
Question number 1: since it is a streaming app, it needs to run continuously. So how do I run it in a "background" mode? All the examples I can find of submitting spark jobs on yarn seem to assume that the application will do some work and terminate, and therefore that you would want to run it in the foreground. But that is not the case for streaming.
Next up... at this point the app does not seem to be functioning. I figure it could be a bug or misconfiguration on my part, so I tried to look in the logs to see what's happening:
$ yarn logs -applicationId application_1438092860895_0012
But it tells me :
/tmp/logs/hdfs/logs/application_1438092860895_0012 does not have any log files.
So question number 2: If the application is RUNNING, why does it have no log files?
So eventually I just had to kill it:
$ yarn application -kill application_1438092860895_0012
That brings up question number 3: assuming I can eventually get the app launched and running in the background, is "yarn application -kill" the preferred way of stopping it?
07-28-2015 01:33 PM
You can background the spark-submit process like any other linux process, by putting it into the background in the shell. In your case, the spark-submit job actually then runs the driver on YARN, so, it's baby-sitting a process that's already running asynchronously on another machine via YARN. Running is good; it means all is well. You can redirect this log output where you like.
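As a minimal sketch of backgrounding with redirection: under Spark's default log4j configuration the client log lines go to stderr, so redirecting stdout alone leaves them on the terminal. The `fake_submit` function below is a hypothetical stand-in for `spark-submit --master yarn-cluster ...`, and the log file name is an assumption.

```shell
# Hypothetical stand-in for spark-submit: like the Spark client under the
# default log4j configuration, it writes its log lines to stderr.
fake_submit() {
  for i in 1 2 3; do
    echo "INFO Client: Application report (state: RUNNING)" >&2
  done
}

log_file="submit.log"   # assumed log location

# Redirect stdout to the file, point stderr at the same place, then background.
fake_submit > "$log_file" 2>&1 &
submit_pid=$!
echo "client running in background with PID $submit_pid"

# Wait only so this demo can inspect the file; a real streaming job keeps running.
wait "$submit_pid"
line_count=$(grep -c "state: RUNNING" "$log_file")
echo "captured $line_count log lines in $log_file"
```

With the real command, the same `> file 2>&1 &` suffix would be appended to the `spark-submit` line.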
Killing the driver will cause YARN to restart it, in yarn-cluster mode. You want to kill the spark-submit process, really.
I don't know why you don't see logs. Try browsing to the Spark UI of the driver to see what's happening.
07-28-2015 02:45 PM
Thank you for this information. I am sure I will need it once I get the application problem resolved. For now I am just running in yarn-client mode so I can see the logs in stdout.
By the way I noticed that AFTER I kill the process the logs DO become available using the "yarn logs" command. Any idea why that would be? Does it buffer the stdout somewhere and only copy it over when the process is done?
P.S. Unfortunately I can't go through the GUI to look at the logs because of a whole can of worms that is not my department. (Namely we are running CDH on AWS and our corporate firewall won't let us access any external port other than 80, 443, and 22. YarnLogs runs on 8042, so we get blocked. It is a stupid setup, I know).
07-28-2015 10:59 PM
That I don't know. There should be something in the logs at startup, and that should be available pretty soon. I would expect you can see the logs with that command. It could be some other issue with the ports and so on, but then I think you'd see errors from YARN that it can't get to the AM container or something.
08-05-2015 12:56 PM
I'm pretty confused; I appreciate the help and I am willing to read documentation if you can point me to any, rather than having to learn through this forum.
You said I could put it in the background like any other linux job. So I tried, by appending an & to the end of my command:
$ spark-submit --master yarn-cluster --class MyMain my.jar myArgs &
It gave me a PID, like I expected, but then immediately took over my tty again and began spitting out log messages:
 7812
hdfs@ip-10-183-0-135:/home/ubuntu$ 15/08/05 19:15:47 INFO RMProxy: Connecting to ResourceManager at ip-10-183-0-48/10.183.0.48:8032
15/08/05 19:15:47 INFO Client: Requesting a new application from cluster with 6 NodeManagers
15/08/05 19:15:47 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/08/05 19:15:47 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/08/05 19:15:47 INFO Client: Setting up container launch context for our AM
15/08/05 19:15:47 INFO Client: Preparing resources for our AM container
...etc...
I also tried redirecting this output by appending > /dev/null but that didn't seem to help. It still writes logs into my tty.
The only thing I could find that kind of works is submitting it without the &, and then doing a ctrl-z.
15/08/05 19:23:07 INFO Client: Application report for application_1438787889069_0015 (state: RUNNING)
15/08/05 19:23:08 INFO Client: Application report for application_1438787889069_0015 (state: RUNNING)
^Z
Exit 1 spark-submit --master yarn-cluster --class MyMain my.jar myArgs
+ Stopped spark-submit --master yarn-cluster --class MyMain myjar myArgs
However if I put this suspended job into the background, it starts spitting out its log messages again:
hdfs@ip-10-183-0-135:/home/ubuntu$ bg
+ spark-submit --master yarn-cluster --class MyMain my.jar MyArgs
hdfs@ip-10-183-0-135:/home/ubuntu$ 15/08/05 19:46:18 INFO Client: Application report for application_1438787889069_0017 (state: RUNNING)
15/08/05 19:46:19 INFO Client: Application report for application_1438787889069_0017 (state: RUNNING)
15/08/05 19:46:20 INFO Client: Application report for application_1438787889069_0017 (state: RUNNING)
15/08/05 19:46:21 INFO Client: Application report for application_1438787889069_0017 (state: RUNNING)
...
At this point, ctrl-z has no effect.
Even though I don't understand this, I am happy to see that if I just suspend the spark-submit process with ctrl-z (but don't ever put it in the background) I can see that my job is running:
hdfs@ip-10-183-0-135:/home/ubuntu$ yarn application -list
...
Application-Id Application-Name
application_1438787889069_0015 MyMain
...
But something else I am seeing doesn't jibe with what I understand you to be saying. You said that if I wanted to stop it I couldn't use yarn application -kill, since the application would just get rescheduled, and that I should kill the spark-submit process instead. But in fact yarn application -kill does do the job:
hdfs@ip-10-183-0-135:/home/ubuntu$ yarn application -kill application_1438787889069_0015
Killing application application_1438787889069_0015
15/08/05 19:36:09 INFO impl.YarnClientImpl: Killed application application_1438787889069_0015
hdfs@ip-10-183-0-135:/home/ubuntu$ yarn application -list
15/08/05 19:36:17 INFO client.RMProxy: Connecting to ResourceManager at ip-10-183-0-48/10.183.0.48:8032
Total number of applications (application-types:  and states: [SUBMITTED, ACCEPTED, RUNNING]):0
Application-Id Application-Name
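A rough sketch of scripting that stop step: pull the application ID out of the client's "Application report" lines and pass it to `yarn application -kill`. The sample log text is copied from earlier in this thread. Reading it from a variable rather than a redirected log file is an assumption made so the snippet stands alone, and the `yarn` call is guarded so nothing is killed on a machine without the CLI.

```shell
# Sample client output copied from the thread; in practice this would come
# from the redirected spark-submit log.
sample_log='15/07/28 17:58:18 INFO Client: Application report for application_1438092860895_0012 (state: RUNNING)
15/07/28 17:58:19 INFO Client: Application report for application_1438092860895_0012 (state: RUNNING)'

# Take the first token shaped like application_<digits>_<digits>.
app_id=$(printf '%s\n' "$sample_log" | grep -o 'application_[0-9]*_[0-9]*' | head -n 1)
echo "application id: $app_id"

# Only call YARN if the CLI is actually installed on this machine.
if command -v yarn >/dev/null 2>&1; then
  yarn application -kill "$app_id"
else
  echo "would run: yarn application -kill $app_id"
fi
```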
So based on what you told me vs what I'm seeing... I'm pretty confused about what is going on and what is the right way to go about doing this. To recap, I just want to be able to issue a command to submit the job, and later, issue a command to stop the job.
As I said, I am happy to read docs if you can point me in the right direction. I have just been unable to find a step-by-step explanation of exactly how to do this.
09-01-2015 12:14 AM
Use "nohup spark-submit <parameters> 2>&1 < /dev/null &"
With nohup, the Spark Streaming logs are not printed to the terminal (when stdout is a terminal, nohup appends the output to nohup.out instead), and the job runs as a background process.
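A small sketch of the same idea with an explicit output file instead of relying on nohup.out. Because `nohup` needs a real executable, `sh -c '...'` is used here as a hypothetical stand-in for `spark-submit <parameters>`, and the file name is an assumption.

```shell
out_file="stream-submit.out"   # assumed output file

# Stand-in for:
#   nohup spark-submit <parameters> > "$out_file" 2>&1 < /dev/null &
# The stand-in writes to stderr, as the Spark client does by default.
nohup sh -c 'echo "INFO Client: Application report (state: RUNNING)" >&2' \
  > "$out_file" 2>&1 < /dev/null &
bg_pid=$!
echo "started background process $bg_pid"

# Wait only so this demo can show the captured output.
wait "$bg_pid"
cat "$out_file"
```

Since stdin, stdout and stderr are all detached from the terminal, the process should keep running after the shell session ends.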
05-08-2017 10:53 AM
Since Spark 1.6.1, spark-submit has a no-wait option. I bet many people faced the same problem :)
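A hedged sketch of what that might look like, assuming the option meant here is the `spark.yarn.submit.waitAppCompletion` configuration property (the post does not name it). The class, jar and arguments are the placeholders used earlier in this thread, and the command is only printed when `spark-submit` is not installed.

```shell
# Assumption: the "no wait" option is the spark.yarn.submit.waitAppCompletion
# property, passed with --conf; MyMain, my.jar and myArgs are placeholders.
submit_cmd="spark-submit --master yarn-cluster --conf spark.yarn.submit.waitAppCompletion=false --class MyMain my.jar myArgs"

if command -v spark-submit >/dev/null 2>&1; then
  # With the property set to false, the client is expected to return after
  # submitting instead of polling the application state.
  $submit_cmd
else
  echo "would run: $submit_cmd"
fi
```

The application would then be stopped later with `yarn application -kill <application id>`, as tried earlier in the thread.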