We use Oozie to orchestrate the flow of our application. The workflow includes Spark, Java and Bash actions.
We want to aggregate all the logs in one location using the ELK stack.
The problems are two-fold.
1) Getting the logs while the workflow is still running.
2) Low retention times for YARN logs after the application reaches the completed state, which leaves a very small window to collect the logs before they are lost.
Does anyone have any experience with this, and can you offer any suggestions?
Also, please note we do not have admin rights on the cluster to change any configurations.
In an ideal world we would want the Oozie, Yarn and application logs to be streamed as they occur.
There are several solutions we're considering:
Solution 1:
1) Using the Oozie REST API, poll for RUNNING workflows and get the ApplicationIds of the YARN jobs.
2) Using the YARN REST API, poll for task attempts and fetch the logs on the nodes via the API.
3) Dump the logs somewhere on disk.
4) Use Filebeat to listen to that logs path.
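The Oozie REST polling described above could be sketched roughly like this. This is only a sketch: the Oozie host/port and the helper names are assumptions, and it relies on Oozie reporting each action's launcher id in the `externalId` field (often in `job_...` form, which maps one-to-one to `application_...`):

```python
# Poll the Oozie REST API for RUNNING workflows and pull the YARN
# ApplicationIds out of their actions. OOZIE_URL is an assumption --
# point it at your own Oozie server.
import json
from urllib.request import urlopen

OOZIE_URL = "http://oozie-host:11000/oozie"  # hypothetical host


def running_workflows_url(base):
    """URL that lists workflow jobs filtered to RUNNING status."""
    return f"{base}/v1/jobs?jobtype=wf&filter=status%3DRUNNING"


def extract_app_ids(job_info):
    """Pull YARN application ids out of an Oozie job-info response.

    Each action's 'externalId' usually holds the launcher job id,
    either as 'job_...' (renamed to 'application_...') or directly
    as 'application_...'.
    """
    ids = []
    for action in job_info.get("actions", []):
        ext = action.get("externalId") or ""
        if ext.startswith("job_"):
            ids.append("application_" + ext[len("job_"):])
        elif ext.startswith("application_"):
            ids.append(ext)
    return ids


def poll_once(base=OOZIE_URL):
    """One polling pass: list RUNNING workflows, then fetch each one's info."""
    with urlopen(running_workflows_url(base)) as resp:
        workflows = json.load(resp).get("workflows", [])
    app_ids = []
    for wf in workflows:
        with urlopen(f"{base}/v1/job/{wf['id']}?show=info") as resp:
            app_ids.extend(extract_app_ids(json.load(resp)))
    return app_ids
```

You would run `poll_once` on a timer and hand the resulting ids to whatever collects the logs.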
Solution 2:
1) Wait for the Oozie workflow to complete.
2) Using the Oozie REST API, get all ApplicationIds of the YARN jobs submitted by the sub-workflows.
3) Collect the aggregated logs with `yarn logs -applicationId <ApplicationId>`.
4) Use Filebeat to listen to the logs path.
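The collection steps above could be sketched like this; the output directory is an assumption (use whatever path your Filebeat prospector is configured to watch), and the helper names are ours:

```python
# After the workflow completes, run `yarn logs -applicationId <id>` per
# application and write the output to a directory that Filebeat watches.
import subprocess
from pathlib import Path

LOG_DIR = Path("/var/log/oozie-collected")  # hypothetical Filebeat path


def yarn_logs_cmd(app_id):
    """Command line that dumps the aggregated logs for one application."""
    return ["yarn", "logs", "-applicationId", app_id]


def collect_logs(app_ids, log_dir=LOG_DIR):
    """Run `yarn logs` per application and persist the output to disk."""
    log_dir.mkdir(parents=True, exist_ok=True)
    for app_id in app_ids:
        out = subprocess.run(yarn_logs_cmd(app_id),
                             capture_output=True, text=True, check=True)
        (log_dir / f"{app_id}.log").write_text(out.stdout)
```

Note that `yarn logs` only returns output once the application has finished and its logs have been aggregated, which is why this approach cannot stream logs while the workflow is still running.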
Solution 3:
1) In all applications handled by the Oozie workflow, write logs to a TCP socket appender.
This would mean the YARN nodes stream the log events as they occur. However, logs from YARN and Oozie themselves would not be persisted this way; only the application logs would be saved, so we're still losing them unless there is a way to force YARN to use an additional log4j appender for specific jobs only.
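For the socket-appender approach, a minimal log4j 1.x configuration might look like the fragment below. The host and port are assumptions (point them at whatever receives the events, e.g. a Logstash input), and getting this file picked up by each application is a separate problem:

```properties
# Hypothetical log4j.properties: stream application log events to a
# remote collector as they occur. Host/port are placeholders.
log4j.rootLogger=INFO, console, socket

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d %p %c - %m%n

log4j.appender.socket=org.apache.log4j.net.SocketAppender
log4j.appender.socket.RemoteHost=logstash-host
log4j.appender.socket.Port=4560
log4j.appender.socket.ReconnectionDelay=10000
```

`SocketAppender` sends serialized logging events over TCP, so the receiver must understand that wire format; `ReconnectionDelay` makes the appender retry rather than drop events permanently if the collector goes down.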
Thanks in advance for any help you can offer.
For your second issue: depending on whether you have log aggregation turned on (yarn.log-aggregation-enable) or not, you can increase the log retention time to as long as you want by setting the appropriate retain setting.
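For reference, the relevant yarn-site.xml properties look like this; the values below are illustrative only, and note the original poster's caveat that changing them requires admin rights on the cluster:

```xml
<!-- yarn-site.xml: illustrative retention values only -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>2592000</value> <!-- 30 days; used when log aggregation is enabled -->
</property>
<property>
  <name>yarn.nodemanager.log.retain-seconds</name>
  <value>604800</value> <!-- 7 days; used when log aggregation is disabled -->
</property>
```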
For the first issue: if you really want to get the logs in multiple places, then you should be able to do whatever log4j supports by changing the config. You can set up a container-specific log configuration via the container-log4j.properties file. Passing that changed file in to the different applications is application dependent (see MAPREDUCE-6052 for an example).
Hi, can you please let me know how you achieved this, which approach you used, and how you went ahead with the second solution?
How do you `use the Oozie REST api to poll for RUNNING workflows and get the ApplicationId of the Yarn jobs`?
I used Oozie 4.2.0 and executed a Spark job. When I view the logs in Hue, however, I see 2 application ids. One is bound to the workflow action's external id, but it is not the real Spark YARN application. The other one is, but I cannot get it!
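One possible explanation for the two ids: the action's `externalId` points at the Oozie launcher job, while the Spark job it spawns is typically reported in the action's `externalChildIDs` field. A hedged sketch of reading that field (the host and workflow id are assumptions, as is the exact shape of the response on your Oozie version):

```python
# Fetch the Oozie job info and collect the child application ids
# (the real Spark jobs) rather than the launcher ids.
import json
from urllib.request import urlopen

OOZIE_URL = "http://oozie-host:11000/oozie"  # hypothetical host


def spark_app_ids(job_info):
    """Collect child application ids reported per action.

    'externalChildIDs' may be missing, empty, or a comma-separated
    string of 'application_...' ids depending on the action type.
    """
    ids = []
    for action in job_info.get("actions", []):
        children = action.get("externalChildIDs") or ""
        ids.extend(c for c in children.split(",") if c.startswith("application_"))
    return ids


def fetch_spark_app_ids(wf_id, base=OOZIE_URL):
    """Query the Oozie job-info endpoint for one workflow id."""
    with urlopen(f"{base}/v2/job/{wf_id}?show=info") as resp:
        return spark_app_ids(json.load(resp))
```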