06-21-2017 11:31 AM
Hello, after successfully upgrading a small (5-node) CDH 5.7 cluster to CDH 5.11, I am experiencing various problems with existing Oozie Workflows that used to work correctly.
The most significant example: I have this Workflow scheduling 8 jobs in parallel (mix of Hive, Shell and Sqoop actions). The 8 jobs are acquired and start running. But the 8 sub-jobs performing the action are stuck in "ACCEPTED" status and never switch to "RUNNING" state.
After hours of work, I've not been able to find anything significant in the logs, apart from a few messages complaining about log4j. So I decided to upgrade the JDK from 1.7 to 1.8 too, but without any improvement in the situation.
Any help or suggestion pointing me in the right direction in solving this would be very very much appreciated!
06-26-2017 12:37 PM
As I believe that the problem is definitely due to differences between CDH 5.7 and CDH 5.11 in how YARN allocates resources to containers, I've gone through the YARN Tuning Guide again from scratch.
The latest version of the YARN Tuning Guide available is apparently for CDH 5.10:
In that page, an XLS Sheet is available to help out planning the various parameters in a correct and working fashion.
No luck. I always find myself with jobs stuck in "ACCEPTED" mode and never starting to run.
I also found this interesting thread suggesting how to configure Dynamic Resource Pools for YARN:
I tried to limit the "number of concurrent jobs" to just 2 in the relevant Configuration Page of the Dynamic Resource Pools, but again, no success.
Can anybody please point out any new feature implemented in CDH 5.11 that relates to YARN resource allocation (and that I have not mentioned here)? My Workflows were running smoothly before the upgrade, and now I'm in serious trouble!
Workarounds are welcome too, as well as methods for monitoring/tracing resource usage that would help me understand which parameters I have set up in a way that no longer works in CDH 5.11.
Thanks a lot for any hints or insights!
06-27-2017 06:42 AM
In the end I've been able to solve the issue.
I was tricked by the fact that the "YARN Resources Allocation Tuning Guide", which I applied again from scratch, proposes a (in my opinion) misleading way of calculating a few important parameters. The Guide can be found here:
As a matter of fact, the Guide contains a downloadable XLS file, which is a tool for calculating optimal parameters. This XLS automatically calculates and proposes a few values to be assigned to the YARN configuration:
As you can see above, at step 4 the sheet proposed "2" for "yarn.nodemanager.resource.cpu-vcores" and "5632" for "yarn.nodemanager.resource.memory-mb".
I later found out that the correct values to be assigned to those configurations are the 2 values proposed at "step 5".
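For anyone hitting the same symptom, here is a rough Python sketch of why the step 4 values could starve 8 parallel actions. It is only back-of-the-envelope arithmetic under assumptions that are mine, not from the Guide: every container asks for 1 vcore and 1024 MB, the scheduler enforces vcores as well as memory (for example with a DRF policy), and each Oozie launcher occupies 2 containers (its ApplicationMaster plus one map task). The class and function names are made up for the illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeManagerLimits:
    vcores: int     # yarn.nodemanager.resource.cpu-vcores
    memory_mb: int  # yarn.nodemanager.resource.memory-mb


def containers_per_node(limits, container_vcores=1, container_mb=1024):
    """Containers one node can host; the scarcer resource decides.

    The 1 vcore / 1024 MB container size is an assumption.
    """
    return min(limits.vcores // container_vcores,
               limits.memory_mb // container_mb)


def cluster_containers(limits, node_count, **container_size):
    """Total containers the cluster can host, assuming identical nodes."""
    return node_count * containers_per_node(limits, **container_size)


def launcher_containers(parallel_actions, containers_per_launcher=2):
    """Containers taken by the Oozie launchers alone.

    Assumption: each launcher = 1 ApplicationMaster + 1 map task.
    """
    return parallel_actions * containers_per_launcher


# Values the sheet proposed at step 4, on the 5-node cluster.
step4 = NodeManagerLimits(vcores=2, memory_mb=5632)
capacity = cluster_containers(step4, node_count=5)
needed = launcher_containers(parallel_actions=8)

print(f"capacity={capacity} containers, launchers alone need {needed}")
if needed >= capacity:
    print("No room left for the child jobs: they would wait in ACCEPTED.")
```

With these numbers the sketch reports a capacity of 10 containers against 16 needed by the launchers alone, so the child jobs would have nothing left to run in. If your scheduler only accounts for memory, pass a very small vcore demand or adapt the function; the point is just to compare per-node limits with what the parallel actions ask for.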
Definitely, partly my fault (I do not have deep knowledge of YARN configuration). But partly misleading doc indeed.
I am now fine-tuning, trying different settings for the various Java heap sizes etc.
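While tuning, one possible way to watch what the cluster has left is the ResourceManager REST endpoint /ws/v1/cluster/metrics. The Python sketch below is a minimal example under assumptions: the ResourceManager address is a placeholder you must replace, the thresholds are arbitrary, and the virtual core fields may be missing depending on the Hadoop version, so they are read defensively. The helper names are made up for the illustration.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_url):
    """Read the clusterMetrics object from the ResourceManager REST API.

    rm_url is a placeholder such as "http://<rm-host>:8088".
    """
    with urlopen(f"{rm_url}/ws/v1/cluster/metrics", timeout=10) as resp:
        return json.load(resp)["clusterMetrics"]


def starvation_hints(metrics, min_free_mb=1024, min_free_vcores=1):
    """Name the resources that look exhausted while applications are pending.

    Returns an empty list when nothing is pending. The thresholds are
    arbitrary assumptions, not recommended values.
    """
    hints = []
    if metrics.get("appsPending", 0) == 0:
        return hints
    if metrics.get("availableMB", 0) < min_free_mb:
        hints.append("memory")
    free_vcores = metrics.get("availableVirtualCores")
    if free_vcores is not None and free_vcores < min_free_vcores:
        hints.append("vcores")
    return hints


# Offline example with made-up numbers: apps are pending, memory is
# still free, but no vcores are left.
sample = {"appsPending": 6, "availableMB": 18000, "availableVirtualCores": 0}
print(starvation_hints(sample))
```

Against a live cluster you would call starvation_hints(fetch_cluster_metrics("http://<rm-host>:8088")) while the Workflow is running and see which resource runs out when jobs pile up in ACCEPTED.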
Still, I have no idea why everything was working fine until recently and stopped working after upgrading to 5.11, as I did not change any configuration while upgrading and the physical resources are identical.