I have a 4-node cluster with plenty of CPU and memory. I am running TPCx-BB (BigBench), which has a mix of Hive and MapReduce jobs, and I see several failed ApplicationMaster attempts. It looks like a resource issue, although the portal shows enough resources. For a particular workload, if I reboot the machines it always runs successfully, but when I run it again without rebooting, jobs get stuck in Pending. I also see that sometimes the ApplicationMaster fails but the second attempt succeeds.
Similarly, when I run Spark SQL, I see this warning for several minutes (up to 10 minutes) before resources are finally allocated:
WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources.
I am not running any other application on this cluster, and it clearly has enough resources to run the applications, yet scheduling is sometimes delayed or fails. Any idea what the problem could be? I have searched this forum for similar questions; usually the cause is a lack of resources, which is not the case for my benchmark.
Can you provide a bit more background for this issue?
Which CDH version are you using? Do you have Cloudera Manager in the cluster or not? You will also need to say far more about the ResourceManager configuration (scheduler and its config) and the NodeManager configs. "Lots of resources" is open to many interpretations...
The last thing is your application requests: what kind of containers do you request for the different jobs you mentioned?
Thank you for getting back to me. Below are more details:
Cloudera Enterprise Data Hub Edition Trial 5.9.0 (had the same issue with 5.8.0)
Java Version 1.7.0.
I have a four-node cluster with one node dedicated to server roles (HDFS NameNode, HiveServer, Cloudera Manager Server, YARN ResourceManager, etc.) and the other three to worker roles (DataNode, NodeManager).
Each node has dual-socket Xeon E5-2670 CPUs with 24 physical cores in total (48 with hyper-threading), 256 GB of memory, and one PCIe SSD with 3 TB of storage.
YARN configuration options:
Container virtual CPU cores: 44
Container memory: 160 GB
Map/Reduce task memory: 4 GB
Map/Reduce heap size: 3 GB
The ResourceManager shows 480 GB of memory and 132 vcores, as expected.
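For reference, those totals line up with the per-node settings above, since only the three worker nodes run NodeManagers:

```shell
# Cluster totals as the RM reports them = per-node NodeManager resources
# multiplied by the number of worker nodes (3 workers, 160 GB / 44 vcores each).
workers=3
mem_per_node_gb=160
vcores_per_node=44
echo "total memory: $(( workers * mem_per_node_gb )) GB"   # prints: total memory: 480 GB
echo "total vcores: $(( workers * vcores_per_node ))"      # prints: total vcores: 132
```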
The application is TPCx-BB, a big data benchmark from the TPC, which I run with both the Hive/MapReduce engine and Spark SQL; I see scheduling issues in both cases. Below is my Spark configuration. I am using less than the available resources because of GC overhead I observed:
Executor-memory : 50G
I am not sure how to attach a file here, but I have posted the YARN log for an application that failed on this forum:
The application gets assigned 57 or 62 containers within the first 20 seconds or so. At that point the log simply stops. The AM should be doing the bulk of the work at the point you have reached.
There is no scheduling problem on the RM side. If there were a scheduling problem there, you would not get any containers. Since you have containers assigned, the AM should take over and start the containers on the NMs. There is no trace of that at all. This could be due to a number of things; one point to check would be the connectivity between the AM and the NMs.
One other thing to try is turning on debug logging on the AM to see whether anything else shows up on the AM side.
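For an MRv2 job, the AM log level can usually be raised per job via the yarn.app.mapreduce.am.log.level property (a sketch; the jar, class, and arguments below are placeholders, and passing -D this way assumes the job driver uses ToolRunner/GenericOptionsParser — it can also be set cluster-wide in mapred-site.xml):

```shell
# Run the job with the MapReduce AM logging at DEBUG for this submission only.
hadoop jar <your-job.jar> <MainClass> \
  -D yarn.app.mapreduce.am.log.level=DEBUG \
  <job arguments>
```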
For now I would not suspect the RM at all, and not blame the scheduler in the RM.
Thank you for your help. It's good to know the RM is not the problem. I reran the same job with JobHistory, ResourceManager, and NodeManager logging set to DEBUG, but I still can't figure out why the application stops progressing. Sometimes both AM attempts fail, and sometimes the first one fails but the second attempt succeeds. I checked the logs and nothing stood out to me. Connectivity seems fine, and the first container (the ApplicationMaster) shows a status of RUNNING, but the MapReduce tasks never start.
You can see the logs in TPCx-BB group:
I did not include RM log, as it was too big. If you think it can help, I can upload it somewhere else.
Is there a way to get the ApplicationMaster log in addition to these? I appreciate your help.
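In case it helps: with log aggregation enabled, the AM container's log can normally be retrieved with the YARN CLI after the application finishes (the node address below is a placeholder; older versions of the CLI require -nodeAddress alongside -containerId):

```shell
# All container logs for the application (the AM is container ..._000001
# of each attempt):
yarn logs -applicationId application_1483570317962_0001

# Just the first attempt's AM container:
yarn logs -applicationId application_1483570317962_0001 \
          -containerId container_1483570317962_0001_01_000001 \
          -nodeAddress <nm-host>:8041
```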
I am still trying to root-cause the issue. Here is the log for an application where the first ApplicationMaster attempt fails and the second one succeeds. The first ApplicationMaster stops progressing after containers are allocated and is finally killed by YARN. Here are the last entries in the container log:
2017-01-04 15:37:37,689 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1483570317962_0001_m_000039_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2017-01-04 15:37:37,690 INFO [Thread-52] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: mapResourceRequest:<memory:4096, vCores:1>
2017-01-04 15:37:37,709 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Event Writer setup for JobId: job_1483570317962_0001, File: hdfs://msl-dpe-perf44p10:8020/user/anahita.sh/.staging/job_1483570317962_0001/job_1483570317962_0001_1.jhist
2017-01-04 15:37:38,033 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://msl-dpe-perf44p10:8020]
2017-01-04 15:37:38,611 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:40 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
2017-01-04 15:37:38,687 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1483570317962_0001: ask=1 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:490976, vCores:131> knownNMs=3
2017-01-04 15:37:39,743 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 40
The second attempt starts similarly but continues on to assigning containers and completes successfully:
2017-01-04 15:48:40,432 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 40
2017-01-04 15:53:11,297 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1483570317962_0001_02_000002 to attempt_1483570317962_0001_m_000000_1000
2017-01-04 15:53:11,299 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1483570317962_0001_02_000003 to attempt_1483570317962_0001_m_000001_1000
2017-01-04 15:53:11,299 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1483570317962_0001_02_000004 to attempt_1483570317962_0001_m_000002_1000
I am at a loss as to what the problem is and how to debug it. Do you still believe it is not the ResourceManager? Is there any way to get a more detailed log for the ApplicationMaster container?
I have uploaded the log and resource manager log to my google drive:
application log (Check for containers container_1483570317962_0001_01_000001 and container_1483570317962_0001_02_000001)
Since the AM runs for a while but does not progress, I would recommend grabbing a sequence of stack traces to see what it is doing between getting the containers assigned and getting killed. The point at which the message about the 40 allocated containers is logged is when you want to start grabbing stack traces.
I still see this as an AM issue, because a set of containers is assigned and then nothing happens.
The RM has given the AM resources to start containers and it now needs to do its work to assign tasks to it.
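A sketch of how that sequence of stack traces could be captured, assuming you can ssh to the NodeManager host running the AM and locate its JVM pid first (e.g. with `jps -m | grep MRAppMaster`):

```shell
# Dump the AM JVM's stack every 5 seconds, 10 times; pass the AM pid as $1.
PID="$1"
for i in $(seq 1 10); do
  jstack "$PID" > "am-stack.$i.txt"
  sleep 5
done
```

Comparing consecutive dumps shows whether the AM's threads are blocked, waiting on a remote call, or idle.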
I got the debug log for the AM container. After the containers are allocated, they get assigned:
2017-01-18 16:41:57,961 DEBUG [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigning container container_1484164132320_0296_01_000002 with priority 20 to NM msl-dpe-perf43p10:8041
2017-01-18 16:41:57,961 DEBUG [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigning container container_1484164132320_0296_01_000003 with priority 20 to NM msl-dpe-perf43p10:8041
...
The rest of the messages look like connection checks:
2017-01-18 16:52:16,359 DEBUG [Socket Reader #1 for port 53780] org.apache.hadoop.ipc.Server: got #1905
2017-01-18 16:52:16,359 DEBUG [IPC Server handler 0 on 53780] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 53780: org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getTaskAttemptCompletionEvents from 10.11.10.167:54473 Call#1905 Retry#0 for RpcKind RPC_PROTOCOL_BUFFER
2017-01-18 16:52:16,359 DEBUG [IPC Server handler 0 on 53780] org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:anahita.sh (auth:SIMPLE) from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
2017-01-18 16:52:16,374 DEBUG [IPC Server handler 0 on 53780] org.apache.hadoop.ipc.Server: Served: getTaskAttemptCompletionEvents queueTime= 4 procesingTime= 11
2017-01-18 16:52:16,374 DEBUG [IPC Server handler 0 on 53780] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 53780: responding to org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getTaskAttemptCompletionEvents from 10.11.10.167:54473 Call#1905 Retry#0
2017-01-18 16:52:16,374 DEBUG [IPC Server handler 0 on 53780] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 53780: responding to org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getTaskAttemptCompletionEvents from 10.11.10.167:54473 Call#1905 Retry#0 Wrote 33 bytes.
...
I do not see any message pointing to why the AM is not making progress, but I am not familiar with what the log should look like. I would really appreciate it if you could take a look at the logs and let me know if you see any issues, or point me to another forum where I should post this.
Application Log: check for containers 01_000001 (does not run) and 02_000001 (starts running after 5 minutes)
NodeManager Log : Look for application application_1484164132320_0296
ResourceManager Log : Look for application application_1484164132320_0296