01-24-2017 09:20 AM
Hi Wilfred, I got the debug log for the AM container. After the containers are allocated, they get assigned:
2017-01-18 16:41:57,961 DEBUG [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigning container container_1484164132320_0296_01_000002 with priority 20 to NM msl-dpe-perf43p10:8041
2017-01-18 16:41:57,961 DEBUG [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigning container container_1484164132320_0296_01_000003 with priority 20 to NM msl-dpe-perf43p10:8041
... And the rest of the messages look like routine connection checks:
2017-01-18 16:52:16,359 DEBUG [Socket Reader #1 for port 53780] org.apache.hadoop.ipc.Server: got #1905
2017-01-18 16:52:16,359 DEBUG [IPC Server handler 0 on 53780] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 53780: org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getTaskAttemptCompletionEvents from 10.11.10.167:54473 Call#1905 Retry#0 for RpcKind RPC_PROTOCOL_BUFFER
2017-01-18 16:52:16,359 DEBUG [IPC Server handler 0 on 53780] org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:anahita.sh (auth:SIMPLE) from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
2017-01-18 16:52:16,374 DEBUG [IPC Server handler 0 on 53780] org.apache.hadoop.ipc.Server: Served: getTaskAttemptCompletionEvents queueTime= 4 procesingTime= 11
2017-01-18 16:52:16,374 DEBUG [IPC Server handler 0 on 53780] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 53780: responding to org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getTaskAttemptCompletionEvents from 10.11.10.167:54473 Call#1905 Retry#0
2017-01-18 16:52:16,374 DEBUG [IPC Server handler 0 on 53780] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 53780: responding to org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getTaskAttemptCompletionEvents from 10.11.10.167:54473 Call#1905 Retry#0 Wrote 33 bytes.
... I do not see any message pointing to why the AM is not making any progress, but I am not familiar with what the log should look like. I would really appreciate it if you could take a look at the logs and let me know if you see any issues, or point me to another forum where I should post this issue.
Application log: check containers 01_000001 (does not run) and 02_000001 (starts running after 5 minutes)
NodeManager log: look for application application_1484164132320_0296
ResourceManager log: look for application application_1484164132320_0296
Thank you, Anahita
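P.S. In case it helps, this is roughly how I pulled and filtered the AM container log shown above (a sketch, assuming log aggregation is enabled and the standard yarn logs CLI options; the node address is a placeholder for whichever NodeManager ran the AM container):
# Pull the aggregated log for the first AM container of the stuck application
# (the application has to be finished or killed for yarn logs to see it).
yarn logs -applicationId application_1484164132320_0296 \
          -containerId container_1484164132320_0296_01_000001 \
          -nodeAddress <nm-host>:<nm-port> > am_01_000001.log
# Keep only the scheduling-related lines to make the stall easier to spot.
grep -E 'RMContainerAllocator|RMContainerRequestor|TaskAttempt Transitioned' am_01_000001.log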
01-06-2017 10:02 AM
I am still trying to root-cause the issue. Here is the log for an application where the first ApplicationMaster attempt fails and the second one is successful. The first ApplicationMaster stops progressing after containers are allocated and is finally killed by YARN. Here are the last entries in its container log:
2017-01-04 15:37:37,689 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1483570317962_0001_m_000039_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2017-01-04 15:37:37,690 INFO [Thread-52] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: mapResourceRequest:<memory:4096, vCores:1>
2017-01-04 15:37:37,709 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Event Writer setup for JobId: job_1483570317962_0001, File: hdfs://msl-dpe-perf44p10:8020/user/anahita.sh/.staging/job_1483570317962_0001/job_1483570317962_0001_1.jhist
2017-01-04 15:37:38,033 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://msl-dpe-perf44p10:8020]
2017-01-04 15:37:38,611 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:40 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
2017-01-04 15:37:38,687 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1483570317962_0001: ask=1 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:490976, vCores:131> knownNMs=3
2017-01-04 15:37:39,743 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 40
The second attempt starts out the same way, but goes on to assign the containers and completes successfully:
2017-01-04 15:48:40,432 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 40
2017-01-04 15:53:11,297 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1483570317962_0001_02_000002 to attempt_1483570317962_0001_m_000000_1000
2017-01-04 15:53:11,299 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1483570317962_0001_02_000003 to attempt_1483570317962_0001_m_000001_1000
2017-01-04 15:53:11,299 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1483570317962_0001_02_000004 to attempt_1483570317962_0001_m_000002_1000
I am at a loss as to what the problem is and how to debug it. Do you still believe it is not the ResourceManager? Is there any way to get a more detailed log for the ApplicationMaster container? I have uploaded the application log and the ResourceManager log to my Google Drive:
Application log (check containers container_1483570317962_0001_01_000001 and container_1483570317962_0001_02_000001)
ResourceManager log
Thank you, Anahita
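Edit: For the next run I am going to try raising the AM log level, roughly like this (a sketch, assuming the stock yarn.app.mapreduce.am.log.level property and a driver that goes through ToolRunner so the -D options are picked up; the jar and class names are placeholders):
# Submit the MR job with the AM and task log levels raised to DEBUG.
hadoop jar my-job.jar com.example.MyDriver \
  -Dyarn.app.mapreduce.am.log.level=DEBUG \
  -Dmapreduce.map.log.level=DEBUG \
  -Dmapreduce.reduce.log.level=DEBUG \
  <job args>
# For the Hive-driven MapReduce jobs I believe the same property can be set in the session,
# e.g. hive> set yarn.app.mapreduce.am.log.level=DEBUG;  (not verified on this cluster)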
12-29-2016 02:31 PM
Thank you for your help. It's good to know the RM is not the problem. I reran the same job with JobHistory, ResourceManager and NodeManager logging set to DEBUG, but I still can't figure out why the application stops progressing. Sometimes both AM attempts fail, and sometimes the first one fails but the second attempt is successful. I checked the logs and nothing stood out to me. Connectivity seems to be there, and the first container (the ApplicationMaster) shows status RUNNING, but the MapReduce tasks never start. You can see the logs in the TPCx-BB group: https://groups.google.com/forum/#!topic/big-bench/guchcvulkk0 I did not include the RM log, as it was too big; if you think it would help, I can upload it somewhere else. Is there a way to get the ApplicationMaster log in addition to these? Appreciate your help. Anahita
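P.S. Partly answering my own question: once an attempt has finished or been killed, I am planning to pull the ApplicationMaster container log with the yarn logs CLI (a sketch, assuming log aggregation is enabled; the application ID below is a placeholder):
# Find the application ID of the failed run.
yarn application -list -appStates FAILED,KILLED,FINISHED
# Dump all aggregated container logs for it; the AM container is the *_01_000001 one.
yarn logs -applicationId <application_id> > app.log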
12-27-2016 11:00 AM
Wilfred, thank you for getting back to me. Below are more details:
Cloudera Enterprise Data Hub Edition Trial 5.9.0 (I had the same issue with 5.8.0), Java version 1.7.0. I have a four-node cluster with one node dedicated to server roles (HDFS NameNode, HiveServer, Cloudera Management Server, YARN ResourceManager, etc.) and the other three to worker roles (DataNode, NodeManager). Each node has dual-socket Xeon E5-2670 CPUs with 24 physical cores in total (48 with hyper-threading), 256 GB of memory and one PCIe SSD with 3 TB of storage.
YARN configuration options:
FairScheduler
Container virtual CPU cores: 44
Container memory: 160 GB
Map/Reduce task memory: 4 GB
Map/Reduce heap size: 3 GB
The ResourceManager shows 480 GB of memory and 132 vcores, as expected.
The application is TPCx-BB, a big data benchmark from the TPC, which I run with both the Hive/MapReduce engine and Spark SQL, and I see scheduling issues in both cases. Below is my Spark configuration; I am using less than the available resources because of GC overhead I saw:
num-executors: 9
executor-cores: 11
executor-memory: 50G
driver-memory: 10G
I am not sure how to attach a file here, but I have posted the YARN log for an application that failed on this forum: https://groups.google.com/forum/#!topic/big-bench/guchcvulkk0
Thank you. Anahita
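For reference, the Spark SQL side is launched roughly like this (a sketch using the resource settings above; the deploy mode and the driver name are placeholders, not the exact command the TPCx-BB kit generates):
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 9 \
  --executor-cores 11 \
  --executor-memory 50G \
  --driver-memory 10G \
  <tpcx-bb-spark-sql-driver>   # placeholder for the actual benchmark job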
12-22-2016 02:30 PM
Hi, I have a 4-node cluster with plenty of CPU and memory. I am running TPCx-BB (BigBench), which is a mix of Hive and MapReduce jobs, and I see several failed ApplicationMaster attempts. It looks like a resource issue, although the portal shows enough free resources. For one particular workload, if I reboot the machines it always runs successfully, but when I run it again without rebooting, the jobs get stuck in PENDING. I also see that sometimes the first ApplicationMaster attempt fails but the second one succeeds. Similarly, when I run Spark SQL I see this warning for several minutes (up to 10 minutes) before resources are finally allocated:
WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I am not running any other applications on this cluster, and the cluster clearly has enough resources to run these jobs, yet scheduling is sometimes delayed or fails. Any idea what the problem could be? I have checked this forum and searched for similar questions; usually the issue turns out to be a lack of resources, which is not the case for my benchmark.
Thank you, Anahita Shayesteh
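P.S. While a job is stuck in PENDING, this is what I run to double-check that the NodeManagers are registered and what the scheduler thinks is still available (standard yarn CLI commands, so hopefully nothing unusual; the node ID is a placeholder):
# Confirm all NodeManagers are registered and RUNNING.
yarn node -list
# Show used/available memory and vcores on one node.
yarn node -status <node_id>
# See which applications are ACCEPTED (waiting) vs RUNNING.
yarn application -list -appStates ACCEPTED,RUNNING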