Created 09-23-2017 10:05 PM
We ´have below HDP 2.6 environment.
Today we did the ennoblement of HA for the Resource Manager using Ambari and after this we see that the nodemanager on hosts don't come up.
below is some more information which I see in all the node manager logs residing on 4 data nodes:-
2017-09-22 11:55:27,006 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 23651 for container-id container_e11_1504517816511_0003_01_000001: 284.9 MB of 1 GB physical memory used; 2.1 GB of 2.1 GB virtual memory used
2017-09-22 11:55:27,074 ERROR nodemanager.NodeManager (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
2017-09-22 11:55:27,114 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
2017-09-22 11:55:27,222 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(519)) - Applications still running : [application_1504517816511_0003]
2017-09-22 11:55:27,226 INFO ipc.Server (Server.java:stop(2752)) - Stopping server on 45454
2017-09-22 11:55:27,240 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(178)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
2017-09-22 11:55:27,241 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(527)) - Aborting log aggregation for application_1504517816511_0003
2017-09-22 11:55:27,241 INFO logaggregation.LogAggregationService (LogAggregationService.java:stopAggregators(199)) - Waiting for aggregation to complete for application_1504517816511_0003
2017-09-22 11:55:27,243 INFO ipc.Server (Server.java:run(932)) - Stopping IPC Server listener on 45454
2017-09-22 11:55:27,264 INFO ipc.Server (Server.java:run(1069)) - Stopping IPC Server Responder
2017-09-22 11:55:27,424 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(545)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2017-09-22 11:55:27,487 INFO ipc.Server (Server.java:stop(2752)) - Stopping server on 8040
2017-09-22 11:55:27,496 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(883)) - Public cache exiting
2017-09-22 11:55:27,496 INFO ipc.Server (Server.java:run(1069)) - Stopping IPC Server Responder
2017-09-22 11:55:27,499 INFO ipc.Server (Server.java:run(932)) - Stopping IPC Server listener on 8040
2017-09-22 11:55:27,506 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(211)) - Stopping NodeManager metrics system...
2017-09-22 11:55:27,508 INFO impl.MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue(141)) - timeline thread interrupted.
2017-09-22 11:55:27,510 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(217)) - NodeManager metrics system stopped.
2017-09-22 11:55:27,510 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(606)) - NodeManager metrics system shutdown complete.
2017-09-22 11:55:27,517 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at
************************************************************/
2017-09-22 11:26:08,717 INFO recovery.NMLeveldbStateStoreService (NMLeveldbStateStoreService.java:run(1016)) - Starting full compaction cycle
2017-09-22 11:26:08,718 INFO recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) - Level-0 table #6418: started
2017-09-22 11:26:08,718 INFO recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) - Level-0 table #6418: 0 bytes OK
2017-09-22 11:26:08,726 INFO recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) - Delete type=0 #6415
2017-09-22 11:26:08,726 INFO recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) - Manual compaction at level-0 from (begin) .. (end); will stop at (end)
2017-09-22 11:26:08,726 INFO recovery.NMLeveldbStateStoreService$LeveldbLogger (NMLeveldbStateStoreService.java:log(1032)) - Manual compaction at level-1 from (begin) .. (end); will stop at (end)
2017-09-22 11:26:08,726 INFO recovery.NMLeveldbStateStoreService (NMLeveldbStateStoreService.java:run(1023)) - Full compaction cycle completed in 9 msec
2017-09-22 11:36:58,466 INFO security.NMContainerTokenSecretManager (NMContainerTokenSecretManager.java:setMasterKey(138)) - Rolling master-key for container-tokens, got key with id -2010831289
2017-09-22 11:36:58,466 INFO security.NMTokenSecretManagerInNM (NMTokenSecretManagerInNM.java:setMasterKey(135)) - Rolling master-key for container-tokens, got key with id 1019460078
2017-09-22 11:55:26,722 ERROR nodemanager.NodeManager (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
2017-09-22 11:55:26,754 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
2017-09-22 11:55:26,871 INFO ipc.Server (Server.java:stop(2752)) - Stopping server on 45454
2017-09-22 11:55:26,876 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(178)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
2017-09-22 11:55:26,876 INFO ipc.Server (Server.java:run(1069)) - Stopping IPC Server Responder
2017-09-22 11:55:26,880 INFO ipc.Server (Server.java:run(932)) - Stopping IPC Server listener on 45454
2017-09-22 11:55:27,049 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(545)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2017-09-22 11:55:27,099 INFO ipc.Server (Server.java:stop(2752)) - Stopping server on 8040
2017-09-22 11:55:27,101 INFO ipc.Server (Server.java:run(1069)) - Stopping IPC Server Responder
2017-09-22 11:55:27,101 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(883)) - Public cache exiting
2017-09-22 11:55:27,102 INFO ipc.Server (Server.java:run(932)) - Stopping IPC Server listener on 8040
2017-09-22 11:55:27,105 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(211)) - Stopping NodeManager metrics system...
2017-09-22 11:55:27,105 INFO impl.MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue(141)) - timeline thread interrupted.
2017-09-22 11:55:27,110 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(217)) - NodeManager metrics system stopped.
2017-09-22 11:55:27,111 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(606)) - NodeManager metrics system shutdown complete.
2017-09-22 11:55:27,126 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at
************************************************************/
yarn-yarn-nodemanager-ls5387v4.log lines 981844-981905/981905 (END)
2017-09-22 11:55:26,957 ERROR nodemanager.NodeManager (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
2017-09-22 11:55:26,979 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
2017-09-22 11:55:27,127 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(519)) - Applications still running : [application_1504517816511_0002]
2017-09-22 11:55:27,131 INFO ipc.Server (Server.java:stop(2752)) - Stopping server on 45454
2017-09-22 11:55:27,237 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(178)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
2017-09-22 11:55:27,238 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(527)) - Aborting log aggregation for application_1504517816511_0002
2017-09-22 11:55:27,238 INFO logaggregation.LogAggregationService (LogAggregationService.java:stopAggregators(199)) - Waiting for aggregation to complete for application_1504517816511_0002
2017-09-22 11:55:27,238 INFO ipc.Server (Server.java:run(932)) - Stopping IPC Server listener on 45454
2017-09-22 11:55:27,249 INFO ipc.Server (Server.java:run(1069)) - Stopping IPC Server Responder
2017-09-22 11:55:27,351 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(545)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2017-09-22 11:55:27,419 INFO ipc.Server (Server.java:stop(2752)) - Stopping server on 8040
2017-09-22 11:55:27,424 INFO ipc.Server (Server.java:run(1069)) - Stopping IPC Server Responder
2017-09-22 11:55:27,424 INFO ipc.Server (Server.java:run(932)) - Stopping IPC Server listener on 8040
2017-09-22 11:55:27,425 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(883)) - Public cache exiting
2017-09-22 11:55:27,431 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(211)) - Stopping NodeManager metrics system...
2017-09-22 11:55:27,440 INFO impl.MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue(141)) - timeline thread interrupted.
2017-09-22 11:55:27,442 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(217)) - NodeManager metrics system stopped.
2017-09-22 11:55:27,443 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(606)) - NodeManager metrics system shutdown complete.
2017-09-22 11:55:27,461 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at
************************************************************/
2017-09-22 11:55:27,494 ERROR nodemanager.NodeManager (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
2017-09-22 11:55:27,551 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
2017-09-22 11:55:27,659 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(519)) - Applications still running : [application_1504517816511_0001, application_1504517816511_0004]
2017-09-22 11:55:27,662 INFO ipc.Server (Server.java:stop(2752)) - Stopping server on 45454
2017-09-22 11:55:27,697 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(178)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
2017-09-22 11:55:27,698 INFO ipc.Server (Server.java:run(932)) - Stopping IPC Server listener on 45454
2017-09-22 11:55:27,698 INFO ipc.Server (Server.java:run(1069)) - Stopping IPC Server Responder
2017-09-22 11:55:27,705 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(527)) - Aborting log aggregation for application_1504517816511_0001
2017-09-22 11:55:27,705 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(527)) - Aborting log aggregation for application_1504517816511_0004
2017-09-22 11:55:27,705 INFO logaggregation.LogAggregationService (LogAggregationService.java:stopAggregators(199)) - Waiting for aggregation to complete for application_1504517816511_0001
2017-09-22 11:55:27,705 INFO logaggregation.LogAggregationService (LogAggregationService.java:stopAggregators(199)) - Waiting for aggregation to complete for application_1504517816511_0004
2017-09-22 11:55:27,758 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(545)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2017-09-22 11:55:28,197 INFO ipc.Server (Server.java:stop(2752)) - Stopping server on 8040
2017-09-22 11:55:28,223 INFO ipc.Server (Server.java:run(1069)) - Stopping IPC Server Responder
2017-09-22 11:55:28,225 INFO ipc.Server (Server.java:run(932)) - Stopping IPC Server listener on 8040
2017-09-22 11:55:28,225 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(211)) - Stopping NodeManager metrics system...
2017-09-22 11:55:28,225 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(883)) - Public cache exiting
2017-09-22 11:55:28,228 INFO impl.MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue(141)) - timeline thread interrupted.
2017-09-22 11:55:28,355 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(217)) - NodeManager metrics system stopped.
2017-09-22 11:55:28,356 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(606)) - NodeManager metrics system shutdown complete.
2017-09-22 11:55:28,391 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
************************************************************/
Let me know how I could overcome this issue and make the cluster up and running.
I have restarted the completed cluster but does not help.
In Ambari when I start the service I see the following in the nodemanager restart waiting for a long time:
2017-09-22 16:59:03,828 - Forcefully ensuring existence and permissions of the directory: /var/hadoop/yarn/log 2017-09-22 16:59:03,828 - Directory['/var/hadoop/yarn/log'] {'group': 'hadoop', 'cd_access': 'a', 'create_parents': True, 'ignore_failures': True, 'mode': 0775, 'owner': 'yarn'} 2017-09-22 16:59:03,842 - Host contains mounts: ['/sys', '/proc', '/dev', '/sys/kernel/security', '/dev/shm', '/dev/pts', '/run', '/sys/fs/cgroup', '/sys/fs/cgroup/systemd', '/sys/fs/pstore', '/sys/fs/cgroup/cpuset', '/sys/fs/cgroup/cpu,cpuacct', '/sys/fs/cgroup/memory', '/sys/fs/cgroup/devices', '/sys/fs/cgroup/freezer', '/sys/fs/cgroup/blkio', '/sys/fs/cgroup/perf_event', '/', '/proc/sys/fs/binfmt_misc', '/sys/kernel/debug', '/dev/mqueue', '/var', '/data1', '/net', '/sapmnt/home', '/sapmnt/HOME', '/sapmnt/HOME/d057137']. 2017-09-22 16:59:03,843 - Mount point for directory /var/hadoop/yarn/log is /var 2017-09-22 16:59:03,843 - File['/var/lib/ambari-agent/data/yarn/yarn_log_dir_mount.hist'] {'content': '\n# This file keeps track of the last known mount-point for each dir.\n# It is safe to delete, since it will get regenerated the next time that the component of the service starts.\n# However, it is not advised to delete this file since Ambari may\n# re-create a dir that used to be mounted on a drive but is now mounted on the root.\n# Comments begin with a hash (#) symbol\n# dir,mount_point\n/var/hadoop/yarn/log,/var\n', 'owner': 'hdfs', 'group': 'hadoop', 'mode': 0644} 2017-09-22 16:59:03,844 - Directory['/var/lib/ambari-agent/data/yarn'] {'create_parents': True, 'mode': 0755} 2017-09-22 16:59:03,844 - Mount point for directory /data1/hadoop/yarn/local is /data1 2017-09-22 16:59:03,844 - Mount point for directory /data1/hadoop/yarn/local is /data1 2017-09-22 16:59:03,845 - Forcefully ensuring existence and permissions of the directory: /data1/hadoop/yarn/local 2017-09-22 16:59:03,845 - Directory['/data1/hadoop/yarn/local'] {'group': 'hadoop', 'cd_access': 'a', 'recursive_mode_flags': {'d': 'a+rwx', 'f': 'a+rw'}, 'create_parents': True, 'ignore_failures': True, 'mode': 0755, 'owner': 'yarn'}
Anish
Created 09-24-2017 02:47 AM
What is the application_1504517816511_0001~4? I'm not sure, if a run application_1504517816511 before enabled ResourceManager HA, then kill this application first.
Created 09-24-2017 12:45 PM
Hi Peter,
I stopped the complete cluster and restarted it again. But still the nodemanager on all the hosts is not coming up .While starting all the services the 4 data nodes which have nodemanagers on them are showing the following in the ambari startup log and are in the waiting state for a long time:-
2017-09-24 14:33:21,444 - Forcefully ensuring existence and permissions of the directory: /var/hadoop/yarn/log 2017-09-24 14:33:21,445 - Directory['/var/hadoop/yarn/log'] {'group': 'hadoop', 'cd_access': 'a', 'create_parents': True, 'ignore_failures': True, 'mode': 0775, 'owner': 'yarn'} 2017-09-24 14:33:21,456 - Host contains mounts: ['/sys', '/proc', '/dev', '/sys/kernel/security', '/dev/shm', '/dev/pts', '/run', '/sys/fs/cgroup', '/sys/fs/cgroup/systemd', '/sys/fs/pstore', '/sys/fs/cgroup/cpuset', '/sys/fs/cgroup/cpu,cpuacct', '/sys/fs/cgroup/memory', '/sys/fs/cgroup/devices', '/sys/fs/cgroup/freezer', '/sys/fs/cgroup/blkio', '/sys/fs/cgroup/perf_event', '/', '/proc/sys/fs/binfmt_misc', '/sys/kernel/debug', '/dev/mqueue', '/var', '/data1', '/net', '/sapmnt/home', '/sapmnt/HOME', '/proc/sys/fs/binfmt_misc']. 2017-09-24 14:33:21,456 - Mount point for directory /var/hadoop/yarn/log is /var 2017-09-24 14:33:21,457 - File['/var/lib/ambari-agent/data/yarn/yarn_log_dir_mount.hist'] {'content': '\n# This file keeps track of the last known mount-point for each dir.\n# It is safe to delete, since it will get regenerated the next time that the component of the service starts.\n# However, it is not advised to delete this file since Ambari may\n# re-create a dir that used to be mounted on a drive but is now mounted on the root.\n# Comments begin with a hash (#) symbol\n# dir,mount_point\n/var/hadoop/yarn/log,/var\n', 'owner': 'hdfs', 'group': 'hadoop', 'mode': 0644} 2017-09-24 14:33:21,458 - Directory['/var/lib/ambari-agent/data/yarn'] {'create_parents': True, 'mode': 0755} 2017-09-24 14:33:21,458 - Mount point for directory /data1/hadoop/yarn/local is /data1 2017-09-24 14:33:21,459 - Mount point for directory /data1/hadoop/yarn/local is /data1 2017-09-24 14:33:21,459 - Forcefully ensuring existence and permissions of the directory: /data1/hadoop/yarn/local 2017-09-24 14:33:21,460 - Directory['/data1/hadoop/yarn/local'] {'group': 'hadoop', 'cd_access': 'a', 'recursive_mode_flags': {'d': 'a+rwx', 'f': 'a+rw'}, 'create_parents': True, 'ignore_failures': True, 'mode': 0755, 'owner': 'yarn'}
After a certain time frame a message is triggeredkilled the python after a certain time limit. Still not able to conclude where exactly the problem is lying and how to overcome it.
Thanks & regards,
Anish
Created 09-24-2017 01:46 PM
I need ResourceManager log when all YARN Services are restarting.
Created 09-24-2017 07:03 PM
Hi Peter,
I cannot attach the logs from the yarn as the file is too big and cannot be uploaded here.
But what I have is the below logs from Ambari when it tried to starts the nodemanagers on different host but fails and it is mentioned in the attached file
May be this helps you bit .
Thanks
Anish
Created 09-24-2017 01:08 PM
Here is an extract of a HCC solution
https://nicholasmaillard.wordpress.com/2015/07/20/formatting-hdfs/
The above article describes right steps to format hdfs for HA enabled clusters as below:
"The initial steps are very close
These steps should help you to overcome the issue.
Created 09-24-2017 02:51 PM
Hi Geoffrey,
Thanks for your reply.
Why do you think That I need to format the namenode service even though I see that it is coming up fine and has no problems?
My major concern is nodemanagers which are not starting up on all the hosts of the datanodes.
This is what I need to get resolved rather than the namnode service.
Thanks & Regards,
Anish
Created 09-25-2017 01:25 AM
No, I just need a certain section log from start ResourceManager message to end (startup and listening port)
For example, if ResourceManager is deployed without any issues, then you'll check this log messages.
hadoop-yarn/yarn-resourcemanager-fqdn.log
2017-09-25 10:16:33,491 INFOresourcemanager.ResourceManager (LogAdapter.java:info(45)) - STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting ResourceManager
STARTUP_MSG: user = yarn
STARTUP_MSG: host = yarn.hdp.exem.com/10.10.30.41
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.7.3.2.5.3.0-37
STARTUP_MSG: classpath = /usr/hdp/current/hadoop-client/conf
...........
Successfully created /yarn-leader-election/yarn-cluster in ZK.
........
2017-09-25 10:16:39,630 INFOpb.RpcServerFactoryPBImpl (RpcServerFactoryPBImpl.java:createServer(174)) - Adding protocol org.apache.hadoop.yarn.server.api.ResourceManagerAdministrationProtocolPB to the server
2017-09-25 10:16:39,641 INFOipc.Server (Server.java:run(1045)) - IPC Server Responder: starting
2017-09-25 10:16:39,641 INFOipc.Server (Server.java:run(881)) - IPC Server listener on 8033: starting