Member since 01-24-2017 · 26 Posts · 0 Kudos Received · 0 Solutions
07-09-2018
10:48 AM
I have a 6-DN cluster, and within a few seconds of starting, every NodeManager goes down. I posted the log at https://community.hortonworks.com/questions/202914/node-manager-is-getting-down-after-few-seconds.html, and now the reducer jobs are failing too.

2018-07-09 06:25:26,262 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-09 06:25:26,616 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1531130193317_0195_02_000001 is : 143
2018-07-09 06:25:26,624 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1531130193317_0197_01_000001 is : 143
2018-07-09 06:25:26,712 WARN nodemanager.NMAuditLogger (NMAuditLogger.java:logFailure(150)) - USER=dr.who	OPERATION=Container Finished - Failed	TARGET=ContainerImpl	RESULT=FAILURE	DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE	APPID=application_1531130193317_0195	CONTAINERID=container_1531130193317_0195_02_000001
2018-07-09 06:25:26,819 WARN nodemanager.NMAuditLogger (NMAuditLogger.java:logFailure(150)) - USER=dr.who	OPERATION=Container Finished - Failed	TARGET=ContainerImpl	RESULT=FAILURE	DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE	APPID=application_1531130193317_0197	CONTAINERID=container_1531130193317_0197_01_000001
2018-07-09 06:25:30,271 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-09 06:25:30,534 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1531130193317_0198_02_000001 is : 143
2018-07-09 06:25:30,600 WARN nodemanager.NMAuditLogger (NMAuditLogger.java:logFailure(150)) - USER=dr.who	OPERATION=Container Finished - Failed	TARGET=ContainerImpl	RESULT=FAILURE	DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE	APPID=application_1531130193317_0198	CONTAINERID=container_1531130193317_0198_02_000001
2018-07-09 06:25:31,258 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-09 06:25:31,422 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1531130193317_0200_02_000001 is : 143
2018-07-09 06:25:31,486 WARN nodemanager.NMAuditLogger (NMAuditLogger.java:logFailure(150)) - USER=dr.who	OPERATION=Container Finished - Failed	TARGET=ContainerImpl	RESULT=FAILURE	DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE	APPID=application_1531130193317_0200	CONTAINERID=container_1531130193317_0200_02_000001
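As a side note on the repeated "Exit code ... is : 143" lines: container exit codes above 128 conventionally mean the process was killed by a signal, where signal number = exit code - 128, so 143 is SIGTERM (128 + 15), i.e. the containers were terminated externally rather than crashing on their own. A minimal sketch of the decoding:

```python
# Decode a container exit code: values above 128 usually mean the
# process died from a signal (exit_code - 128).
import signal

def decode_exit_code(exit_code: int) -> str:
    if exit_code > 128:
        sig = exit_code - 128
        name = signal.Signals(sig).name  # e.g. SIGTERM for 15
        return f"killed by signal {sig} ({name})"
    return f"exited normally with status {exit_code}"

print(decode_exit_code(143))  # -> killed by signal 15 (SIGTERM)
```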
Labels:
- Apache Hadoop
- Apache YARN
07-09-2018
09:32 AM
@Sandeep Nemuri I checked the log again; this time I got this error, along with the exit code 143 errors:

2018-07-09 04:20:14,067 WARN containermanager.ContainerManagerImpl (ContainerManagerImpl.java:handle(1083)) - Event EventType: FINISH_APPLICATION sent to absent application application_1531115940804_0668
2018-07-09 04:20:15,985 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-09 04:20:16,705 WARN localizer.ResourceLocalizationService (ResourceLocalizationService.java:update(1023)) - { hdfs://ip-172-31-17-251.ec2.internal:8020/tmp/hive/root/_tez_session_dir/41b28f9d-f7ae-4652-b9a7-ffed72220a41/.tez/application_1531115940804_0607/tez.session.local-resources.pb, 1531123411553, FILE, null } failed: File does not exist: hdfs://ip-172-31-17-251.ec2.internal:8020/tmp/hive/root/_tez_session_dir/41b28f9d-f7ae-4652-b9a7-ffed72220a41/.tez/application_1531115940804_0607/tez.session.local-resources.pb
2018-07-09 04:20:16,719 WARN nodemanager.NMAuditLogger (NMAuditLogger.java:logFailure(150)) - USER=root	OPERATION=Container Finished - Failed	TARGET=ContainerImpl	RESULT=FAILURE	DESCRIPTION=Container failed with state: LOCALIZATION_FAILED	APPID=application_1531115940804_0607	CONTAINERID=container_1531115940804_0607_02_000001
2018-07-09 04:20:16,719 WARN ipc.Client (Client.java:call(1446)) - interrupted waiting to send rpc request to server
2018-07-09 04:20:52,186 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-09 04:25:34,693 WARN containermanager.AuxServices (AuxServices.java:serviceInit(130)) - The Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class class org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'. Because these are not the same tools trying to send ServiceData and read Service Meta Data may have issues unless the refer to the name in the config.
2018-07-09 04:25:34,761 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:serviceInit(154)) - NodeManager configured with 61.4 G physical memory allocated to containers, which is more than 80% of the total physical memory available (62.9 G). Thrashing might happen.
2018-07-09 04:25:35,592 WARN containermanager.ContainerManagerImpl (ContainerManagerImpl.java:handle(1067)) - Event EventType: KILL_CONTAINER sent to absent container container_1531115940804_0760_01_000001
2018-07-09 04:25:35,592 WARN containermanager.ContainerManagerImpl (ContainerManagerImpl.java:handle(1083)) - Event EventType: FINISH_APPLICATION sent to absent application application_1531115940804_0760

I have 6 DNs (62 GB RAM, 16 vcores each).

mapred-site.xml:
mapreduce.map.java.opts -Xmx6656m
mapreduce.map.memory.mb 10000
mapreduce.reduce.java.opts -Xmx12800m
mapreduce.reduce.memory.mb 16000

yarn-site.xml:
yarn.nodemanager.resource.memory-mb 62900
yarn.scheduler.minimum-allocation-mb 6656
yarn.scheduler.maximum-allocation-mb 62900
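The "Thrashing might happen" warning above is just these two numbers compared: the NodeManager warns when the memory allocated to containers exceeds 80% of the physical RAM it detects. A rough reconstruction of that arithmetic (my own sketch, assuming the 62.9 G total the NM reports):

```python
# Rough reconstruction of the NodeManager ContainersMonitorImpl check:
# it warns when container memory exceeds 80% of total physical memory.
total_physical_mb = 62.9 * 1024          # 62.9 G reported by the NM
configured_mb = 62900                     # yarn.nodemanager.resource.memory-mb
threshold_mb = 0.8 * total_physical_mb    # the 80% guideline

print(configured_mb > threshold_mb)       # True -> the warning fires
print(round(threshold_mb))                # staying at or below this avoids it
```

So with 62 GB nodes, a value around 51000-52000 MB (or lower, to leave room for the OS and DataNode daemons) would keep the NodeManager under the 80% guideline.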
07-06-2018
06:20 AM
I have a cluster of 6 data nodes, and on starting the NodeManager it goes down on every data node. I checked the logs at /var/log/hadoop-yarn/yarn and there is no error message; this is the warning message I got:

2018-07-06 01:53:37,256 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-06 01:53:39,261 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-06 01:53:44,913 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1530855405184_0121_02_000001 is : 143
2018-07-06 01:53:44,933 WARN nodemanager.NMAuditLogger (NMAuditLogger.java:logFailure(150)) - USER=dr.who	OPERATION=Container Finished - Failed	TARGET=ContainerImpl	RESULT=FAILURE	DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE	APPID=application_1530855405184_0121	CONTAINERID=container_1530855405184_0121_02_000001
2018-07-06 01:56:11,578 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
Labels:
- Apache YARN
09-17-2017
05:42 PM
Hi, guys. I don't know why disk usage differs between du and df. 'hdfs dfs -du -h -s /' gives around 2 TB in total, while 'hadoop fs -df' gives:

Filesystem       Size    Used  Available  Use%
hdfs://ip:8020   15.8 T  12.6 T  2.3 T    80%

and 'sudo -u hdfs hdfs fsck /' gives me this:

Total size: 2158294971710 B (Total open files size: 341851 B)
Total dirs: 627169
Total files: 59276
Total symlinks: 0 (Files currently being written: 13)
Total blocks (validated): 23879 (avg. block size 90384646 B) (Total open file blocks (not validated): 13)
Minimally replicated blocks: 23879 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 8 (0.03350224 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0228233
Corrupt blocks: 0
Missing replicas: 32 (0.066204615 %)
Number of data-nodes: 6
Number of racks: 1

Let me know why, and how I can reclaim my mysteriously used space. Thanks.
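For what it's worth, part of the gap is just replication: du reports logical file size, while df reports raw usage across all replicas, plus anything non-HDFS stored on the same partitions. A back-of-the-envelope using the fsck numbers above:

```python
# du shows logical bytes; raw DFS usage is roughly logical * replication.
# Anything beyond that in df is not live HDFS block data.
logical_bytes = 2158294971710            # "Total size" from fsck
avg_replication = 2.0228233              # "Average block replication" from fsck

raw_tb = logical_bytes * avg_replication / 1024**4
print(round(raw_tb, 2))                  # -> 3.97, expected raw DFS usage in TB
# df reports 12.6 T used, so the remaining ~8.6 T is something other than
# live HDFS blocks (non-DFS files on the data disks, trash, stale block
# files left behind by a previous cluster, etc.).
```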
Labels:
- Apache Hadoop
01-25-2017
07:30 AM
Thanks @Jay SenSharma, it's working now. But when I run a query in Hue it doesn't show any output.
01-25-2017
07:03 AM
When I start Hue I get this issue:

hadoop.hdfs_clusters.default.webhdfs_url — Current value: http://namenode:50070/webhdfs/v1
Filesystem root '/' should be owned by 'hdfs'
Labels:
- Cloudera Hue
01-24-2017
04:12 PM
@Jay SenSharma Thanks, it's working now.
01-24-2017
10:52 AM
@Jay SenSharma now I am getting this error in datanode.log:

2017-01-24 03:39:19,891 INFO common.Storage (Storage.java:tryLock(715)) - Lock on /mnt/disk1/hadoop/hdfs/data/in_use.lock acquired by nodename 1491@datanode.ec2.internal
2017-01-24 03:39:19,902 INFO common.Storage (Storage.java:tryLock(715)) - Lock on /mnt/disk2/hadoop/hdfs/data/in_use.lock acquired by nodename 1491@datanode.ec2.internal
2017-01-24 03:39:19,903 FATAL datanode.DataNode (BPServiceActor.java:run(840)) - Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to namenode.ec2.internal/ namenode:8020. Exiting. java.io.IOException: Incompatible clusterIDs in /mnt/disk1/hadoop/hdfs/data: namenode clusterID = CID-297a140f-7cd6-4c73-afc8-bd0a7d01c0ee; datanode clusterID = CID-7591e6bd-ce9b-4b14-910c-c9603892a0f1
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:646)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:320)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:403)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:422)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1311)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1276)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:314)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:220)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:828)
at java.lang.Thread.run(Thread.java:745)
2017-01-24 03:39:19,904 WARN datanode.DataNode (BPServiceActor.java:run(861)) - Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to ip-172-31-17-251.ec2.internal/172.31.17.251:8020
2017-01-24 03:39:20,005 INFO datanode.DataNode (BlockPoolManager.java:remove(103)) - Removed Block pool <registering> (Datanode Uuid unassigned)
2017-01-24 03:39:22,005 WARN datanode.DataNode (DataNode.java:secureMain(2392)) - Exiting Datanode
2017-01-24 03:39:22,007 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 0
2017-01-24 03:39:22,008 INFO datanode.DataNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at datanode.ec2.internal/datanode
************************************************************/
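"Incompatible clusterIDs" usually means the NameNode was reformatted while the DataNodes kept their old storage directories. One common fix is to point each DataNode data dir's VERSION file at the NameNode's clusterID and restart the DataNode. A sketch under assumptions — the temp-file demo below is illustrative (the real file lives under each dfs.datanode.data.dir, e.g. /mnt/disk1/hadoop/hdfs/data/current/VERSION), and the IDs are the ones from the log above:

```shell
#!/bin/sh
# Sketch: align a DataNode's clusterID with the NameNode's.
# Demonstrated on a temp copy of VERSION so the steps are verifiable;
# on a real node, stop the DataNode first and edit every data dir.
NN_CID="CID-297a140f-7cd6-4c73-afc8-bd0a7d01c0ee"   # NameNode clusterID from the log

WORK=$(mktemp -d)
cat > "$WORK/VERSION" <<'EOF'
storageID=DS-demo
clusterID=CID-7591e6bd-ce9b-4b14-910c-c9603892a0f1
storageType=DATA_NODE
EOF

# Replace the stale DataNode clusterID in place.
sed -i "s/^clusterID=.*/clusterID=$NN_CID/" "$WORK/VERSION"
grep '^clusterID=' "$WORK/VERSION"
```

Alternatively, if the DataNode holds no data worth keeping, wiping its data directories and letting it re-register achieves the same result.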
01-24-2017
10:40 AM
Yeah, that was right:

total 4
drwxrwxrwt. 2 hdfs hadoop 4096 Nov 19 2014 cache
srw-rw-rw-. 1 hdfs hadoop 0 Jan 24 03:39 dn_socket
Actually, none of my datanode hosts are working. Is that a memory issue?
01-24-2017
10:11 AM
@Jay SenSharma so the error in the log file is about permissions?