Member since 03-01-2016 · 104 Posts · 97 Kudos Received · 3 Solutions
12-25-2016
10:30 AM
1 Kudo
SYMPTOMS: When both ResourceManagers become active simultaneously, all NodeManagers crash. Errors visible in the RM logs are as follows:
2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService (AbstractService.java:enterState(452)) - Service: Dispatcher entered state STOPPED
2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher thread interrupted
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
While the AsyncDispatcher is in the hung state, the following errors repeat:
2015-06-27 20:08:35,926 INFO [main] event.AsyncDispatcher (AsyncDispatcher.java:serviceStop(140)) - AsyncDispatcher is draining to stop, igonring any new events.
2015-06-27 20:08:36,926 INFO [main] event.AsyncDispatcher (AsyncDispatcher.java:serviceStop(144)) - Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2015-06-27 20:08:37,927 INFO [main] event.AsyncDispatcher (AsyncDispatcher.java:serviceStop(144)) - Waiting for AsyncDispatcher to drain. Thread state is :WAITING
ROOT CAUSE: This is a known issue reported in YARN-3878.
WORKAROUND: Stop one ResourceManager and manually start the other to resume service.
REFERENCE: https://issues.apache.org/jira/browse/YARN-3878
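Before applying the workaround, the HA state of each ResourceManager can be confirmed with yarn rmadmin -getServiceState <rm-id>. A minimal sketch of detecting the split-brain condition from those reported states (the rm1/rm2 IDs and the helper below are hypothetical illustrations, not part of the original post):

```python
def active_rms(states):
    """Given a mapping of RM id -> HA state (as printed by
    'yarn rmadmin -getServiceState <rm-id>'), return the sorted
    list of active RMs. More than one entry indicates the
    split-brain scenario described above: stop one RM manually."""
    return sorted(rm for rm, state in states.items() if state.lower() == "active")

# Healthy HA pair: exactly one active ResourceManager.
assert active_rms({"rm1": "active", "rm2": "standby"}) == ["rm1"]
# Both active -> the YARN-3878 condition.
assert active_rms({"rm1": "active", "rm2": "active"}) == ["rm1", "rm2"]
```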
12-24-2016
10:21 PM
1 Kudo
SYMPTOMS: Although a valid Kerberos ticket is available, we are unable to put files into an HDFS encryption zone. If we restart or fail over the NameNode, we can then put files with the same ticket/credentials without obtaining a new ticket. Below is a demonstration of the issue, where /tmp/user1 is the encryption zone and the user has permission to that zone:
[root@test ~]# su - user
Last login: Thu Oct 13 13:03:24 EDT 2016 on pts/57
-bash-4.2$ id
uid=11516(user) gid=5000(bns) groups=5000(bns),1520(cmtsuser),1800(admin),4534(edgegrp),4535(edgedgrp),4536(k2tstgrp),8242(ftallocctxd),8243(ftallocctxu),15113(hdpadm)
-bash-4.2$ kinit
Password for user@123.EXAMPLE.COM:
-bash-4.2$ klist
Ticket cache: FILE:/tmp/krb5cc_11516
Default principal: user@123.EXAMPLE.COM
Valid starting Expires Service principal
10/14/2016 07:23:51 10/14/2016 17:23:51 krbtgt/123.EXAMPLE.COM@EXAMPLE.COM
renew until 10/21/2016 07:23:48
-bash-4.2$ hadoop fs -put file1 /tmp/user1/file_1
put: java.util.concurrent.ExecutionException: java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
-bash-4.2$
-bash-4.2$ hadoop fs -put file1 /tmp/file_1
-bash-4.2$ hadoop fs -cat /tmp/file_1
diana
-bash-4.2$
ROOT CAUSE: Service Delegation Token (DT) renewal was not working because the customer's code is missing the token renewer class for KMS. After enabling Hadoop KMS, the cluster works normally only until the interval configured in the property hadoop.kms.authentication.delegation-token.renew-interval.sec elapses. The property is not set in the customer's configuration, and the default is 86400 seconds, i.e. 1 day.
SOLUTION: The following options are available:
If the customer plans to upgrade to a newer version (e.g. HDP 2.5), the problem will not occur, as all fixes are included there.
Otherwise, a hotfix containing those fixes can be provided. Please raise a support case for the same.
REFERENCE:
https://issues.apache.org/jira/browse/HADOOP-13155
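For reference, the renew interval discussed above is configured in kms-site.xml. A sketch of pinning it explicitly to its default of one day (the value shown is the documented default, not a recommendation from the original post):

```xml
<property>
  <name>hadoop.kms.authentication.delegation-token.renew-interval.sec</name>
  <value>86400</value>
  <description>KMS delegation token renew interval in seconds; the default is 86400 (1 day).</description>
</property>
```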
12-24-2016
09:40 PM
2 Kudos
SYMPTOMS: When local disk utilization on multiple NodeManagers goes beyond the configured limit, the nodes turn "unhealthy" and are blacklisted for container/task allocation, reducing the effective cluster capacity.
ROOT CAUSE: A burst or rapid rate of submitted jobs with a substantial NM usercache resource-localization footprint can rapidly fill up the NM local temporary filesystem, with negative consequences for stability. The core issue is that the NM continues to localize resources beyond the maximum local cache size (yarn.nodemanager.localizer.cache.target-size-mb, default 10 GB). Since the maximum local cache size is effectively not taken into account when localizing new resources (note that the default cache cleanup interval is 10 minutes, controlled by yarn.nodemanager.localizer.cache.cleanup.interval-ms), this leads to a self-destruction scenario: once filesystem utilization reaches the 90% threshold, the NM automatically de-registers from the RM, effectively taking the NM offline. This issue can take many NMs offline simultaneously and is therefore quite critical for platform stability.
SOLUTION: Keep larger and/or multiple mount points for these local directories. No consensus has been reached yet in the discussion on whether HDFS could be used for these directories.
REFERENCE: https://issues.apache.org/jira/browse/YARN-5140
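The two NodeManager properties mentioned above live in yarn-site.xml; shown here with their documented default values for reference:

```xml
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>10240</value>
  <!-- default: 10 GB target size for the NM local resource cache -->
</property>
<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>
  <!-- default: cache cleanup runs every 10 minutes -->
</property>
```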
12-24-2016
05:23 PM
2 Kudos
ENVIRONMENT: HDP 2.3.4, Ambari 2.2.1
SYMPTOMS: After creating an encryption zone and attempting to move data into this zone, "Authentication Required" errors are reported in kms-audit.log:
2016-11-15 09:06:40,561 UNAUTHENTICATED RemoteHost:W.X.Y.Z
Method:OPTIONS URL:http://hdp02.example.com:9292/kms/v1/keyversion/e1dw_dev_enc_key%400/_eek?eek_op=decrypt&doAs=test ErrorMsg:'Authentication required'
The issue reproduces only in the Ambari view, not via HDFS commands. The following error is reported in the browser:
500 org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
ROOT CAUSE: For webhdfs to work with TDE, Ranger KMS must be configured to allow the hdfs user to access all keys. This is a configuration issue in Ranger KMS: Ambari deploys a default configuration in which the hdfs user is not allowed. This is a known behavior reported in BUG-45012.
<property>
<name>hadoop.kms.blacklist.DECRYPT_EEK</name>
<value>hdfs</value>
<description>
Blacklist for decrypt EncryptedKey
CryptoExtension operations
</description>
</property>
SOLUTION: Upgrade to HDP 2.3.6.
WORKAROUND: Since it is a security issue, please log a case with the HWX support team for suggestions about possible workarounds.
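To gauge how often the failure occurs before engaging support, the kms-audit.log can be scanned for UNAUTHENTICATED records like the one shown above. A minimal, hypothetical sketch (illustrative only, not a supported tool):

```python
def unauthenticated_entries(audit_lines):
    """Return the raw kms-audit.log lines for UNAUTHENTICATED records,
    e.g. the OPTIONS request to the /_eek?eek_op=decrypt endpoint
    shown in the symptoms above."""
    return [line for line in audit_lines if "UNAUTHENTICATED" in line]

sample = [
    "2016-11-15 09:06:40,561 UNAUTHENTICATED RemoteHost:W.X.Y.Z Method:OPTIONS ErrorMsg:'Authentication required'",
    "2016-11-15 09:06:41,002 OK[op=DECRYPT_EEK] accessing key ...",
]
assert len(unauthenticated_entries(sample)) == 1
```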
12-24-2016
04:31 PM
ROOT CAUSE: The YARN UI shows total memory vs. used memory incorrectly when there are reserved resources. The total shown when there are no reserved resources is the correct one. It can also be cross-checked against the sum of memory across all NodeManagers. This has no impact on YARN scheduler logic. The behavior was a bug in the YARN UI and was fixed by https://issues.apache.org/jira/browse/YARN-3432 and https://issues.apache.org/jira/browse/YARN-3243
REFERENCES: https://issues.apache.org/jira/browse/YARN-3432 https://issues.apache.org/jira/browse/YARN-3243
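One way to do the cross-check mentioned above is to sum per-node memory from the ResourceManager REST endpoint /ws/v1/cluster/nodes. A minimal sketch over an already-fetched payload (the node IDs and values below are made up for illustration):

```python
def total_nm_memory_mb(nodes_payload):
    """Sum usedMemoryMB + availMemoryMB over every NodeManager in a
    /ws/v1/cluster/nodes response. With no reserved resources, this
    should match the 'Memory Total' shown in the YARN UI."""
    nodes = nodes_payload["nodes"]["node"]
    return sum(n["usedMemoryMB"] + n["availMemoryMB"] for n in nodes)

payload = {"nodes": {"node": [
    {"id": "nm1.example.com:45454", "usedMemoryMB": 4096, "availMemoryMB": 4096},
    {"id": "nm2.example.com:45454", "usedMemoryMB": 0, "availMemoryMB": 8192},
]}}
assert total_nm_memory_mb(payload) == 16384
```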
07-27-2018
09:41 AM
Here is my solution https://community.hortonworks.com/questions/208928/increase-open-file-limit-of-the-user-to-scale-for.html
05-13-2018
06:46 PM
Resolved. For me it was a problem with one of the JournalNodes.
12-24-2016
02:36 PM
SYMPTOMS: /tmp filling up causes multiple services to stop functioning.
ROOT CAUSE: The issue happens due to internal SmartSense bug ST-2551.
SOLUTION: Upgrade to SmartSense 1.3.1.
WORKAROUND: To work around this issue, manually modify two files related to SmartSense so that the tmp files are no longer generated in the /tmp directory.
1. File : /usr/hdp/share/hst/hst-agent/lib/hst_agent/anonymize.py
Change from :
ANONYMIZATION_JAVA_COMMAND = "{0}" + os.sep + "bin" + os.sep + "java" +\
 " -Dlog.file.name="+ ANONYMIZATION_LOG_FILE_NAME +\
 " -cp {1} {2} {3}"
Change to :
ANONYMIZATION_JAVA_COMMAND = "{0}" + os.sep + "bin" + os.sep + "java" +\
 " -Djava.io.tmpdir=/grid/02/smartsense/hst-agent/data/tmp" +\
 " -Dlog.file.name="+ ANONYMIZATION_LOG_FILE_NAME +\
 " -cp {1} {2} {3}"
Make sure the tmp dir value is the same as the property agent.tmp_dir in hst-agent-conf.
2. File : /usr/sbin/hst-server.py
Change from :
SERVER_START_CMD = "{0}" + os.sep + "bin" + os.sep +\
 "java -server -XX:NewRatio=3 "\
"-XX:+UseConcMarkSweepGC " +\
"-XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 " +\
debug_options +\
" -Dlog.file.name="+ SERVER_LOG_FILE_NAME +" -Xms512m -Xmx2048m -cp {1}" + os.pathsep + "{2}" +\
" com.hortonworks.support.tools.server.SupportToolServer "\
">" + SERVER_OUT_FILE + " 2>&1 &"
Change to :
SERVER_START_CMD = "{0}" + os.sep + "bin" + os.sep +\
 "java -server -XX:NewRatio=3 "\
 "-XX:+UseConcMarkSweepGC " +\
"-XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 " +\
"-Djava.io.tmpdir=/var/lib/smartsense/hst-server/tmp " +\
debug_options +\
" -Dlog.file.name="+ SERVER_LOG_FILE_NAME +" -Xms512m -Xmx2048m -cp {1}" + os.pathsep + "{2}" +\
" com.hortonworks.support.tools.server.SupportToolServer "\
">" + SERVER_OUT_FILE + " 2>&1 &"
Make sure the tmp dir value is the same as the property server.tmp.dir in hst-server-conf.
3. After the above changes, please clean up the existing .pyc files from both of the above directories, and restart the SmartSense server and agents for the changes to take effect.
12-24-2016
02:00 PM
ENVIRONMENT: All Ambari versions prior to 2.4.x
SYMPTOMS: Intermittent loss of heartbeat to cluster nodes, freezes of the ambari-agent service, and intermittent issues with Ambari alerts and service status updates in the Ambari dashboard.
Ambari-agent logs:
INFO 2016-08-21 19:10:20,080 Heartbeat.py:78 - Building Heartbeat: {responseId = 139566, timestamp = 1471821020080, commandsInProgress = False, componentsMapped = True}
ERROR 2016-08-21 19:10:20,102 HostInfo.py:228 - Checking java processes failed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ambari_agent/HostInfo.py", line 211, in javaProcs
    cmd = open(os.path.join('/proc', pid, 'cmdline'), 'rb').read()
IOError: [Errno 2] No such file or directory: '/proc/24270/cmdline'
Top command output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP TIME DATA COMMAND
10098 root 20 0 54.4g 53g 4540 S 54.5 14.0 18000:11 224 300,00 54g /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/main.py start --expected-hostname=123.example.com
ROOT CAUSE: A race condition in the Python subprocess module. Due to this race condition, in some unlucky cases Python garbage collection was disabled. This usually happened when running alerts, as a number of alerts run shell commands and do so in different threads. This is a known issue reported in AMBARI-17539.
SOLUTION: Upgrade to Ambari 2.4.x.
WORKAROUND: Restart ambari-agent, which fixes the issue temporarily. Log a case with HWX support to get a patch for the bug fix.
12-24-2016
01:24 PM
ENVIRONMENT: HDP 2.3.2, Ambari 2.2.0, JDK 1.7.0_67-b01, Kernel: 3.13.0-48-generic
ERRORS: Last few lines in the NM log before it hit SIGSEGV shows that there was Container Localizer running for a new container:
2016-10-20 01:29:05,810 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(711)) - Created localizer for container_e14_1475595980406_28807_01_000021
[...]
2016-10-20 01:29:08,308 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://user/tmp/hive/xxx/5b0f04c6-ba2d-47dc-85c2-88179a1db407/hive_2016-10-20_01-28-15_091_3309851709548218363-3928/-mr-10007/df6632b4-ec58-4cdf-8ffb-c81460abc266/reduce.xml(->/hadoop/yarn/local/usercache/xxx/filecache/150663/reduce.xml) transitioned from DOWNLOADING to LOCALIZED
- The exception says:
Current thread (0x00007f2c66cc7000): JavaThread "ContainerLocalizer Downloader" [_thread_in_Java, id=14260, stack(0x00007f2c740a3000,0x00007f2c741a4000)]
siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x00000000801f0ffb
- And the stack trace for '14260' shows:
Stack: [0x00007f2c740a3000,0x00007f2c741a4000], sp=0x00007f2c741a0fc8, free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$ClientOperationHeaderProto.getClientNameBytes()Lcom/google/protobuf/ByteString;+0
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$ClientOperationHeaderProto.getSerializedSize()I+48 J 915 C2 com.google.protobuf.CodedOutputStream.computeMessageSize(ILcom/google/protobuf/MessageLite;)I (10 bytes) @ 0x00007f2cad207530 [0x00007f2cad207500+0x30]
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$OpReadBlockProto.getSerializedSize()I+30 J 975 C2 com.google.protobuf.AbstractMessageLite.writeDelimitedTo(Ljava/io/OutputStream;)V (40 bytes) @ 0x00007f2cad254124 [0x00007f2cad2540e0+0x44]
j org.apache.hadoop.hdfs.protocol.datatransfer.Sender.send(Ljava/io/DataOutputStream;Lorg/apache/hadoop/hdfs/protocol/datatransfer/Op;Lcom/google/protobuf/Message;)V+60
j org.apache.hadoop.hdfs.protocol.datatransfer.Sender.readBlock(Lorg/apache/hadoop/hdfs/protocol/ExtendedBlock;Lorg/apache/hadoop/security/token/Token;Ljava/lang/String;
JJZLorg/apache/hadoop/hdfs/server/datanode/CachingStrategy;)V+49
j org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(Ljava/lang/String;Lorg/apache/hadoop/hdfs/protocol/ExtendedBlock;Lorg/apache/hadoop/security/token/Token;
JJZLjava/lang/String;Lorg/apache/hadoop/hdfs/net/Peer;Lorg/apache/hadoop/hdfs/protocol/DatanodeID;Lorg/apache/hadoop/hdfs/PeerCache;Lorg/apache/hadoop/hdfs/server/datanode/CachingStrategy;)Lorg/apache/hadoop/hdfs/BlockReader;+43
j org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(Lorg/apache/hadoop/hdfs/net/Peer;)Lorg/apache/hadoop/hdfs/BlockReader;+109
j org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp()Lorg/apache/hadoop/hdfs/BlockReader;+78
[...]
ROOT CAUSE: A segmentation fault in a Java process is usually due to a JVM bug. In this case, the user is on an older JDK version (1.7.0_67-b01). Updating to a more recent 1.7 release should be attempted to see if it resolves the SIGSEGV.