Member since
03-01-2016
104
Posts
97
Kudos Received
3
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1532 | 06-03-2018 09:22 PM
 | 26079 | 05-21-2018 10:31 PM
 | 1999 | 10-19-2016 07:13 AM
12-24-2016
02:00 PM
ENVIRONMENT: All Ambari versions prior to 2.4.x

SYMPTOMS: Intermittent loss of heartbeat to cluster nodes, freezes of the ambari-agent service, and intermittent problems with Ambari alerts and service-status updates in the Ambari dashboard.

Ambari-agent logs:

INFO 2016-08-21 19:10:20,080 Heartbeat.py:78 - Building Heartbeat: {responseId = 139566, timestamp = 1471821020080, commandsInProgress = False, componentsMapped = True}
ERROR 2016-08-21 19:10:20,102 HostInfo.py:228 - Checking java processes failed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ambari_agent/HostInfo.py", line 211, in javaProcs
    cmd = open(os.path.join('/proc', pid, 'cmdline'), 'rb').read()
IOError: [Errno 2] No such file or directory: '/proc/24270/cmdline'

Top command output:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP TIME DATA COMMAND
10098 root 20 0 54.4g 53g 4540 S 54.5 14.0 18000:11 224 300,00 54g /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/main.py start --expected-hostname=123.example.com

ROOT CAUSE: A race condition in the Python subprocess module. In some unlucky cases this race leaves Python garbage collection disabled, after which the agent's memory footprint grows unchecked (note the 53 GB resident size in the top output above). This usually happened when running alerts, as many alerts run shell commands and do so from different threads. This is a known issue, reported in AMBARI-17539.

SOLUTION: Upgrade to Ambari 2.4.x.

WORKAROUND: Restart ambari-agent, which fixes the issue temporarily. Log a case with HWX support to get a patch for the bug fix.
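Because the root cause is garbage collection being left disabled, a simple check can confirm (and temporarily repair) the state inside a long-running Python process. This is a minimal sketch, not the actual AMBARI-17539 fix; the function name is illustrative:

```python
import gc

def ensure_gc_enabled():
    """Re-enable the collector if a racy subprocess call left it off.

    Affected subprocess implementations disable gc around fork() and,
    when two threads race, can fail to restore it. This watchdog simply
    checks the flag and turns collection back on, returning True when a
    repair was needed.
    """
    if not gc.isenabled():
        gc.enable()
        return True
    return False
```

A periodic call to such a check from a monitoring thread would mask the leak until the agent can be upgraded, in the same spirit as the restart workaround.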
12-24-2016
01:24 PM
ENVIRONMENT: HDP 2.3.2, Ambari 2.2.0, JDK 1.7.0_67-b01, Kernel: 3.13.0-48-generic
ERRORS: The last few lines in the NodeManager log before it hit SIGSEGV show that a Container Localizer was running for a new container:
2016-10-20 01:29:05,810 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(711)) - Created localizer for container_e14_1475595980406_28807_01_000021
[...]
2016-10-20 01:29:08,308 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://user/tmp/hive/xxx/5b0f04c6-ba2d-47dc-85c2-88179a1db407/hive_2016-10-20_01-28-15_091_3309851709548218363-3928/-mr-10007/df6632b4-ec58-4cdf-8ffb-c81460abc266/reduce.xml(->/hadoop/yarn/local/usercache/xxx/filecache/150663/reduce.xml) transitioned from DOWNLOADING to LOCALIZED
- The exception says:
Current thread (0x00007f2c66cc7000): JavaThread "ContainerLocalizer Downloader" [_thread_in_Java, id=14260, stack(0x00007f2c740a3000,0x00007f2c741a4000)]
siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x00000000801f0ffb
- And the stack trace for '14260' shows:
Stack: [0x00007f2c740a3000,0x00007f2c741a4000], sp=0x00007f2c741a0fc8, free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$ClientOperationHeaderProto.getClientNameBytes()Lcom/google/protobuf/ByteString;+0
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$ClientOperationHeaderProto.getSerializedSize()I+48 J 915 C2 com.google.protobuf.CodedOutputStream.computeMessageSize(ILcom/google/protobuf/MessageLite;)I (10 bytes) @ 0x00007f2cad207530 [0x00007f2cad207500+0x30]
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$OpReadBlockProto.getSerializedSize()I+30 J 975 C2 com.google.protobuf.AbstractMessageLite.writeDelimitedTo(Ljava/io/OutputStream;)V (40 bytes) @ 0x00007f2cad254124 [0x00007f2cad2540e0+0x44]
j org.apache.hadoop.hdfs.protocol.datatransfer.Sender.send(Ljava/io/DataOutputStream;Lorg/apache/hadoop/hdfs/protocol/datatransfer/Op;Lcom/google/protobuf/Message;)V+60
j org.apache.hadoop.hdfs.protocol.datatransfer.Sender.readBlock(Lorg/apache/hadoop/hdfs/protocol/ExtendedBlock;Lorg/apache/hadoop/security/token/Token;Ljava/lang/String;
JJZLorg/apache/hadoop/hdfs/server/datanode/CachingStrategy;)V+49
j org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(Ljava/lang/String;Lorg/apache/hadoop/hdfs/protocol/ExtendedBlock;Lorg/apache/hadoop/security/token/Token;
JJZLjava/lang/String;Lorg/apache/hadoop/hdfs/net/Peer;Lorg/apache/hadoop/hdfs/protocol/DatanodeID;Lorg/apache/hadoop/hdfs/PeerCache;Lorg/apache/hadoop/hdfs/server/datanode/CachingStrategy;)Lorg/apache/hadoop/hdfs/BlockReader;+43
j org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(Lorg/apache/hadoop/hdfs/net/Peer;)Lorg/apache/hadoop/hdfs/BlockReader;+109
j org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp()Lorg/apache/hadoop/hdfs/BlockReader;+78
[...]
ROOT CAUSE: A segmentation fault in a Java process is usually due to a JVM bug. In this case, the user is on an older JDK release (1.7.0_67-b01). Updating to a more recent 1.7 release should be attempted to see whether it resolves the SIGSEGV.
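When triaging a crash like this, the two facts that matter first are the signal and the JRE build, both of which appear in the JVM fatal-error log (hs_err_pid*.log). A hedged sketch of pulling them out with simple patterns (the field formats shown are illustrative; real files vary by JVM release):

```python
import re

def summarize_hs_err(text):
    """Extract the signal name and JRE version from a JVM fatal-error log.

    Matches lines of the form
        siginfo:si_signo=SIGSEGV: si_errno=0, ...
        # JRE version: Java(TM) SE Runtime Environment (7.0_67-b01)
    Returns None for fields that are absent.
    """
    signal = None
    m = re.search(r'si_signo=(\w+)', text)
    if m:
        signal = m.group(1)
    jre = None
    m = re.search(r'JRE version:.*?\(([\d._b-]+)\)', text)
    if m:
        jre = m.group(1)
    return {'signal': signal, 'jre': jre}
```

If the extracted JRE build is older than the latest patch release for that major version, updating the JDK is the first thing to try, as the article recommends.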
12-24-2016
12:56 PM
ENVIRONMENT: HDP 2.5.0 , Ambari 2.4.1
ERRORS: Logs from the Ambari server:
resource_management.core.exceptions.ExecutionFailed: Execution of 'ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ulimit -c unlimited ; /usr/hdp/2.5.3.0-37/hadoop/sbin/hadoop-daemon.sh --config /usr/hdp/2.5.3.0-37/hadoop/conf start namenode -rollingUpgrade started'' returned 1. -bash: line 0: ulimit: core file size: cannot modify limit: Operation not permitted
starting namenode, logging to /var/log/hadoop/hdfs/hadoop-hdfs-namenode-llab90hdpc2m3.out
ROOT CAUSE: Not yet known; reported as an internal bug (BUG-70647).
WORKAROUND: Add the following entries to /etc/security/limits.conf to complete the upgrade:
* soft core unlimited
* hard core unlimited
OR, as the root user, run the following command on the Ambari server host:
ulimit -c unlimited
Please note that this is not a recommended setting for any of the HDP components; it is only a workaround to complete the upgrade, so revert the setting afterwards. If unsure of the execution or implications of this step, please raise a support case with HWX for further assistance.
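Whether the `ulimit -c unlimited` in hadoop-daemon.sh can succeed depends on the hard core-file-size limit the process inherits, which is exactly what the limits.conf entries raise. A small sketch for inspecting the current limits from Python (function name is illustrative):

```python
import resource

def core_dump_limits():
    """Return the (soft, hard) core-file-size limits for this process.

    hadoop-daemon.sh runs 'ulimit -c unlimited'; that only succeeds when
    the inherited hard limit already permits it, which is what the
    limits.conf workaround arranges. Infinite limits are reported as the
    string 'unlimited' to mirror ulimit's output.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    fmt = lambda v: 'unlimited' if v == resource.RLIM_INFINITY else v
    return fmt(soft), fmt(hard)
```

Running this as the `hdfs` user before the upgrade would show whether the "Operation not permitted" error is going to occur.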
12-23-2016
10:48 PM
ENVIRONMENT: HDP 2.4.3, Ambari 2.4.0

SYMPTOMS: Region server logs are as follows:

2016-10-03 15:13:55,611 INFO [main] regionserver.HRegionServer: STOPPED: Unexpected exception during initialization, aborting
2016-10-03 15:13:55,649 ERROR [main] token.AuthenticationTokenSecretManager: Zookeeper initialization failed
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /hbase-secure/tokenauth/keys
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:575)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:554)

Zookeeper logs:

2016-10-04 15:48:45,702 - ERROR [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:SaslServerCallbackHandler@137] - Failed to set name based on Kerberos authentication rules.
org.apache.zookeeper.server.auth.KerberosName$NoMatchingRule: No rules applied to hbase/345.example.net@EXAMPLE.NET
at org.apache.zookeeper.server.auth.KerberosName.getShortName(KerberosName.java:402)
at org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handleAuthorizeCallback(SaslServerCallbackHandler.java:127)
at org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handle(SaslServerCallbackHandler.java:83)
	at com.sun.security.sasl.gsskerb.GssKrb5Server.doHandshake2(GssKrb5Server.java:317)

ACL entries in Zookeeper servers:

[123.example.net:2181(CONNECTED) 0] getAcl /hbase-secure
'world,'anyone
: r
'sasl,'hbase/345.example.net@EXAMPLE.NET
: cdrwa
'sasl,'hbase/345.example.net@EXAMPLE.NET
: cdrwa
ROOT CAUSE: Ideally, ACLs should not be defined with the hostname as part of the principal, as this can cause issues when another node takes over the master role or during a rolling restart of services. In this case it was set that way because of a bug in Ambari (AMBARI-18528), which mangled the translation based on zookeeper.security.auth_to_local in zookeeper-env.sh. Please go through that bug for the required workaround and other details (adding a backslash in front of the dollar sign in the respective rule).

But why was authentication failing despite a kinit using exactly the same principal as defined in the Zookeeper ACL? The answer lies in these settings in zoo.cfg:

kerberos.removeHostFromPrincipal=true
kerberos.removeRealmFromPrincipal=true

These two settings ensure that every principal authenticated to Zookeeper is stripped of its hostname as well as its realm, so that only a short name is used by the Zookeeper server. The tricky part is that this stripping does not apply to the setAcl API, so the stored ACL still carries the full principal and never matches the shortened authenticated name.

SOLUTION: Please note that our regular "rmr" command to delete the HBase znode would fail with "Authentication is not valid" errors, so a few alternatives are needed; one such method is this link. Also try setting the Java system property zookeeper.skipACL=true in zookeeper-env.sh. If this does not work, the existing znode has to be deleted through more forceful methods, such as stopping HBase and deleting the entire Zookeeper data directory; take this step with utmost caution and only if no other service depends on Zookeeper. Once the HBase znodes have been deleted, use the workaround given in AMBARI-18528 to populate correct ACL entries and finally start HBase.
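The mismatch is easier to see with the stripping written out. This is a simplified model of the server-side behaviour under those two zoo.cfg flags, not the actual ZooKeeper code:

```python
def zk_short_name(principal, remove_host=True, remove_realm=True):
    """Shorten a Kerberos principal the way ZooKeeper does when
    kerberos.removeHostFromPrincipal / removeRealmFromPrincipal are true.

    'hbase/host.example.net@EXAMPLE.NET' -> 'hbase'
    """
    name = principal
    if remove_realm and '@' in name:
        name = name.split('@', 1)[0]   # drop the realm
    if remove_host and '/' in name:
        name = name.split('/', 1)[0]   # drop the host component
    return name
```

So the server authenticates the client as the short name `hbase`, while the ACL (written via setAcl, which skips this stripping) still names `hbase/345.example.net@EXAMPLE.NET`; the two never match, hence NoAuth.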
12-23-2016
04:05 PM
1 Kudo
You can use jmxterm. Article here: https://community.hortonworks.com/content/kbentry/61188/enable-jmx-metrics-on-hadoop-using-jmxterm.html
09-12-2017
03:24 PM
@gsharma can you please advise on this issue I am having: https://community.hortonworks.com/questions/136870/balancer-no-block-has-been-moved-for-5-iterations.html
12-23-2016
09:02 AM
1 Kudo
SYMPTOMS: No visible errors in the Resource Manager / Node Manager logs for any resource bottleneck. Logs from the container/task that is not progressing are as follows:

Error: java.io.IOException: com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException: Connection to database server lost;
check server and network status [System error: Socket closed] at
org.apache.sqoop.mapreduce.db.DBRecordReader.close(DBRecordReader.java:173) at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:523) at
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:791) at
org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) at
java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724) at
org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException: Connection to database server lost; check server and network status [System error: Socket closed] at
com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException.createException(SQLNonTransientConnectionException.java:40) at
com.sap.db.jdbc.exceptions.SQLExceptionSapDB.createException(SQLExceptionSapDB.java:252) at
com.sap.db.jdbc.exceptions.SQLExceptionSapDB.createException(SQLExceptionSapDB.java:214) at
com.sap.db.jdbc.exceptions.SQLExceptionSapDB.generateSQLException(SQLExceptionSapDB.java:166) at
com.sap.db.jdbc.exceptions.ConnectionException.createException(ConnectionException.java:22) at
com.sap.db.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:1117) at
com.sap.db.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:877) at
com.sap.db.jdbc.ConnectionSapDB.commitInternal(ConnectionSapDB.java:353) at
com.sap.db.jdbc.ConnectionSapDB.commit(ConnectionSapDB.java:340) at
com.sap.db.jdbc.trace.Connection.commit(Connection.java:126) at
org.apache.sqoop.mapreduce.db.DBRecordReader.close(DBRecordReader.java:169) ... 8 more

Container killed by the ApplicationMaster. Container killed on request. Exit code is 143. Container exited with a non-zero exit code 143.

ROOT CAUSE: The issue appears to be on the SAP HANA side and not at the HDP end. The following URL discusses the same error: https://archive.sap.com/discussions/thread/3675080

NEXT STEPS: Contact the SAP HANA support team for further troubleshooting.
12-23-2016
07:53 AM
SYMPTOM:
All the services in the cluster are down, and restarting the services fails with the following error:

2016-11-17 21:42:18,235 ERROR namenode.NameNode (NameNode.java:main(1712)) - Failed to start namenode.
java.io.IOException: Login failure for nn/lnx21131.examplet.ex.com@EXAMPLE.AD.EX.COM from keytab /etc/security/keytabs/nn.service.keytab: javax.security.auth.login.LoginException: Client not found in Kerberos database (6)
...
Caused by: KrbException: Client not found in Kerberos database (6)
...
Caused by: KrbException: Identifier doesn't match expected value (906)

Regeneration of keytabs using Ambari also failed, as follows:

17 Nov 2016 23:58:59,136 WARN [Server Action Executor Worker 12702] CreatePrincipalsServerAction:233 - Principal, HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM, does not exist, creating new principal
17 Nov 2016 23:58:59,151 ERROR [Server Action Executor Worker 12702] CreatePrincipalsServerAction:284 - Failed to create or update principal, HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM - Can not create principal : HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM
org.apache.ambari.server.serveraction.kerberos.KerberosOperationException: Can not create principal : HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM
Caused by: javax.naming.NameAlreadyBoundException: [LDAP: error code 68 - 00002071: UpdErr: DSID-0305038D, problem 6005 (ENTRY_EXISTS), data 0
]; remaining name '"cn=HTTP/lnx21142.examplet.ex.com,OU=Hadoop,OU=EXAMPLE_Users,DC=examplet,DC=ad,DC=ex,DC=com"'

ROOT CAUSE: Wrong entries in all service accounts (VPN) in AD: the character '/' had been replaced with '_' by a faulty script.

RESOLUTION: Fix the entries in the AD service accounts. In the above case, every '_' was replaced back with '/' in the service accounts in AD.
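The repair is a targeted string fix: only the separator between the service name and the host should be restored, not every underscore in the cn. A hedged sketch (the prefix list and function name are illustrative, not part of the original fix script):

```python
def repair_principal_cn(cn, service_prefixes=('HTTP', 'nn', 'rm', 'hbase')):
    """Undo a scripted '/' -> '_' mangling in an AD cn attribute.

    Restores only the first '_' after a known service prefix, so cn
    values that legitimately contain underscores elsewhere are left
    untouched.
    """
    for prefix in service_prefixes:
        if cn.startswith(prefix + '_'):
            return prefix + '/' + cn[len(prefix) + 1:]
    return cn
```

Applied over the affected OU, `HTTP_lnx21142.examplet.ex.com` becomes `HTTP/lnx21142.examplet.ex.com`, matching the keytab principals again.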
12-22-2016
07:49 PM
PROBLEM: Unable to start the Resource Manager, which fails with the following errors:

STARTUP_MSG: build = git@github.com:hortonworks/hadoop.git -r 9e75108092247d96ce7d70839b6945e9eba2a0b7; compiled by 'jenkins' on 2014-11-04T04:31Z
STARTUP_MSG: java = 1.7.0_67
************************************************************/
2014-11-04 08:41:08,705 INFO resourcemanager.ResourceManager (SignalLogger.java:register(91)) - registered UNIX signal handlers for [TERM, HUP, INT]
2014-11-04 08:41:10,636 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state INITED;
cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229)
Caused by: java.io.IOException: Login failure for rm/ip-172-31-32-22.ec2.internal@EXAMPLE.COM from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user
	at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935)
	at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.doSecureLogin(ResourceManager.java:1109)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:209)
	... 2 more
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
2014-11-04 08:41:10,641 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1077)) - Transitioning to standby state
2014-11-04 08:41:10,642 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1087)) - Transitioned to standby state
2014-11-04 08:41:10,643 FATAL resourcemanager.ResourceManager (ResourceManager.java:main(1233)) - Error starting ResourceManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229)
Caused by: java.io.IOException: Login failure for rm/ip-172-31-32-22.ec2.internal@EXAMPLE.COM from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user
	at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935)
	at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.doSecureLogin(ResourceManager.java:1109)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:209)
	... 2 more
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user

ROOT CAUSE: This issue is caused because the active RM is using the principal of the other (standby) RM, and vice versa. This is reported in bug YARN-2805 and HDP bug BUG-26831; the bugs have now been resolved.

SOLUTION: If you are on HDP 2.2.0, raise a support case with HWX to get a hotfix.

WORKAROUND: Hardcode the principal entry "rm/_HOST@EXAMPLE.COM" in the Yarn configuration in Ambari, replacing the "_HOST" part with the actual hostname of the active and standby resource manager respectively.
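The workaround above does by hand what Hadoop normally does automatically: substitute the local hostname for the `_HOST` placeholder before the keytab login. A simplified sketch of that expansion (modelled on, but not identical to, Hadoop's SecurityUtil behaviour):

```python
import socket

def expand_principal(principal, hostname=None):
    """Expand the _HOST placeholder in a service principal.

    With the YARN-2805 bug, each RM resolved _HOST to the wrong peer's
    hostname; hardcoding the expanded value per host sidesteps the
    faulty resolution.
    """
    if hostname is None:
        hostname = socket.getfqdn().lower()
    return principal.replace('_HOST', hostname)
```

For example, on the active RM the configured value `rm/_HOST@EXAMPLE.COM` must expand to that host's own FQDN so the keytab entry matches.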
12-22-2016
03:28 PM
Consider increasing network capacity to overcome the challenge caused by non-locality of blocks.

Create configuration groups of datanodes exclusively for HBase, disabling the HDFS balancer on this group and allowing only the HBase balancer. Follow this url Host_Config_Groups to create host config groups.

A few temporary workarounds can also be applied if the problem is severe and needs immediate attention:

- Disable the HDFS balancer permanently on the cluster and run it manually on an as-needed basis. (Please open a support case and have the situation discussed before implementing this workaround.)
- If the performance issue needs to be fixed after a run of the HDFS balancer, a major compaction can be manually initiated. For performance gains, major compaction is run during off-peak hours such as weekends. The article Compaction_Best_Practices is a recommended read here.
- Schedule major compaction after the scheduled balancer run, rather than vice versa.

HDFS has introduced the "favored nodes" feature, but the HBase APIs are not yet equipped to choose specific nodes during data writing.

Please note that these are expert-level configurations and procedures; if unsure of their implications, it is always recommended to open a support case with us.

Refer to the following Apache URLs to track the progress of the region-block pinning implementation:
https://issues.apache.org/jira/browse/HBASE-13021
https://issues.apache.org/jira/browse/HDFS-6133