Member since: 03-01-2016
Posts: 104
Kudos Received: 97
Solutions: 3

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1667 | 06-03-2018 09:22 PM
 | 28170 | 05-21-2018 10:31 PM
 | 2175 | 10-19-2016 07:13 AM
12-24-2016
02:36 PM
SYMPTOMS: /tmp filling up causes multiple services to stop functioning.

ROOT CAUSE: The issue is caused by an internal SmartSense bug (ST-2551).

SOLUTION: Upgrade to SmartSense 1.3.1.

WORKAROUND: To work around the issue, manually modify two SmartSense files so that the temporary files are no longer generated in the /tmp directory.

1. File: /usr/hdp/share/hst/hst-agent/lib/hst_agent/anonymize.py

Change from:

ANONYMIZATION_JAVA_COMMAND = "{0}" + os.sep + "bin" + os.sep + "java" +\
    " -Dlog.file.name="+ ANONYMIZATION_LOG_FILE_NAME +\
    " -cp {1} {2} {3}"

Change to:

ANONYMIZATION_JAVA_COMMAND = "{0}" + os.sep + "bin" + os.sep + "java" +\
    " -Djava.io.tmpdir=/grid/02/smartsense/hst-agent/data/tmp" +\
    " -Dlog.file.name="+ ANONYMIZATION_LOG_FILE_NAME +\
    " -cp {1} {2} {3}"

Make sure the tmp dir value is the same as the agent.tmp_dir property in hst-agent-conf.

2. File: /usr/sbin/hst-server.py

Change from:

SERVER_START_CMD = "{0}" + os.sep + "bin" + os.sep +\
    "java -server -XX:NewRatio=3 "\
    "-XX:+UseConcMarkSweepGC " +\
    "-XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 " +\
    debug_options +\
    " -Dlog.file.name="+ SERVER_LOG_FILE_NAME +" -Xms512m -Xmx2048m -cp {1}" + os.pathsep + "{2}" +\
    " com.hortonworks.support.tools.server.SupportToolServer "\
    ">" + SERVER_OUT_FILE + " 2>&1 &"

Change to:

SERVER_START_CMD = "{0}" + os.sep + "bin" + os.sep +\
    "java -server -XX:NewRatio=3 "\
    "-XX:+UseConcMarkSweepGC " +\
    "-XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 " +\
    "-Djava.io.tmpdir=/var/lib/smartsense/hst-server/tmp " +\
    debug_options +\
    " -Dlog.file.name="+ SERVER_LOG_FILE_NAME +" -Xms512m -Xmx2048m -cp {1}" + os.pathsep + "{2}" +\
    " com.hortonworks.support.tools.server.SupportToolServer "\
    ">" + SERVER_OUT_FILE + " 2>&1 &"

Make sure the tmp dir value is the same as the server.tmp.dir property in hst-server-conf.

3. After the above changes, please clean up the existing .pyc files in both of the above directories, then restart the SmartSense server and agents for the changes to take effect.
12-24-2016
02:00 PM
ENVIRONMENT: All Ambari versions prior to 2.4.x

SYMPTOMS: Intermittent loss of heartbeat to cluster nodes, the ambari-agent service freezing, and intermittent issues with Ambari alerts and service-status updates in the Ambari dashboard.

Ambari-agent logs:

INFO 2016-08-21 19:10:20,080 Heartbeat.py:78 - Building Heartbeat: {responseId = 139566, timestamp = 1471821020080, commandsInProgress = False, componentsMapped = True}
ERROR 2016-08-21 19:10:20,102 HostInfo.py:228 - Checking java processes failed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ambari_agent/HostInfo.py", line 211, in javaProcs
    cmd = open(os.path.join('/proc', pid, 'cmdline'), 'rb').read()
IOError: [Errno 2] No such file or directory: '/proc/24270/cmdline'

Top command output:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP TIME DATA COMMAND
10098 root 20 0 54.4g 53g 4540 S 54.5 14.0 18000:11 224 300,00 54g /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/main.py start --expected-hostname=123.example.com

ROOT CAUSE: A race condition in the Python subprocess module. In unlucky cases this race left Python garbage collection disabled. It usually happened while running alerts, since many alerts run shell commands from different threads. This is a known issue reported in AMBARI-17539.

SOLUTION: Upgrade to Ambari 2.4.x.

WORKAROUND: Restart ambari-agent, which fixes the issue temporarily. Log a case with HWX support to get a patch for the bug fix.
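For illustration only (this is not the actual Ambari patch): a minimal sketch of the failure mode. If the bug has left the garbage collector disabled in a long-lived Python process such as ambari-agent, reference cycles are never reclaimed and resident memory keeps growing, as in the 53g RES value in the top output above.

```python
import gc

# Hypothetical diagnostic run inside a long-lived Python process.
if not gc.isenabled():
    print("garbage collection is disabled - cyclic garbage will accumulate")
    gc.enable()            # restore normal behaviour
    freed = gc.collect()   # collect everything that piled up in the meantime
    print("collected %d unreachable objects" % freed)
else:
    print("garbage collection is enabled")
```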
12-24-2016
01:24 PM
ENVIRONMENT: HDP 2.3.2, Ambari 2.2.0, JDK 1.7.0_67-b01, Kernel: 3.13.0-48-generic
ERRORS: The last few lines in the NM log before it hit SIGSEGV show that a Container Localizer was running for a new container:
2016-10-20 01:29:05,810 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(711)) - Created localizer for container_e14_1475595980406_28807_01_000021
[...]
2016-10-20 01:29:08,308 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://user/tmp/hive/xxx/5b0f04c6-ba2d-47dc-85c2-88179a1db407/hive_2016-10-20_01-28-15_091_3309851709548218363-3928/-mr-10007/df6632b4-ec58-4cdf-8ffb-c81460abc266/reduce.xml(->/hadoop/yarn/local/usercache/xxx/filecache/150663/reduce.xml) transitioned from DOWNLOADING to LOCALIZED
- The exception says:
Current thread (0x00007f2c66cc7000): JavaThread "ContainerLocalizer Downloader" [_thread_in_Java, id=14260, stack(0x00007f2c740a3000,0x00007f2c741a4000)]
siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x00000000801f0ffb
- And the stack trace for '14260' shows:
Stack: [0x00007f2c740a3000,0x00007f2c741a4000], sp=0x00007f2c741a0fc8, free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$ClientOperationHeaderProto.getClientNameBytes()Lcom/google/protobuf/ByteString;+0
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$ClientOperationHeaderProto.getSerializedSize()I+48
J 915 C2 com.google.protobuf.CodedOutputStream.computeMessageSize(ILcom/google/protobuf/MessageLite;)I (10 bytes) @ 0x00007f2cad207530 [0x00007f2cad207500+0x30]
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$OpReadBlockProto.getSerializedSize()I+30
J 975 C2 com.google.protobuf.AbstractMessageLite.writeDelimitedTo(Ljava/io/OutputStream;)V (40 bytes) @ 0x00007f2cad254124 [0x00007f2cad2540e0+0x44]
j org.apache.hadoop.hdfs.protocol.datatransfer.Sender.send(Ljava/io/DataOutputStream;Lorg/apache/hadoop/hdfs/protocol/datatransfer/Op;Lcom/google/protobuf/Message;)V+60
j org.apache.hadoop.hdfs.protocol.datatransfer.Sender.readBlock(Lorg/apache/hadoop/hdfs/protocol/ExtendedBlock;Lorg/apache/hadoop/security/token/Token;Ljava/lang/String;
JJZLorg/apache/hadoop/hdfs/server/datanode/CachingStrategy;)V+49
j org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(Ljava/lang/String;Lorg/apache/hadoop/hdfs/protocol/ExtendedBlock;Lorg/apache/hadoop/security/token/Token;
JJZLjava/lang/String;Lorg/apache/hadoop/hdfs/net/Peer;Lorg/apache/hadoop/hdfs/protocol/DatanodeID;Lorg/apache/hadoop/hdfs/PeerCache;Lorg/apache/hadoop/hdfs/server/datanode/CachingStrategy;)Lorg/apache/hadoop/hdfs/BlockReader;+43
j org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(Lorg/apache/hadoop/hdfs/net/Peer;)Lorg/apache/hadoop/hdfs/BlockReader;+109
j org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp()Lorg/apache/hadoop/hdfs/BlockReader;+78
[...]
ROOT CAUSE: A segmentation fault in a Java process is usually due to a JVM bug. In this case, the user is on an older JDK version (1.7.0_67-b01). Updating to a more recent 1.7 release should be attempted to see if it resolves the SIGSEGV.
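Since the recommendation is simply to move off the affected JDK build, a quick check of which JVM the host is actually running can help (a sketch; it checks the default java on the PATH, which may differ from the JDK configured for the NodeManager):

```python
import subprocess

# 'java -version' prints its banner to stderr, so capture stderr as well.
out = subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT)
print(out.decode())  # e.g. java version "1.7.0_67" on an affected host
```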
12-24-2016
12:56 PM
ENVIRONMENT: HDP 2.5.0, Ambari 2.4.1
ERRORS: Logs from the Ambari server:
resource_management.core.exceptions.ExecutionFailed: Execution of 'ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ulimit -c unlimited ; /usr/hdp/2.5.3.0-37/hadoop/sbin/hadoop-daemon.sh --config /usr/hdp/2.5.3.0-37/hadoop/conf start namenode -rollingUpgrade started'' returned 1. -bash: line 0: ulimit: core file size: cannot modify limit: Operation not permitted
starting namenode, logging to /var/log/hadoop/hdfs/hadoop-hdfs-namenode-llab90hdpc2m3.out
ROOT CAUSE: Not yet known; reported as an internal bug (BUG-70647).
WORKAROUND: Add the following entries to /etc/security/limits.conf to complete the upgrade:
soft core unlimited
hard core unlimited
Or, as the root user, run the following command on the Ambari server host:
ulimit -c unlimited
Please note that this is not a recommended setting for any of the HDP components; it is only a workaround to complete the upgrade, so please revert the setting afterwards. If you are unsure about how to execute this step or about its implications, please raise a support case with HWX for further assistance.
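For context, 'ulimit -c unlimited' lifts the per-process core-dump size limit that the start command tries to set. A small Python sketch (illustrative only, not part of the workaround) showing how to inspect and raise that limit for the current process:

```python
import resource

# Current soft/hard limits for core file size (what `ulimit -c` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core file size: soft=%s hard=%s" % (soft, hard))

# Equivalent of `ulimit -c unlimited`; raising the hard limit requires root,
# which is why the workaround above must be applied as the root user.
resource.setrlimit(resource.RLIMIT_CORE,
                   (resource.RLIM_INFINITY, resource.RLIM_INFINITY))
```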
12-23-2016
10:48 PM
ENVIRONMENT: HDP 2.4.3, Ambari 2.4.0

SYMPTOMS: Region server logs are as follows:

2016-10-03 15:13:55,611 INFO [main] regionserver.HRegionServer: STOPPED: Unexpected exception during initialization, aborting
2016-10-03 15:13:55,649 ERROR [main] token.AuthenticationTokenSecretManager: Zookeeper initialization failed
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /hbase-secure/tokenauth/keys
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:575)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:554)

Zookeeper logs:

2016-10-04 15:48:45,702 - ERROR [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:SaslServerCallbackHandler@137] - Failed to set name based on Kerberos authentication rules.
org.apache.zookeeper.server.auth.KerberosName$NoMatchingRule: No rules applied to hbase/345.example.net@EXAMPLE.NET
    at org.apache.zookeeper.server.auth.KerberosName.getShortName(KerberosName.java:402)
    at org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handleAuthorizeCallback(SaslServerCallbackHandler.java:127)
    at org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handle(SaslServerCallbackHandler.java:83)
    at com.sun.security.sasl.gsskerb.GssKrb5Server.doHandshake2(GssKrb5Server.java:317)

ACL entries on the Zookeeper servers:

123.example.net:2181(CONNECTED) 0] getAcl /hbase-secure
'world,'anyone
: r
'sasl,'hbase/345.example.net@EXAMPLE.NET
: cdrwa
'sasl,'hbase/345.example.net@EXAMPLE.NET
: cdrwa

ROOT CAUSE: Ideally, ACLs should not be defined with the hostname as part of the principal, as this can cause issues when another node takes over the master role or during a rolling restart of services. In this case it was set that way because of a bug in Ambari (AMBARI-18528) which mangled the translation based on zookeeper.security.auth_to_local in zookeeper-env.sh. Please go through that bug for the required workaround and other details (adding a backslash in front of the dollar sign in the respective rule).

But why was authentication failing despite a kinit using exactly the same principal as defined in the Zookeeper ACL? The answer lies in these settings in zoo.cfg:

kerberos.removeHostFromPrincipal=true
kerberos.removeRealmFromPrincipal=true

These two settings ensure that every principal authenticated to Zookeeper is stripped of its hostname and realm, so only the short name is used by the Zookeeper server. The tricky part is that this stripping does not apply to the setAcl API.

SOLUTION: Please note that the regular "rmr" command to delete the HBase znode will fail with "Authentication is not valid" errors, so an alternative is needed; one such method is this link. Also try setting the Java system property zookeeper.skipACL=true in zookeeper-env.sh. If that does not work, the existing znode has to be removed by more forceful means, such as stopping HBase and deleting the entire Zookeeper data directory; take this step with utmost caution and only if no other service depends on Zookeeper. Once the HBase znodes have been deleted, use the workaround given in AMBARI-18528 to populate the correct ACL entries, and finally start HBase.
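To illustrate the mismatch (a standalone sketch, not Zookeeper's actual implementation): with removeHostFromPrincipal and removeRealmFromPrincipal enabled, the authenticated identity is reduced to a short name, while the ACL written by setAcl still carries the full principal, so the two never match and every request gets NoAuth.

```python
def strip_principal(principal, remove_host=True, remove_realm=True):
    """Mimic kerberos.removeHostFromPrincipal / kerberos.removeRealmFromPrincipal."""
    name = principal
    if remove_realm and "@" in name:
        name = name.split("@", 1)[0]   # drop @EXAMPLE.NET
    if remove_host and "/" in name:
        name = name.split("/", 1)[0]   # drop /345.example.net
    return name

acl_principal = "hbase/345.example.net@EXAMPLE.NET"    # what setAcl stored
authenticated = strip_principal(acl_principal)         # what the server compares

print(authenticated)                    # hbase
print(authenticated == acl_principal)   # False -> NoAuth despite a correct kinit
```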
12-23-2016
03:59 PM
We could get these stats from Grafana, but is there any way they could be exported in text/XML or any other format?
12-23-2016
11:24 AM
PROBLEM: The balancer fails within a few minutes without any block movement.
SYMPTOMS: The balancer exits with the following messages:
16/11/22 07:08:29 DEBUG ipc.Client: IPC Client (280134559) connection to xxx.corp.example.com/0.0.0.0:8020 from hdfs-EST@HADOOP.XXX.CORP.EXAMPLE.COM got value #1193
16/11/22 07:08:29 DEBUG ipc.ProtobufRpcEngine: Call: getBlocks took 2486ms
No block has been moved for 5 iterations. Exiting...
Nov 22, 2016 7:08:29 AM   4   0 B   35.86 TB   200 GB
ROOT CAUSE: The rack distribution looked like this:
/default-rack : 91
/Example1 : 18
/Example2 : 2
The 100%-utilized nodes we were trying to balance in order to free up space were the 20 nodes registered in racks /Example1 and /Example2. Based on the following rack-awareness rules the balancer applies to block placement (rule 3 in this case), it was not possible to move even a single block without compromising fault tolerance:
/**
 * Decide if the block is a good candidate to be moved from source to target.
 * A block is a good candidate if
 * 1. the block is not in the process of being moved/has not been moved;
 * 2. the block does not have a replica on the target;
 * 3. doing the move does not reduce the number of racks that the block has
 */
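A simplified illustration of rule 3 (a sketch, not the actual Balancer code): when a block already has one replica in each of the only racks with capacity, moving a replica into a rack that already holds one would reduce the block's rack count, so the balancer rejects the move.

```python
def reduces_rack_count(replica_racks, source_rack, target_rack):
    """Return True if moving one replica from source_rack to target_rack
    would leave the block spread over fewer distinct racks (rule 3)."""
    after_move = list(replica_racks)
    after_move.remove(source_rack)
    after_move.append(target_rack)
    return len(set(after_move)) < len(set(replica_racks))

# One replica per rack, as in the cluster described above.
racks = ["/default-rack", "/Example1", "/Example2"]

# Moving the /Example2 replica anywhere reduces the rack count from 3 to 2,
# so the block is never a good candidate and the balancer gives up.
print(reduces_rack_count(racks, "/Example2", "/default-rack"))  # True
print(reduces_rack_count(racks, "/Example2", "/Example1"))      # True
```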
SOLUTION: Distribute nodes evenly across all racks. If this is not possible, add additional storage to the affected nodes, or add new datanodes to the respective racks.
12-23-2016
09:02 AM
1 Kudo
SYMPTOMS: No visible errors in the Resource Manager / Node Manager logs indicating any resource bottleneck. Logs from the container/task that is not progressing are as follows:

Error: java.io.IOException: com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException: Connection to database server lost; check server and network status [System error: Socket closed]
    at org.apache.sqoop.mapreduce.db.DBRecordReader.close(DBRecordReader.java:173)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:523)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:791)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException: Connection to database server lost; check server and network status [System error: Socket closed]
    at com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException.createException(SQLNonTransientConnectionException.java:40)
    at com.sap.db.jdbc.exceptions.SQLExceptionSapDB.createException(SQLExceptionSapDB.java:252)
    at com.sap.db.jdbc.exceptions.SQLExceptionSapDB.createException(SQLExceptionSapDB.java:214)
    at com.sap.db.jdbc.exceptions.SQLExceptionSapDB.generateSQLException(SQLExceptionSapDB.java:166)
    at com.sap.db.jdbc.exceptions.ConnectionException.createException(ConnectionException.java:22)
    at com.sap.db.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:1117)
    at com.sap.db.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:877)
    at com.sap.db.jdbc.ConnectionSapDB.commitInternal(ConnectionSapDB.java:353)
    at com.sap.db.jdbc.ConnectionSapDB.commit(ConnectionSapDB.java:340)
    at com.sap.db.jdbc.trace.Connection.commit(Connection.java:126)
    at org.apache.sqoop.mapreduce.db.DBRecordReader.close(DBRecordReader.java:169)
    ... 8 more

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

ROOT CAUSE: The issue appears to be on the SAP HANA side, not the HDP side. The following URL discusses the same error: https://archive.sap.com/discussions/thread/3675080

NEXT STEPS: Contact the SAP HANA support team for further troubleshooting.
12-23-2016
07:53 AM
SYMPTOM:
All the services in the cluster are down, and restarting the services fails with the following error:

2016-11-17 21:42:18,235 ERROR namenode.NameNode (NameNode.java:main(1712)) - Failed to start namenode.
java.io.IOException: Login failure for nn/lnx21131.examplet.ex.com@EXAMPLE.AD.EX.COM from keytab /etc/security/keytabs/nn.service.keytab: javax.security.auth.login.LoginException: Client not found in Kerberos database (6)
...
Caused by: KrbException: Client not found in Kerberos database (6)
...
Caused by: KrbException: Identifier doesn't match expected value (906)

Regeneration of keytabs using Ambari also failed, as follows:

17 Nov 2016 23:58:59,136 WARN [Server Action Executor Worker 12702] CreatePrincipalsServerAction:233 - Principal, HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM, does not exist, creating new principal
17 Nov 2016 23:58:59,151 ERROR [Server Action Executor Worker 12702] CreatePrincipalsServerAction:284 - Failed to create or update principal, HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM - Can not create principal : HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM
org.apache.ambari.server.serveraction.kerberos.KerberosOperationException: Can not create principal : HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM
Caused by: javax.naming.NameAlreadyBoundException: [LDAP: error code 68 - 00002071: UpdErr: DSID-0305038D, problem 6005 (ENTRY_EXISTS), data 0
]; remaining name '"cn=HTTP/lnx21142.examplet.ex.com,OU=Hadoop,OU=EXAMPLE_Users,DC=examplet,DC=ad,DC=ex,DC=com"'

ROOT CAUSE: Wrong entries in all the service accounts (VPN) in AD: the character '/' had been replaced with '_' by a faulty script.

RESOLUTION: Fix the affected service accounts in AD. In this case, every '_' was replaced back with '/' in the service accounts in AD.
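A quick sanity check for this kind of failure (a sketch that shells out to the standard Kerberos client tools; the keytab path and principal are the ones from the log above): list the principals stored in the keytab and try to authenticate with them. A "Client not found in Kerberos database" error at this point confirms the account is wrong or missing on the AD side rather than the keytab being stale.

```python
import subprocess

keytab = "/etc/security/keytabs/nn.service.keytab"
principal = "nn/lnx21131.examplet.ex.com@EXAMPLE.AD.EX.COM"

# Show which principals the keytab actually contains.
print(subprocess.check_output(["klist", "-kt", keytab]).decode())

# Try to obtain a ticket with the keytab; a non-zero exit code with
# "Client not found in Kerberos database" points at the AD account itself.
rc = subprocess.call(["kinit", "-kt", keytab, principal])
print("kinit exit code: %d" % rc)
```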
12-22-2016
07:49 PM
PROBLEM: Unable to start the Resource Manager, which fails with the errors below:

STARTUP_MSG: build = git@github.com:hortonworks/hadoop.git -r 9e75108092247d96ce7d70839b6945e9eba2a0b7; compiled by 'jenkins' on 2014-11-04T04:31Z
STARTUP_MSG: java = 1.7.0_67
************************************************************/
2014-11-04 08:41:08,705 INFO resourcemanager.ResourceManager (SignalLogger.java:register(91)) - registered UNIX signal handlers for [TERM, HUP, INT]
2014-11-04 08:41:10,636 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229)
Caused by: java.io.IOException: Login failure for rm/ip-172-31-32-22.ec2.internal@EXAMPLE.COM from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935)
    at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.doSecureLogin(ResourceManager.java:1109)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:209)
    ... 2 more
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
2014-11-04 08:41:10,641 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1077)) - Transitioning to standby state
2014-11-04 08:41:10,642 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1087)) - Transitioned to standby state
2014-11-04 08:41:10,643 FATAL resourcemanager.ResourceManager (ResourceManager.java:main(1233)) - Error starting ResourceManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229)
Caused by: java.io.IOException: Login failure for rm/ip-172-31-32-22.ec2.internal@EXAMPLE.COM from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935)
    at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.doSecureLogin(ResourceManager.java:1109)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:209)
    ... 2 more
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user

ROOT CAUSE: This issue occurs because the active RM is using the user principal of the standby RM, and vice versa. This is reported in bug YARN-2805 (HDP bug BUG-26831); both bugs have since been resolved.

SOLUTION: If you are on HDP 2.2.0, raise a support case with HWX to get a hotfix.

WORKAROUND: Hardcode the principal entry "rm/_HOST@EXAMPLE.COM" in the YARN configuration in Ambari by replacing the "_HOST" part with the actual hostname of the active and standby resource manager, respectively.
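To illustrate what the workaround changes (a sketch of the usual _HOST substitution convention, not Hadoop's SecurityUtil itself): at login time each daemon replaces _HOST in the configured principal with its own fully qualified hostname, and hardcoding the resolved value simply pins each RM to the principal that matches its own keytab.

```python
import socket

def resolve_principal(configured, hostname=None):
    """Replace the _HOST placeholder with the node's FQDN, as each
    ResourceManager would do for its Kerberos principal at login time."""
    host = hostname or socket.getfqdn()
    return configured.replace("_HOST", host.lower())

configured = "rm/_HOST@EXAMPLE.COM"
print(resolve_principal(configured, "rm1.example.com"))  # rm/rm1.example.com@EXAMPLE.COM
print(resolve_principal(configured, "rm2.example.com"))  # rm/rm2.example.com@EXAMPLE.COM
# Hardcoding these resolved values in the Ambari YARN configuration ensures the
# active and standby RMs each log in with the principal present in their own keytab.
```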