Member since: 03-01-2016
Posts: 104
Kudos Received: 97
Solutions: 3

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1667 | 06-03-2018 09:22 PM
 | 28170 | 05-21-2018 10:31 PM
 | 2175 | 10-19-2016 07:13 AM
12-24-2016
02:36 PM
SYMPTOMS: /tmp filling up causes multiple services to stop functioning.

ROOT CAUSE: The issue is caused by an internal SmartSense bug (ST-2551).

SOLUTION: Upgrade to SmartSense 1.3.1.

WORKAROUND: To work around the issue, manually modify two SmartSense files so that the temporary files are no longer generated in the /tmp directory.

1. File: /usr/hdp/share/hst/hst-agent/lib/hst_agent/anonymize.py

Change from:

ANONYMIZATION_JAVA_COMMAND = "{0}" + os.sep + "bin" + os.sep + "java" +\
    " -Dlog.file.name="+ ANONYMIZATION_LOG_FILE_NAME +\
    " -cp {1} {2} {3}"

Change to:

ANONYMIZATION_JAVA_COMMAND = "{0}" + os.sep + "bin" + os.sep + "java" +\
    " -Djava.io.tmpdir=/grid/02/smartsense/hst-agent/data/tmp" +\
    " -Dlog.file.name="+ ANONYMIZATION_LOG_FILE_NAME +\
    " -cp {1} {2} {3}"

Make sure the tmp dir value is the same as the agent.tmp_dir property in hst-agent-conf.

2. File: /usr/sbin/hst-server.py

Change from:

SERVER_START_CMD = "{0}" + os.sep + "bin" + os.sep +\
    "java -server -XX:NewRatio=3 "\
    "-XX:+UseConcMarkSweepGC " +\
    "-XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 " +\
    debug_options +\
    " -Dlog.file.name="+ SERVER_LOG_FILE_NAME +" -Xms512m -Xmx2048m -cp {1}" + os.pathsep + "{2}" +\
    " com.hortonworks.support.tools.server.SupportToolServer "\
    ">" + SERVER_OUT_FILE + " 2>&1 &"

Change to:

SERVER_START_CMD = "{0}" + os.sep + "bin" + os.sep +\
    "java -server -XX:NewRatio=3 "\
    "-XX:+UseConcMarkSweepGC " +\
    "-XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 " +\
    "-Djava.io.tmpdir=/var/lib/smartsense/hst-server/tmp " +\
    debug_options +\
    " -Dlog.file.name="+ SERVER_LOG_FILE_NAME +" -Xms512m -Xmx2048m -cp {1}" + os.pathsep + "{2}" +\
    " com.hortonworks.support.tools.server.SupportToolServer "\
    ">" + SERVER_OUT_FILE + " 2>&1 &"

Make sure the tmp dir value is the same as the server.tmp.dir property in hst-server-conf.

3. After the above changes, please clean up the existing .pyc files in both of the above directories, then restart the SmartSense server and agents for the changes to take effect.
12-24-2016
02:00 PM
ENVIRONMENT: All Ambari versions prior to 2.4.x

SYMPTOMS: Intermittent loss of heartbeat to cluster nodes, the ambari-agent service freezing, and intermittent issues with Ambari alerts and service-status updates in the Ambari dashboard.

Ambari-agent logs:

INFO 2016-08-21 19:10:20,080 Heartbeat.py:78 - Building Heartbeat: {responseId = 139566, timestamp = 1471821020080, commandsInProgress = False, componentsMapped = True}
ERROR 2016-08-21 19:10:20,102 HostInfo.py:228 - Checking java processes failed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ambari_agent/HostInfo.py", line 211, in javaProcs
    cmd = open(os.path.join('/proc', pid, 'cmdline'), 'rb').read()
IOError: [Errno 2] No such file or directory: '/proc/24270/cmdline'

Top command output:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP TIME DATA COMMAND
10098 root 20 0 54.4g 53g 4540 S 54.5 14.0 18000:11 224 300,00 54g /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/main.py start --expected-hostname=123.example.com

ROOT CAUSE: A race condition in the Python subprocess module. In unlucky cases this race left Python garbage collection disabled. It usually happened while running alerts, since many alerts run shell commands from different threads. This is a known issue reported in AMBARI-17539.

SOLUTION: Upgrade to Ambari 2.4.x.

WORKAROUND: Restart ambari-agent, which fixes the issue temporarily. Log a case with HWX support to get a patch for the bug fix.
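For illustration only (this is not the actual Ambari patch): a minimal sketch of the failure mode. If the bug has left the garbage collector disabled in a long-lived Python process such as ambari-agent, reference cycles are never reclaimed and resident memory keeps growing, as in the 53g RES value in the top output above.

```python
import gc

# Hypothetical diagnostic run inside a long-lived Python process.
if not gc.isenabled():
    print("garbage collection is disabled - cyclic garbage will accumulate")
    gc.enable()            # restore normal behaviour
    freed = gc.collect()   # collect everything that piled up in the meantime
    print("collected %d unreachable objects" % freed)
else:
    print("garbage collection is enabled")
```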
12-24-2016
01:24 PM
ENVIRONMENT: HDP 2.3.2, Ambari 2.2.0, JDK 1.7.0_67-b01, Kernel: 3.13.0-48-generic
ERRORS: The last few lines in the NM log before it hit SIGSEGV show that a Container Localizer was running for a new container:
2016-10-20 01:29:05,810 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(711)) - Created localizer for container_e14_1475595980406_28807_01_000021
[...]
2016-10-20 01:29:08,308 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://user/tmp/hive/xxx/5b0f04c6-ba2d-47dc-85c2-88179a1db407/hive_2016-10-20_01-28-15_091_3309851709548218363-3928/-mr-10007/df6632b4-ec58-4cdf-8ffb-c81460abc266/reduce.xml(->/hadoop/yarn/local/usercache/xxx/filecache/150663/reduce.xml) transitioned from DOWNLOADING to LOCALIZED
- The exception says:
Current thread (0x00007f2c66cc7000): JavaThread "ContainerLocalizer Downloader" [_thread_in_Java, id=14260, stack(0x00007f2c740a3000,0x00007f2c741a4000)]
siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x00000000801f0ffb
- And the stack trace for '14260' shows:
Stack: [0x00007f2c740a3000,0x00007f2c741a4000], sp=0x00007f2c741a0fc8, free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$ClientOperationHeaderProto.getClientNameBytes()Lcom/google/protobuf/ByteString;+0
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$ClientOperationHeaderProto.getSerializedSize()I+48
J 915 C2 com.google.protobuf.CodedOutputStream.computeMessageSize(ILcom/google/protobuf/MessageLite;)I (10 bytes) @ 0x00007f2cad207530 [0x00007f2cad207500+0x30]
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$OpReadBlockProto.getSerializedSize()I+30
J 975 C2 com.google.protobuf.AbstractMessageLite.writeDelimitedTo(Ljava/io/OutputStream;)V (40 bytes) @ 0x00007f2cad254124 [0x00007f2cad2540e0+0x44]
j org.apache.hadoop.hdfs.protocol.datatransfer.Sender.send(Ljava/io/DataOutputStream;Lorg/apache/hadoop/hdfs/protocol/datatransfer/Op;Lcom/google/protobuf/Message;)V+60
j org.apache.hadoop.hdfs.protocol.datatransfer.Sender.readBlock(Lorg/apache/hadoop/hdfs/protocol/ExtendedBlock;Lorg/apache/hadoop/security/token/Token;Ljava/lang/String;
JJZLorg/apache/hadoop/hdfs/server/datanode/CachingStrategy;)V+49
j org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(Ljava/lang/String;Lorg/apache/hadoop/hdfs/protocol/ExtendedBlock;Lorg/apache/hadoop/security/token/Token;
JJZLjava/lang/String;Lorg/apache/hadoop/hdfs/net/Peer;Lorg/apache/hadoop/hdfs/protocol/DatanodeID;Lorg/apache/hadoop/hdfs/PeerCache;Lorg/apache/hadoop/hdfs/server/datanode/CachingStrategy;)Lorg/apache/hadoop/hdfs/BlockReader;+43
j org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(Lorg/apache/hadoop/hdfs/net/Peer;)Lorg/apache/hadoop/hdfs/BlockReader;+109
j org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp()Lorg/apache/hadoop/hdfs/BlockReader;+78
[...]
ROOT CAUSE: A segmentation fault in a Java process is usually due to a JVM bug. In this case, the user is on an older JDK version (1.7.0_67-b01). Updating to a more recent 1.7 release should be attempted to see if it resolves the SIGSEGV.
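Since the recommendation is simply to move off the affected JDK build, a quick check of which JVM the host is actually running can help (a sketch; it checks the default java on the PATH, which may differ from the JDK configured for the NodeManager):

```python
import subprocess

# 'java -version' prints its banner to stderr, so capture stderr as well.
out = subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT)
print(out.decode())  # e.g. java version "1.7.0_67" on an affected host
```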
12-24-2016
12:56 PM
ENVIRONMENT: HDP 2.5.0, Ambari 2.4.1
ERRORS: Logs from the Ambari server:
resource_management.core.exceptions.ExecutionFailed: Execution of 'ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ulimit -c unlimited ; /usr/hdp/2.5.3.0-37/hadoop/sbin/hadoop-daemon.sh --config /usr/hdp/2.5.3.0-37/hadoop/conf start namenode -rollingUpgrade started'' returned 1. -bash: line 0: ulimit: core file size: cannot modify limit: Operation not permitted
starting namenode, logging to /var/log/hadoop/hdfs/hadoop-hdfs-namenode-llab90hdpc2m3.out
ROOT CAUSE: Not yet known; reported as an internal bug (BUG-70647).
WORKAROUND: Add the following entries to /etc/security/limits.conf to complete the upgrade:
soft core unlimited
hard core unlimited
Or, as the root user, run the following command on the Ambari server host:
ulimit -c unlimited
Please note that this is not a recommended setting for any of the HDP components; it is only a workaround to complete the upgrade, so please revert the setting afterwards. If you are unsure about how to execute this step or about its implications, please raise a support case with HWX for further assistance.
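For context, 'ulimit -c unlimited' lifts the per-process core-dump size limit that the start command tries to set. A small Python sketch (illustrative only, not part of the workaround) showing how to inspect and raise that limit for the current process:

```python
import resource

# Current soft/hard limits for core file size (what `ulimit -c` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core file size: soft=%s hard=%s" % (soft, hard))

# Equivalent of `ulimit -c unlimited`; raising the hard limit requires root,
# which is why the workaround above must be applied as the root user.
resource.setrlimit(resource.RLIMIT_CORE,
                   (resource.RLIM_INFINITY, resource.RLIM_INFINITY))
```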
12-23-2016
10:48 PM
ENVIRONMENT: HDP 2.4.3, Ambari 2.4.0

SYMPTOMS: Region server logs are as follows:

2016-10-03 15:13:55,611 INFO [main] regionserver.HRegionServer: STOPPED: Unexpected exception during initialization, aborting
2016-10-03 15:13:55,649 ERROR [main] token.AuthenticationTokenSecretManager: Zookeeper initialization failed
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /hbase-secure/tokenauth/keys
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:575)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:554)

Zookeeper logs:

2016-10-04 15:48:45,702 - ERROR [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:SaslServerCallbackHandler@137] - Failed to set name based on Kerberos authentication rules.
org.apache.zookeeper.server.auth.KerberosName$NoMatchingRule: No rules applied to hbase/345.example.net@EXAMPLE.NET
    at org.apache.zookeeper.server.auth.KerberosName.getShortName(KerberosName.java:402)
    at org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handleAuthorizeCallback(SaslServerCallbackHandler.java:127)
    at org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handle(SaslServerCallbackHandler.java:83)
    at com.sun.security.sasl.gsskerb.GssKrb5Server.doHandshake2(GssKrb5Server.java:317)

ACL entries on the Zookeeper servers:

123.example.net:2181(CONNECTED) 0] getAcl /hbase-secure
'world,'anyone
: r
'sasl,'hbase/345.example.net@EXAMPLE.NET
: cdrwa
'sasl,'hbase/345.example.net@EXAMPLE.NET
: cdrwa

ROOT CAUSE: Ideally, ACLs should not be defined with the hostname as part of the principal, as this can cause issues when another node takes over the master role or during a rolling restart of services. In this case it was set that way because of a bug in Ambari (AMBARI-18528) which mangled the translation based on zookeeper.security.auth_to_local in zookeeper-env.sh. Please go through that bug for the required workaround and other details (adding a backslash in front of the dollar sign in the respective rule).

But why was authentication failing despite a kinit using exactly the same principal as defined in the Zookeeper ACL? The answer lies in these settings in zoo.cfg:

kerberos.removeHostFromPrincipal=true
kerberos.removeRealmFromPrincipal=true

These two settings ensure that every principal authenticated to Zookeeper is stripped of its hostname and realm, so only the short name is used by the Zookeeper server. The tricky part is that this stripping does not apply to the setAcl API.

SOLUTION: Please note that the regular "rmr" command to delete the HBase znode will fail with "Authentication is not valid" errors, so an alternative is needed; one such method is this link. Also try setting the Java system property zookeeper.skipACL=true in zookeeper-env.sh. If that does not work, the existing znode has to be removed by more forceful means, such as stopping HBase and deleting the entire Zookeeper data directory; take this step with utmost caution and only if no other service depends on Zookeeper. Once the HBase znodes have been deleted, use the workaround given in AMBARI-18528 to populate the correct ACL entries, and finally start HBase.
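To illustrate the mismatch (a standalone sketch, not Zookeeper's actual implementation): with removeHostFromPrincipal and removeRealmFromPrincipal enabled, the authenticated identity is reduced to a short name, while the ACL written by setAcl still carries the full principal, so the two never match and every request gets NoAuth.

```python
def strip_principal(principal, remove_host=True, remove_realm=True):
    """Mimic kerberos.removeHostFromPrincipal / kerberos.removeRealmFromPrincipal."""
    name = principal
    if remove_realm and "@" in name:
        name = name.split("@", 1)[0]   # drop @EXAMPLE.NET
    if remove_host and "/" in name:
        name = name.split("/", 1)[0]   # drop /345.example.net
    return name

acl_principal = "hbase/345.example.net@EXAMPLE.NET"    # what setAcl stored
authenticated = strip_principal(acl_principal)         # what the server compares

print(authenticated)                    # hbase
print(authenticated == acl_principal)   # False -> NoAuth despite a correct kinit
```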
12-23-2016
03:59 PM
We could get these stats from Grafana, but is there any way they could be exported in text/XML or any other format?
12-23-2016
11:24 AM
PROBLEM: The balancer fails within a few minutes without any block movement.
SYMPTOMS: The balancer exits with the following messages:
16/11/22 07:08:29 DEBUG ipc.Client: IPC Client (280134559) connection to xxx.corp.example.com/0.0.0.0:8020 from hdfs-EST@HADOOP.XXX.CORP.EXAMPLE.COM got value #1193
16/11/22 07:08:29 DEBUG ipc.ProtobufRpcEngine: Call: getBlocks took 2486ms
No block has been moved for 5 iterations. Exiting...
Nov 22, 2016 7:08:29 AM   4   0 B   35.86 TB   200 GB
ROOT CAUSE: The rack distribution looked like this:
/default-rack : 91
/Example1 : 18
/Example2 : 2
The 100%-utilized nodes we were trying to balance in order to free up space were the 20 nodes registered in racks /Example1 and /Example2. Based on the following rack-awareness rules the balancer applies to block placement (rule 3 in this case), it was not possible to move even a single block without compromising fault tolerance:
/**
 * Decide if the block is a good candidate to be moved from source to target.
 * A block is a good candidate if
 * 1. the block is not in the process of being moved/has not been moved;
 * 2. the block does not have a replica on the target;
 * 3. doing the move does not reduce the number of racks that the block has
 */
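A simplified illustration of rule 3 (a sketch, not the actual Balancer code): when a block already has one replica in each of the only racks with capacity, moving a replica into a rack that already holds one would reduce the block's rack count, so the balancer rejects the move.

```python
def reduces_rack_count(replica_racks, source_rack, target_rack):
    """Return True if moving one replica from source_rack to target_rack
    would leave the block spread over fewer distinct racks (rule 3)."""
    after_move = list(replica_racks)
    after_move.remove(source_rack)
    after_move.append(target_rack)
    return len(set(after_move)) < len(set(replica_racks))

# One replica per rack, as in the cluster described above.
racks = ["/default-rack", "/Example1", "/Example2"]

# Moving the /Example2 replica anywhere reduces the rack count from 3 to 2,
# so the block is never a good candidate and the balancer gives up.
print(reduces_rack_count(racks, "/Example2", "/default-rack"))  # True
print(reduces_rack_count(racks, "/Example2", "/Example1"))      # True
```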
SOLUTION: Distribute nodes evenly across all racks. If this is not possible, add additional storage to the affected nodes, or add new datanodes to the respective racks.
12-23-2016
09:02 AM
1 Kudo
SYMPTOMS: No visible errors in the Resource Manager / Node Manager logs indicating any resource bottleneck. Logs from the container/task that is not progressing are as follows:

Error: java.io.IOException: com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException: Connection to database server lost; check server and network status [System error: Socket closed]
    at org.apache.sqoop.mapreduce.db.DBRecordReader.close(DBRecordReader.java:173)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:523)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:791)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException: Connection to database server lost; check server and network status [System error: Socket closed]
    at com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException.createException(SQLNonTransientConnectionException.java:40)
    at com.sap.db.jdbc.exceptions.SQLExceptionSapDB.createException(SQLExceptionSapDB.java:252)
    at com.sap.db.jdbc.exceptions.SQLExceptionSapDB.createException(SQLExceptionSapDB.java:214)
    at com.sap.db.jdbc.exceptions.SQLExceptionSapDB.generateSQLException(SQLExceptionSapDB.java:166)
    at com.sap.db.jdbc.exceptions.ConnectionException.createException(ConnectionException.java:22)
    at com.sap.db.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:1117)
    at com.sap.db.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:877)
    at com.sap.db.jdbc.ConnectionSapDB.commitInternal(ConnectionSapDB.java:353)
    at com.sap.db.jdbc.ConnectionSapDB.commit(ConnectionSapDB.java:340)
    at com.sap.db.jdbc.trace.Connection.commit(Connection.java:126)
    at org.apache.sqoop.mapreduce.db.DBRecordReader.close(DBRecordReader.java:169)
    ... 8 more

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

ROOT CAUSE: The issue appears to be on the SAP HANA side, not the HDP side. The following URL discusses the same error: https://archive.sap.com/discussions/thread/3675080

NEXT STEPS: Contact the SAP HANA support team for further troubleshooting.
12-23-2016
07:53 AM
SYMPTOM:
All the services in the cluster are down, and restarting the services fails with the following error:

2016-11-17 21:42:18,235 ERROR namenode.NameNode (NameNode.java:main(1712)) - Failed to start namenode.
java.io.IOException: Login failure for nn/lnx21131.examplet.ex.com@EXAMPLE.AD.EX.COM from keytab /etc/security/keytabs/nn.service.keytab: javax.security.auth.login.LoginException: Client not found in Kerberos database (6)
...
Caused by: KrbException: Client not found in Kerberos database (6)
...
Caused by: KrbException: Identifier doesn't match expected value (906)

Regeneration of keytabs using Ambari also failed, as follows:

17 Nov 2016 23:58:59,136 WARN [Server Action Executor Worker 12702] CreatePrincipalsServerAction:233 - Principal, HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM, does not exist, creating new principal
17 Nov 2016 23:58:59,151 ERROR [Server Action Executor Worker 12702] CreatePrincipalsServerAction:284 - Failed to create or update principal, HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM - Can not create principal : HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM
org.apache.ambari.server.serveraction.kerberos.KerberosOperationException: Can not create principal : HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM
Caused by: javax.naming.NameAlreadyBoundException: [LDAP: error code 68 - 00002071: UpdErr: DSID-0305038D, problem 6005 (ENTRY_EXISTS), data 0
]; remaining name '"cn=HTTP/lnx21142.examplet.ex.com,OU=Hadoop,OU=EXAMPLE_Users,DC=examplet,DC=ad,DC=ex,DC=com"'

ROOT CAUSE: Wrong entries in all the service accounts (VPN) in AD: the character '/' had been replaced with '_' by a faulty script.

RESOLUTION: Fix the affected service accounts in AD. In this case, every '_' was replaced back with '/' in the service accounts in AD.
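A quick sanity check for this kind of failure (a sketch that shells out to the standard Kerberos client tools; the keytab path and principal are the ones from the log above): list the principals stored in the keytab and try to authenticate with them. A "Client not found in Kerberos database" error at this point confirms the account is wrong or missing on the AD side rather than the keytab being stale.

```python
import subprocess

keytab = "/etc/security/keytabs/nn.service.keytab"
principal = "nn/lnx21131.examplet.ex.com@EXAMPLE.AD.EX.COM"

# Show which principals the keytab actually contains.
print(subprocess.check_output(["klist", "-kt", keytab]).decode())

# Try to obtain a ticket with the keytab; a non-zero exit code with
# "Client not found in Kerberos database" points at the AD account itself.
rc = subprocess.call(["kinit", "-kt", keytab, principal])
print("kinit exit code: %d" % rc)
```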
12-22-2016
07:49 PM
PROBLEM: Unable to start the Resource Manager, which fails with the errors below:

STARTUP_MSG: build = git@github.com:hortonworks/hadoop.git -r 9e75108092247d96ce7d70839b6945e9eba2a0b7; compiled by 'jenkins' on 2014-11-04T04:31Z
STARTUP_MSG: java = 1.7.0_67
************************************************************/
2014-11-04 08:41:08,705 INFO resourcemanager.ResourceManager (SignalLogger.java:register(91)) - registered UNIX signal handlers for [TERM, HUP, INT]
2014-11-04 08:41:10,636 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229)
Caused by: java.io.IOException: Login failure for rm/ip-172-31-32-22.ec2.internal@EXAMPLE.COM from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935)
    at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.doSecureLogin(ResourceManager.java:1109)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:209)
    ... 2 more
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
2014-11-04 08:41:10,641 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1077)) - Transitioning to standby state
2014-11-04 08:41:10,642 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1087)) - Transitioned to standby state
2014-11-04 08:41:10,643 FATAL resourcemanager.ResourceManager (ResourceManager.java:main(1233)) - Error starting ResourceManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229)
Caused by: java.io.IOException: Login failure for rm/ip-172-31-32-22.ec2.internal@EXAMPLE.COM from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935)
    at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.doSecureLogin(ResourceManager.java:1109)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:209)
    ... 2 more
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user

ROOT CAUSE: This issue occurs because the active RM is using the user principal of the standby RM, and vice versa. This is reported in bug YARN-2805 (HDP bug BUG-26831); both bugs have since been resolved.

SOLUTION: If you are on HDP 2.2.0, raise a support case with HWX to get a hotfix.

WORKAROUND: Hardcode the principal entry "rm/_HOST@EXAMPLE.COM" in the YARN configuration in Ambari by replacing the "_HOST" part with the actual hostname of the active and standby resource manager, respectively.
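To illustrate what the workaround changes (a sketch of the usual _HOST substitution convention, not Hadoop's SecurityUtil itself): at login time each daemon replaces _HOST in the configured principal with its own fully qualified hostname, and hardcoding the resolved value simply pins each RM to the principal that matches its own keytab.

```python
import socket

def resolve_principal(configured, hostname=None):
    """Replace the _HOST placeholder with the node's FQDN, as each
    ResourceManager would do for its Kerberos principal at login time."""
    host = hostname or socket.getfqdn()
    return configured.replace("_HOST", host.lower())

configured = "rm/_HOST@EXAMPLE.COM"
print(resolve_principal(configured, "rm1.example.com"))  # rm/rm1.example.com@EXAMPLE.COM
print(resolve_principal(configured, "rm2.example.com"))  # rm/rm2.example.com@EXAMPLE.COM
# Hardcoding these resolved values in the Ambari YARN configuration ensures the
# active and standby RMs each log in with the principal present in their own keytab.
```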