Member since
03-01-2016
104
Posts
97
Kudos Received
3
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1532 | 06-03-2018 09:22 PM
 | 26079 | 05-21-2018 10:31 PM
 | 1999 | 10-19-2016 07:13 AM
12-24-2016
02:00 PM
ENVIRONMENT: All Ambari versions prior to 2.4.x

SYMPTOMS: Intermittent loss of heartbeat to cluster nodes, freezes of the ambari-agent service, and intermittent problems with Ambari alerts and service-status updates in the Ambari dashboard.

Ambari-agent logs:

INFO 2016-08-21 19:10:20,080 Heartbeat.py:78 - Building Heartbeat: {responseId = 139566, timestamp = 1471821020080, commandsInProgress = False, componentsMapped = True}
ERROR 2016-08-21 19:10:20,102 HostInfo.py:228 - Checking java processes failed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ambari_agent/HostInfo.py", line 211, in javaProcs
    cmd = open(os.path.join('/proc', pid, 'cmdline'), 'rb').read()
IOError: [Errno 2] No such file or directory: '/proc/24270/cmdline'

Top command output:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP TIME DATA COMMAND
10098 root 20 0 54.4g 53g 4540 S 54.5 14.0 18000:11 224 300,00 54g /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/main.py start --expected-hostname=123.example.com

ROOT CAUSE: A race condition in the Python subprocess module. In some unlucky cases this race leaves Python garbage collection disabled, after which the agent's memory footprint grows unchecked (note the 53 GB resident size in the top output above). This usually happened when running alerts, as many alerts run shell commands and do so from different threads. This is a known issue, reported in AMBARI-17539.

SOLUTION: Upgrade to Ambari 2.4.x.

WORKAROUND: Restart ambari-agent, which fixes the issue temporarily. Log a case with HWX support to get a patch for the bug fix.
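Because the root cause is garbage collection being left disabled, a simple check can confirm (and temporarily repair) the state inside a long-running Python process. This is a minimal sketch, not the actual AMBARI-17539 fix; the function name is illustrative:

```python
import gc

def ensure_gc_enabled():
    """Re-enable the collector if a racy subprocess call left it off.

    Affected subprocess implementations disable gc around fork() and,
    when two threads race, can fail to restore it. This watchdog simply
    checks the flag and turns collection back on, returning True when a
    repair was needed.
    """
    if not gc.isenabled():
        gc.enable()
        return True
    return False
```

A periodic call to such a check from a monitoring thread would mask the leak until the agent can be upgraded, in the same spirit as the restart workaround.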
12-24-2016
01:24 PM
ENVIRONMENT: HDP 2.3.2, Ambari 2.2.0, JDK 1.7.0_67-b01, Kernel: 3.13.0-48-generic
ERRORS: The last few lines in the NodeManager log before it hit SIGSEGV show that a Container Localizer was running for a new container:
2016-10-20 01:29:05,810 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(711)) - Created localizer for container_e14_1475595980406_28807_01_000021
[...]
2016-10-20 01:29:08,308 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://user/tmp/hive/xxx/5b0f04c6-ba2d-47dc-85c2-88179a1db407/hive_2016-10-20_01-28-15_091_3309851709548218363-3928/-mr-10007/df6632b4-ec58-4cdf-8ffb-c81460abc266/reduce.xml(->/hadoop/yarn/local/usercache/xxx/filecache/150663/reduce.xml) transitioned from DOWNLOADING to LOCALIZED
- The exception says:
Current thread (0x00007f2c66cc7000): JavaThread "ContainerLocalizer Downloader" [_thread_in_Java, id=14260, stack(0x00007f2c740a3000,0x00007f2c741a4000)]
siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x00000000801f0ffb
- And the stack trace for '14260' shows:
Stack: [0x00007f2c740a3000,0x00007f2c741a4000], sp=0x00007f2c741a0fc8, free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$ClientOperationHeaderProto.getClientNameBytes()Lcom/google/protobuf/ByteString;+0
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$ClientOperationHeaderProto.getSerializedSize()I+48 J 915 C2 com.google.protobuf.CodedOutputStream.computeMessageSize(ILcom/google/protobuf/MessageLite;)I (10 bytes) @ 0x00007f2cad207530 [0x00007f2cad207500+0x30]
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$OpReadBlockProto.getSerializedSize()I+30 J 975 C2 com.google.protobuf.AbstractMessageLite.writeDelimitedTo(Ljava/io/OutputStream;)V (40 bytes) @ 0x00007f2cad254124 [0x00007f2cad2540e0+0x44]
j org.apache.hadoop.hdfs.protocol.datatransfer.Sender.send(Ljava/io/DataOutputStream;Lorg/apache/hadoop/hdfs/protocol/datatransfer/Op;Lcom/google/protobuf/Message;)V+60
j org.apache.hadoop.hdfs.protocol.datatransfer.Sender.readBlock(Lorg/apache/hadoop/hdfs/protocol/ExtendedBlock;Lorg/apache/hadoop/security/token/Token;Ljava/lang/String;
JJZLorg/apache/hadoop/hdfs/server/datanode/CachingStrategy;)V+49
j org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(Ljava/lang/String;Lorg/apache/hadoop/hdfs/protocol/ExtendedBlock;Lorg/apache/hadoop/security/token/Token;
JJZLjava/lang/String;Lorg/apache/hadoop/hdfs/net/Peer;Lorg/apache/hadoop/hdfs/protocol/DatanodeID;Lorg/apache/hadoop/hdfs/PeerCache;Lorg/apache/hadoop/hdfs/server/datanode/CachingStrategy;)Lorg/apache/hadoop/hdfs/BlockReader;+43
j org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(Lorg/apache/hadoop/hdfs/net/Peer;)Lorg/apache/hadoop/hdfs/BlockReader;+109
j org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp()Lorg/apache/hadoop/hdfs/BlockReader;+78
[...]
ROOT CAUSE: A segmentation fault in a Java process is usually due to a JVM bug. In this case, the user is on an older JDK release (1.7.0_67-b01). Updating to a more recent 1.7 release should be attempted to see whether it resolves the SIGSEGV.
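When triaging a crash like this, the two facts that matter first are the signal and the JRE build, both of which appear in the JVM fatal-error log (hs_err_pid*.log). A hedged sketch of pulling them out with simple patterns (the field formats shown are illustrative; real files vary by JVM release):

```python
import re

def summarize_hs_err(text):
    """Extract the signal name and JRE version from a JVM fatal-error log.

    Matches lines of the form
        siginfo:si_signo=SIGSEGV: si_errno=0, ...
        # JRE version: Java(TM) SE Runtime Environment (7.0_67-b01)
    Returns None for fields that are absent.
    """
    signal = None
    m = re.search(r'si_signo=(\w+)', text)
    if m:
        signal = m.group(1)
    jre = None
    m = re.search(r'JRE version:.*?\(([\d._b-]+)\)', text)
    if m:
        jre = m.group(1)
    return {'signal': signal, 'jre': jre}
```

If the extracted JRE build is older than the latest patch release for that major version, updating the JDK is the first thing to try, as the article recommends.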
12-24-2016
12:56 PM
ENVIRONMENT: HDP 2.5.0 , Ambari 2.4.1
ERRORS: Logs from the Ambari server:
resource_management.core.exceptions.ExecutionFailed: Execution of 'ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ulimit -c unlimited ; /usr/hdp/2.5.3.0-37/hadoop/sbin/hadoop-daemon.sh --config /usr/hdp/2.5.3.0-37/hadoop/conf start namenode -rollingUpgrade started'' returned 1. -bash: line 0: ulimit: core file size: cannot modify limit: Operation not permitted
starting namenode, logging to /var/log/hadoop/hdfs/hadoop-hdfs-namenode-llab90hdpc2m3.out
ROOT CAUSE: Not yet known; reported as an internal bug (BUG-70647).
WORKAROUND: Add the following entries to /etc/security/limits.conf to complete the upgrade:
* soft core unlimited
* hard core unlimited
OR, as the root user, run the following command on the Ambari server host:
ulimit -c unlimited
Please note that this is not a recommended setting for any of the HDP components; it is only a workaround to complete the upgrade, so revert the setting afterwards. If unsure of the execution or implications of this step, please raise a support case with HWX for further assistance.
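Whether the `ulimit -c unlimited` in hadoop-daemon.sh can succeed depends on the hard core-file-size limit the process inherits, which is exactly what the limits.conf entries raise. A small sketch for inspecting the current limits from Python (function name is illustrative):

```python
import resource

def core_dump_limits():
    """Return the (soft, hard) core-file-size limits for this process.

    hadoop-daemon.sh runs 'ulimit -c unlimited'; that only succeeds when
    the inherited hard limit already permits it, which is what the
    limits.conf workaround arranges. Infinite limits are reported as the
    string 'unlimited' to mirror ulimit's output.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    fmt = lambda v: 'unlimited' if v == resource.RLIM_INFINITY else v
    return fmt(soft), fmt(hard)
```

Running this as the `hdfs` user before the upgrade would show whether the "Operation not permitted" error is going to occur.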
12-23-2016
10:48 PM
ENVIRONMENT: HDP 2.4.3, Ambari 2.4.0

SYMPTOMS: Region server logs are as follows:

2016-10-03 15:13:55,611 INFO [main] regionserver.HRegionServer: STOPPED: Unexpected exception during initialization, aborting
2016-10-03 15:13:55,649 ERROR [main] token.AuthenticationTokenSecretManager: Zookeeper initialization failed
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /hbase-secure/tokenauth/keys
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:575)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:554)

Zookeeper logs:

2016-10-04 15:48:45,702 - ERROR [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:SaslServerCallbackHandler@137] - Failed to set name based on Kerberos authentication rules.
org.apache.zookeeper.server.auth.KerberosName$NoMatchingRule: No rules applied to hbase/345.example.net@EXAMPLE.NET
at org.apache.zookeeper.server.auth.KerberosName.getShortName(KerberosName.java:402)
at org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handleAuthorizeCallback(SaslServerCallbackHandler.java:127)
at org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handle(SaslServerCallbackHandler.java:83)
	at com.sun.security.sasl.gsskerb.GssKrb5Server.doHandshake2(GssKrb5Server.java:317)

ACL entries in Zookeeper servers:

[123.example.net:2181(CONNECTED) 0] getAcl /hbase-secure
'world,'anyone
: r
'sasl,'hbase/345.example.net@EXAMPLE.NET
: cdrwa
'sasl,'hbase/345.example.net@EXAMPLE.NET
: cdrwa
ROOT CAUSE: Ideally, ACLs should not be defined with the hostname as part of the principal, as this can cause issues when another node takes over the master role or during a rolling restart of services. In this case it was set that way because of a bug in Ambari (AMBARI-18528), which mangled the translation based on zookeeper.security.auth_to_local in zookeeper-env.sh. Please go through that bug for the required workaround and other details (adding a backslash in front of the dollar sign in the respective rule).

But why was authentication failing despite a kinit using exactly the same principal as defined in the Zookeeper ACL? The answer lies in these settings in zoo.cfg:

kerberos.removeHostFromPrincipal=true
kerberos.removeRealmFromPrincipal=true

These two settings ensure that every principal authenticated to Zookeeper is stripped of its hostname as well as its realm, so that only a short name is used by the Zookeeper server. The tricky part is that this stripping does not apply to the setAcl API, so the stored ACL still carries the full principal and never matches the shortened authenticated name.

SOLUTION: Please note that our regular "rmr" command to delete the HBase znode would fail with "Authentication is not valid" errors, so a few alternatives are needed; one such method is this link. Also try setting the Java system property zookeeper.skipACL=true in zookeeper-env.sh. If this does not work, the existing znode has to be deleted through more forceful methods, such as stopping HBase and deleting the entire Zookeeper data directory; take this step with utmost caution and only if no other service depends on Zookeeper. Once the HBase znodes have been deleted, use the workaround given in AMBARI-18528 to populate correct ACL entries and finally start HBase.
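The mismatch is easier to see with the stripping written out. This is a simplified model of the server-side behaviour under those two zoo.cfg flags, not the actual ZooKeeper code:

```python
def zk_short_name(principal, remove_host=True, remove_realm=True):
    """Shorten a Kerberos principal the way ZooKeeper does when
    kerberos.removeHostFromPrincipal / removeRealmFromPrincipal are true.

    'hbase/host.example.net@EXAMPLE.NET' -> 'hbase'
    """
    name = principal
    if remove_realm and '@' in name:
        name = name.split('@', 1)[0]   # drop the realm
    if remove_host and '/' in name:
        name = name.split('/', 1)[0]   # drop the host component
    return name
```

So the server authenticates the client as the short name `hbase`, while the ACL (written via setAcl, which skips this stripping) still names `hbase/345.example.net@EXAMPLE.NET`; the two never match, hence NoAuth.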
12-23-2016
04:05 PM
1 Kudo
You can use jmxterm. Article here: https://community.hortonworks.com/content/kbentry/61188/enable-jmx-metrics-on-hadoop-using-jmxterm.html
09-12-2017
03:24 PM
@gsharma can you please advise on this issue I am having: https://community.hortonworks.com/questions/136870/balancer-no-block-has-been-moved-for-5-iterations.html
12-23-2016
09:02 AM
1 Kudo
SYMPTOMS: No visible errors in the Resource Manager / Node Manager logs for any resource bottleneck. Logs from the container/task that is not progressing are as follows:

Error: java.io.IOException: com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException: Connection to database server lost;
check server and network status [System error: Socket closed] at
org.apache.sqoop.mapreduce.db.DBRecordReader.close(DBRecordReader.java:173) at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:523) at
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:791) at
org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) at
java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724) at
org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException: Connection to database server lost; check server and network status [System error: Socket closed] at
com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException.createException(SQLNonTransientConnectionException.java:40) at
com.sap.db.jdbc.exceptions.SQLExceptionSapDB.createException(SQLExceptionSapDB.java:252) at
com.sap.db.jdbc.exceptions.SQLExceptionSapDB.createException(SQLExceptionSapDB.java:214) at
com.sap.db.jdbc.exceptions.SQLExceptionSapDB.generateSQLException(SQLExceptionSapDB.java:166) at
com.sap.db.jdbc.exceptions.ConnectionException.createException(ConnectionException.java:22) at
com.sap.db.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:1117) at
com.sap.db.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:877) at
com.sap.db.jdbc.ConnectionSapDB.commitInternal(ConnectionSapDB.java:353) at
com.sap.db.jdbc.ConnectionSapDB.commit(ConnectionSapDB.java:340) at
com.sap.db.jdbc.trace.Connection.commit(Connection.java:126) at
org.apache.sqoop.mapreduce.db.DBRecordReader.close(DBRecordReader.java:169) ... 8 more

Container killed by the ApplicationMaster. Container killed on request. Exit code is 143. Container exited with a non-zero exit code 143.

ROOT CAUSE: The issue appears to be on the SAP HANA side and not at the HDP end. The following URL discusses the same error: https://archive.sap.com/discussions/thread/3675080

NEXT STEPS: Contact the SAP HANA support team for further troubleshooting.
12-23-2016
07:53 AM
SYMPTOM:
All the services in the cluster are down, and restarting the services fails with the following error:

2016-11-17 21:42:18,235 ERROR namenode.NameNode (NameNode.java:main(1712)) - Failed to start namenode.
java.io.IOException: Login failure for nn/lnx21131.examplet.ex.com@EXAMPLE.AD.EX.COM from keytab /etc/security/keytabs/nn.service.keytab: javax.security.auth.login.LoginException: Client not found in Kerberos database (6)
...
Caused by: KrbException: Client not found in Kerberos database (6)
...
Caused by: KrbException: Identifier doesn't match expected value (906)

Regeneration of keytabs using Ambari also failed, as follows:

17 Nov 2016 23:58:59,136 WARN [Server Action Executor Worker 12702] CreatePrincipalsServerAction:233 - Principal, HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM, does not exist, creating new principal
17 Nov 2016 23:58:59,151 ERROR [Server Action Executor Worker 12702] CreatePrincipalsServerAction:284 - Failed to create or update principal, HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM - Can not create principal : HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM
org.apache.ambari.server.serveraction.kerberos.KerberosOperationException: Can not create principal : HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM
Caused by: javax.naming.NameAlreadyBoundException: [LDAP: error code 68 - 00002071: UpdErr: DSID-0305038D, problem 6005 (ENTRY_EXISTS), data 0
]; remaining name '"cn=HTTP/lnx21142.examplet.ex.com,OU=Hadoop,OU=EXAMPLE_Users,DC=examplet,DC=ad,DC=ex,DC=com"'

ROOT CAUSE: Wrong entries in all service accounts (VPN) in AD: the character '/' had been replaced with '_' by a faulty script.

RESOLUTION: Fix the entries in the AD service accounts. In the above case, every '_' was replaced back with '/' in the service accounts in AD.
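The repair is a targeted string fix: only the separator between the service name and the host should be restored, not every underscore in the cn. A hedged sketch (the prefix list and function name are illustrative, not part of the original fix script):

```python
def repair_principal_cn(cn, service_prefixes=('HTTP', 'nn', 'rm', 'hbase')):
    """Undo a scripted '/' -> '_' mangling in an AD cn attribute.

    Restores only the first '_' after a known service prefix, so cn
    values that legitimately contain underscores elsewhere are left
    untouched.
    """
    for prefix in service_prefixes:
        if cn.startswith(prefix + '_'):
            return prefix + '/' + cn[len(prefix) + 1:]
    return cn
```

Applied over the affected OU, `HTTP_lnx21142.examplet.ex.com` becomes `HTTP/lnx21142.examplet.ex.com`, matching the keytab principals again.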
12-22-2016
07:49 PM
PROBLEM: Unable to start the Resource Manager, which fails with the following errors:

STARTUP_MSG: build = git@github.com:hortonworks/hadoop.git -r 9e75108092247d96ce7d70839b6945e9eba2a0b7; compiled by 'jenkins' on 2014-11-04T04:31Z
STARTUP_MSG: java = 1.7.0_67
************************************************************/
2014-11-04 08:41:08,705 INFO resourcemanager.ResourceManager (SignalLogger.java:register(91)) - registered UNIX signal handlers for [TERM, HUP, INT]
2014-11-04 08:41:10,636 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state INITED;
cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229)
Caused by: java.io.IOException: Login failure for rm/ip-172-31-32-22.ec2.internal@EXAMPLE.COM from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user
	at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935)
	at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.doSecureLogin(ResourceManager.java:1109)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:209)
	... 2 more
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
2014-11-04 08:41:10,641 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1077)) - Transitioning to standby state
2014-11-04 08:41:10,642 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1087)) - Transitioned to standby state
2014-11-04 08:41:10,643 FATAL resourcemanager.ResourceManager (ResourceManager.java:main(1233)) - Error starting ResourceManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229)
Caused by: java.io.IOException: Login failure for rm/ip-172-31-32-22.ec2.internal@EXAMPLE.COM from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user
	at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935)
	at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.doSecureLogin(ResourceManager.java:1109)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:209)
	... 2 more
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user

ROOT CAUSE: This issue is caused because the active RM is using the principal of the other (standby) RM, and vice versa. This is reported in bug YARN-2805 and HDP bug BUG-26831; the bugs have now been resolved.

SOLUTION: If you are on HDP 2.2.0, raise a support case with HWX to get a hotfix.

WORKAROUND: Hardcode the principal entry "rm/_HOST@EXAMPLE.COM" in the Yarn configuration in Ambari, replacing the "_HOST" part with the actual hostname of the active and standby resource manager respectively.
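The workaround above does by hand what Hadoop normally does automatically: substitute the local hostname for the `_HOST` placeholder before the keytab login. A simplified sketch of that expansion (modelled on, but not identical to, Hadoop's SecurityUtil behaviour):

```python
import socket

def expand_principal(principal, hostname=None):
    """Expand the _HOST placeholder in a service principal.

    With the YARN-2805 bug, each RM resolved _HOST to the wrong peer's
    hostname; hardcoding the expanded value per host sidesteps the
    faulty resolution.
    """
    if hostname is None:
        hostname = socket.getfqdn().lower()
    return principal.replace('_HOST', hostname)
```

For example, on the active RM the configured value `rm/_HOST@EXAMPLE.COM` must expand to that host's own FQDN so the keytab entry matches.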
12-22-2016
03:28 PM
Consider increasing network capacity to overcome the challenge caused by non-locality of blocks.

Create configuration groups of datanodes exclusively for HBase, disabling the HDFS balancer on this group and allowing only the HBase balancer. Follow this url Host_Config_Groups to create host config groups.

A few temporary workarounds can also be applied if the problem is severe and needs immediate attention:

- Disable the HDFS balancer permanently on the cluster and run it manually on an as-needed basis. (Please open a support case and have the situation discussed before implementing this workaround.)
- If the performance issue needs to be fixed after a run of the HDFS balancer, a major compaction can be manually initiated. For performance gains, major compaction is run during off-peak hours such as weekends. The article Compaction_Best_Practices is a recommended read here.
- Schedule major compaction after the scheduled balancer run, rather than vice versa.

HDFS has introduced the "favored nodes" feature, but the HBase APIs are not yet equipped to choose specific nodes during data writing.

Please note that these are expert-level configurations and procedures; if unsure of their implications, it is always recommended to open a support case with us.

Refer to the following Apache URLs to track the progress of the region-block pinning implementation:
https://issues.apache.org/jira/browse/HBASE-13021
https://issues.apache.org/jira/browse/HDFS-6133