12-24-2016
12:56 PM
ENVIRONMENT: HDP 2.5.0, Ambari 2.4.1
ERRORS: Logs from the Ambari server:
resource_management.core.exceptions.ExecutionFailed: Execution of 'ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ulimit -c unlimited ; /usr/hdp/2.5.3.0-37/hadoop/sbin/hadoop-daemon.sh --config /usr/hdp/2.5.3.0-37/hadoop/conf start namenode -rollingUpgrade started'' returned 1. -bash: line 0: ulimit: core file size: cannot modify limit: Operation not permitted
starting namenode, logging to /var/log/hadoop/hdfs/hadoop-hdfs-namenode-llab90hdpc2m3.out
ROOT CAUSE:- Not yet known; reported as an internal bug (BUG-70647).
WORKAROUND:- Add the following entries to /etc/security/limits.conf to complete the upgrade:
* soft core unlimited
* hard core unlimited
Alternatively, as the root user, run the following command on the Ambari server host:
ulimit -c unlimited
Please note that this is not a recommended setting for any of the HDP components; it is only a workaround to complete the upgrade, and the setting should be reverted afterwards. If unsure of the execution or implications of this step, please raise a support case with HWX for further assistance.
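Before retrying the upgrade, it can be worth confirming that the limit change has actually taken effect. A minimal sketch (run it in a fresh login shell of the service user, e.g. hdfs, since limits.conf is applied at login):

```shell
# Print the soft and hard core-file-size limits for the current shell; both
# should read "unlimited" after the limits.conf change and a re-login.
ulimit -S -c
ulimit -H -c
```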
12-23-2016
10:48 PM
Environment: HDP 2.4.3, Ambari 2.4.0

SYMPTOMS: Region server logs are as follows:

2016-10-03 15:13:55,611 INFO [main] regionserver.HRegionServer: STOPPED: Unexpected exception during initialization, aborting
2016-10-03 15:13:55,649 ERROR [main] token.AuthenticationTokenSecretManager: Zookeeper initialization failed
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /hbase-secure/tokenauth/keys
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:575)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:554)

Zookeeper logs:

2016-10-04 15:48:45,702 - ERROR [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:SaslServerCallbackHandler@137] - Failed to set name based on Kerberos authentication rules.
org.apache.zookeeper.server.auth.KerberosName$NoMatchingRule: No rules applied to hbase/345.example.net@EXAMPLE.NET
    at org.apache.zookeeper.server.auth.KerberosName.getShortName(KerberosName.java:402)
    at org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handleAuthorizeCallback(SaslServerCallbackHandler.java:127)
    at org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handle(SaslServerCallbackHandler.java:83)
    at com.sun.security.sasl.gsskerb.GssKrb5Server.doHandshake2(GssKrb5Server.java:317)

ACL entries in Zookeeper servers:

[zk: 123.example.net:2181(CONNECTED) 0] getAcl /hbase-secure
'world,'anyone
: r
'sasl,'hbase/345.example.net@EXAMPLE.NET
: cdrwa
ROOT CAUSE: Ideally, ACLs should not be defined with hostnames as part of the principal, as this may cause issues when another node takes over as master or during a rolling restart of services. In this case, it was set this way because of a bug in Ambari (AMBARI-18528), which mangled the translation based on zookeeper.security.auth_to_local in zookeeper-env.sh. Please go through this bug for the required workaround and other details (adding a backslash in front of the dollar sign in the respective rule).

But why was authentication failing despite a kinit using exactly the same principal as defined in the Zookeeper ACL? The answer lies in these settings in zoo.cfg:

kerberos.removeHostFromPrincipal=true
kerberos.removeRealmFromPrincipal=true

These two settings ensure that every principal authenticating to Zookeeper is stripped of its hostname and realm, so that only a short name is used by the Zookeeper server. The tricky part is that this does not apply to the setAcl API.

SOLUTION: Please note that the regular "rmr" command to delete the HBase znode will fail with "Authentication is not valid" errors, so an alternative is needed; one such method is this link. Also try setting the Java system property zookeeper.skipACL=yes in zookeeper-env.sh. If this does not work, the existing znode must be deleted through more forceful methods, such as stopping HBase and deleting the entire Zookeeper data directory; take this step with utmost caution and only if no other service depends on Zookeeper. Once the HBase znodes have been deleted, use the workaround given in AMBARI-18528 to populate correct ACL entries and finally start HBase.
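The skipACL workaround can be expressed as a one-line zookeeper-env.sh fragment (a sketch: the SERVER_JVMFLAGS variable name is an assumption based on Ambari-managed ZooKeeper setups, and ZooKeeper recognizes the value "yes" for this property):

```shell
# TEMPORARY workaround only: make the ZooKeeper server skip all ACL checks so
# the stale /hbase-secure znode can be deleted. Remove this line and restart
# ZooKeeper as soon as the cleanup is done.
export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Dzookeeper.skipACL=yes"
```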
09-12-2017
03:24 PM
@gsharma can you please advise on this issue I am having: https://community.hortonworks.com/questions/136870/balancer-no-block-has-been-moved-for-5-iterations.html
12-23-2016
09:02 AM
SYMPTOMS: No visible errors in the ResourceManager / NodeManager logs for any resource bottleneck. Logs from the container/task that is not progressing are as follows:

Error: java.io.IOException: com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException: Connection to database server lost; check server and network status [System error: Socket closed]
    at org.apache.sqoop.mapreduce.db.DBRecordReader.close(DBRecordReader.java:173)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:523)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:791)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException: Connection to database server lost; check server and network status [System error: Socket closed]
    at com.sap.db.jdbc.exceptions.jdbc40.SQLNonTransientConnectionException.createException(SQLNonTransientConnectionException.java:40)
    at com.sap.db.jdbc.exceptions.SQLExceptionSapDB.createException(SQLExceptionSapDB.java:252)
    at com.sap.db.jdbc.exceptions.SQLExceptionSapDB.createException(SQLExceptionSapDB.java:214)
    at com.sap.db.jdbc.exceptions.SQLExceptionSapDB.generateSQLException(SQLExceptionSapDB.java:166)
    at com.sap.db.jdbc.exceptions.ConnectionException.createException(ConnectionException.java:22)
    at com.sap.db.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:1117)
    at com.sap.db.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:877)
    at com.sap.db.jdbc.ConnectionSapDB.commitInternal(ConnectionSapDB.java:353)
    at com.sap.db.jdbc.ConnectionSapDB.commit(ConnectionSapDB.java:340)
    at com.sap.db.jdbc.trace.Connection.commit(Connection.java:126)
    at org.apache.sqoop.mapreduce.db.DBRecordReader.close(DBRecordReader.java:169)
    ... 8 more

Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

ROOT CAUSE: The issue appears to be on the SAP HANA side, not the HDP side. The following thread discusses the same error: https://archive.sap.com/discussions/thread/3675080

NEXT STEPS: Contact the SAP HANA support team for further troubleshooting.
12-23-2016
07:53 AM
SYMPTOM:
All the services in the cluster are down, and restarting them fails with the following error:

2016-11-17 21:42:18,235 ERROR namenode.NameNode (NameNode.java:main(1712)) - Failed to start namenode.
java.io.IOException: Login failure for nn/lnx21131.examplet.ex.com@EXAMPLE.AD.EX.COM from keytab /etc/security/keytabs/nn.service.keytab: javax.security.auth.login.LoginException: Client not found in Kerberos database (6)
...
Caused by: KrbException: Client not found in Kerberos database (6)
...
Caused by: KrbException: Identifier doesn't match expected value (906)

Regeneration of keytabs using Ambari also failed, as follows:

17 Nov 2016 23:58:59,136 WARN [Server Action Executor Worker 12702] CreatePrincipalsServerAction:233 - Principal, HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM, does not exist, creating new principal
17 Nov 2016 23:58:59,151 ERROR [Server Action Executor Worker 12702] CreatePrincipalsServerAction:284 - Failed to create or update principal, HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM - Can not create principal : HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM
org.apache.ambari.server.serveraction.kerberos.KerberosOperationException: Can not create principal : HTTP/lnx21142.examplet.ex.com@EXAMPLE.AD.EX.COM
Caused by: javax.naming.NameAlreadyBoundException: [LDAP: error code 68 - 00002071: UpdErr: DSID-0305038D, problem 6005 (ENTRY_EXISTS), data 0
]; remaining name '"cn=HTTP/lnx21142.examplet.ex.com,OU=Hadoop,OU=EXAMPLE_Users,DC=examplet,DC=ad,DC=ex,DC=com"'

ROOT CAUSE:
Wrong entries in all the service accounts (VPN) in AD: the character '/' had been replaced with '_' by a faulty script.

RESOLUTION: Fix the entries in the AD service accounts. In the above case, every '_' was replaced back with '/' in the service accounts in AD.
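The mangling and its reversal can be illustrated with a one-liner (a hypothetical sketch; the real fix was applied to the CN attributes in AD, not via sed):

```shell
# A mangled principal CN as found in AD, and the corrected form: only the
# first '_' (the service/host separator) is restored to '/'.
echo 'HTTP_lnx21142.examplet.ex.com' | sed 's|_|/|'
# → HTTP/lnx21142.examplet.ex.com
```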
12-22-2016
07:49 PM
PROBLEM: Unable to start the Resource Manager, which fails with the errors below:

STARTUP_MSG: build = git@github.com:hortonworks/hadoop.git -r 9e75108092247d96ce7d70839b6945e9eba2a0b7; compiled by 'jenkins' on 2014-11-04T04:31Z
STARTUP_MSG: java = 1.7.0_67
************************************************************/
2014-11-04 08:41:08,705 INFO resourcemanager.ResourceManager (SignalLogger.java:register(91)) - registered UNIX signal handlers for [TERM, HUP, INT]
2014-11-04 08:41:10,636 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to login
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:211)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1229)
Caused by: java.io.IOException: Login failure for rm/ip-172-31-32-22.ec2.internal@EXAMPLE.COM from keytab /etc/security/keytabs/rm.service.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935)
    at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.doSecureLogin(ResourceManager.java:1109)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:209)
    ... 2 more
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
2014-11-04 08:41:10,641 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1077)) - Transitioning to standby state
2014-11-04 08:41:10,642 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1087)) - Transitioned to standby state
2014-11-04 08:41:10,643 FATAL resourcemanager.ResourceManager (ResourceManager.java:main(1233)) - Error starting ResourceManager
(the same YarnRuntimeException and stack trace as above follow here)

ROOT CAUSE: The active RM is using the user principal of the standby RM and vice versa. This is reported in Apache bug YARN-2805 and HDP bug BUG-26831; both have since been resolved.

SOLUTION: If you are on HDP 2.2.0, raise a support case with HWX to get a hotfix.

WORKAROUND: Hardcode the principal entry "rm/_HOST@EXAMPLE.COM" in the YARN configuration in Ambari, replacing the "_HOST" part with the actual hostname of the active and standby resource manager respectively.
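The workaround could look like the following yarn-site.xml fragment on the active ResourceManager host (a sketch: rm1.example.com stands in for that host's actual FQDN, and the standby host carries the same property with its own FQDN):

```xml
<!-- Hardcoded RM principal (workaround only): replace rm1.example.com with
     the FQDN of the host this file lives on. -->
<property>
  <name>yarn.resourcemanager.principal</name>
  <value>rm/rm1.example.com@EXAMPLE.COM</value>
</property>
```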
12-22-2016
03:28 PM
Consider increasing network capacity to overcome the challenge caused by non-locality of blocks.

Create configuration groups of datanodes exclusively for HBase, disabling the HDFS balancer on this group and allowing only the HBase balancer. Follow this url Host_Config_Groups to create host config groups.

A few temporary workarounds can also be applied if the problem is severe and needs immediate attention:

- Disable the HDFS balancer permanently on the cluster and run it manually on an as-needed basis. (Please open a support case and have the situation discussed before implementing this workaround.)
- If the performance issue needs to be fixed after a run of the HDFS balancer, a major compaction can be initiated manually. For performance gains, major compaction is run during off-peak hours such as weekends. This article Compaction_Best_Practices is a recommended read here.
- Schedule major compaction after the scheduled balancer run, rather than vice versa.

HDFS has introduced the "favored nodes" feature, but the HBase APIs are not yet equipped to choose specific nodes when writing data.

Please note that these are expert-level configurations and procedures; if unsure of their implications, it is always recommended to open a support case with us.

Refer to the following Apache URLs to track the progress of the region-block-pinning implementation:
https://issues.apache.org/jira/browse/HBASE-13021
https://issues.apache.org/jira/browse/HDFS-6133
10-21-2016
02:01 PM
SYMPTOM: Immediately after exporting HDFS directories via NFS, some of the directories start throwing permission-denied errors to authorized users added in Ranger policies.

ROOT CAUSE: NFS honors neither Ranger policies nor HDFS ACLs. If a directory has HDFS permission bits such as 000 and access is controlled entirely via Ranger, the directory won't be exported at all. Messages such as the following can be seen in the NFS gateway logs:

2016-07-27 17:35:19,071 INFO mount.RpcProgramMountd (RpcProgramMountd.java:mnt(127)) - Path /test1 is not shared.
2016-07-27 17:35:37,297 INFO mount.RpcProgramMountd (RpcProgramMountd.java:mnt(127)) - Path /test2 is not shared.
2016-07-27 17:39:34,581 INFO mount.RpcProgramMountd (RpcProgramMountd.java:mnt(144)) - Giving handle (fileId:12345) to client for export /

Even if a directory does get exported thanks to some available permission bits, the effective permissions come only from HDFS, not from Ranger policies.
10-21-2016
02:01 PM
SYMPTOMS: Errors such as "KeeperErrorCode = NoAuth for /config/topics"

ROOT CAUSE: Errors such as the above are reported when trying to create or delete a topic as an ordinary user, because only the process owner of the Kafka service (such as root) can write to the Zookeeper znodes, i.e. /config/topics. Ranger policies do not get enforced when a non-privileged user creates a topic because the kafka-topics.sh script talks directly to Zookeeper in order to create topics: it adds entries into the Zookeeper nodes, and the watchers on the broker side monitor them and create topics accordingly. Because this process goes through Zookeeper, authorization cannot be done through the Ranger plugin.

NEXT STEPS: To allow users to create topics, there is a script called kafka-acls.sh, which can allow or deny users on topics, among many other options. The details are elaborated in the document below:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_secure-kafka-ambari/content/ch_secure-kafka-auth-cli.html
10-20-2016
03:30 PM
Linux ACLs are implemented such that default ACLs set on a parent directory are automatically inherited by its child directories, and the umask has no influence on this behavior. HDFS ACLs take a slightly different approach: they honor the umask set via the hdfs-site.xml parameter "fs.permissions.umask-mode" and enforce ACLs on child folders based on both, with the umask taking precedence. Let's try to reproduce this case:

[gaurav@test ~]$ hdfs dfs -mkdir /tmp/acltest
[gaurav@test ~]$ hdfs dfs -setfacl -m default:mask::rwx /tmp/acltest
[gaurav@test ~]$ hdfs dfs -setfacl -m mask::rwx /tmp/acltest
[gaurav@test ~]$ hdfs dfs -setfacl -m default:user:adam:rwx /tmp/acltest
[gaurav@test ~]$ hdfs dfs -setfacl -m user:adam:rwx /tmp/acltest

Let's see what ACLs are implemented:

[gaurav@test ~]$ hdfs dfs -getfacl /tmp/acltest
# file: /tmp/acltest
# owner: gaurav
# group: hdfs
user::rwx
user:adam:rwx
group::r-x
mask::rwx
other::r-x
default:user::rwx
default:user:adam:rwx
default:group::r-x
default:mask::rwx
default:other::r-x

Let's create a child directory now and see the inherited ACLs:

[gaurav@test ~]$ hdfs dfs -mkdir /tmp/acltest/subdir1
[gaurav@test ~]$ hdfs dfs -getfacl /tmp/acltest/subdir1
# file: /tmp/acltest/subdir1
# owner: gaurav
# group: hdfs
user::rwx
user:adam:rwx #effective:r-x
group::r-x
mask::r-x
other::r-x
default:user::rwx
default:user:adam:rwx
default:group::r-x
default:mask::rwx

In our example, the umask was set to 022, and hence the effective ACL on the child directory turned out to be r-x.

REFERENCE: https://issues.apache.org/jira/browse/HDFS-6962
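The inherited-mask arithmetic can be sketched in one line (assuming umask 022, whose group/other digit is 2, and a default mask of rwx, i.e. octal 7):

```shell
# The child's effective mask is the parent's default mask (rwx = 7) with the
# umask bits cleared: 7 & ~2 = 5, i.e. r-x, which is why user:adam ends up
# with effective r-x despite its rwx ACL entry.
printf 'effective mask: %o\n' $(( 7 & ~2 ))
# → effective mask: 5
```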