Member since 03-01-2016 · 104 Posts · 97 Kudos Received · 3 Solutions
12-25-2016
10:30 AM
1 Kudo
SYMPTOMS: When both ResourceManagers become active simultaneously, all NodeManagers crash. Errors visible in the RM logs are as follows:
2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService (AbstractService.java:enterState(452)) - Service: Dispatcher entered state STOPPED
2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher thread interrupted
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
While the AsyncDispatcher is in the hung state, the following errors repeat:
2015-06-27 20:08:35,926 INFO [main] event.AsyncDispatcher (AsyncDispatcher.java:serviceStop(140)) - AsyncDispatcher is draining to stop, igonring any new events.
2015-06-27 20:08:36,926 INFO [main] event.AsyncDispatcher (AsyncDispatcher.java:serviceStop(144)) - Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2015-06-27 20:08:37,927 INFO [main] event.AsyncDispatcher (AsyncDispatcher.java:serviceStop(144)) - Waiting for AsyncDispatcher to drain. Thread state is :WAITING
ROOT CAUSE: This is a known issue reported in YARN-3878.
WORKAROUND: Stop one ResourceManager and manually start the other to resume service.
REFERENCE: https://issues.apache.org/jira/browse/YARN-3878
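Before applying the workaround, the HA state of each ResourceManager can be confirmed with yarn rmadmin -getServiceState <rm-id>. A minimal sketch of detecting the split-brain condition from those reported states (the rm1/rm2 IDs and the helper below are hypothetical illustrations, not part of the original post):

```python
def active_rms(states):
    """Given a mapping of RM id -> HA state (as printed by
    'yarn rmadmin -getServiceState <rm-id>'), return the sorted
    list of active RMs. More than one entry indicates the
    split-brain scenario described above: stop one RM manually."""
    return sorted(rm for rm, state in states.items() if state.lower() == "active")

# Healthy HA pair: exactly one active ResourceManager.
assert active_rms({"rm1": "active", "rm2": "standby"}) == ["rm1"]
# Both active -> the YARN-3878 condition.
assert active_rms({"rm1": "active", "rm2": "active"}) == ["rm1", "rm2"]
```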
12-24-2016
10:21 PM
1 Kudo
SYMPTOMS: Although a valid Kerberos ticket is available, we are unable to put files into an HDFS encryption zone. If we restart or fail over the NameNode, we can then put files with the same ticket/credentials without obtaining a new ticket. Below is a demonstration of the issue, where /tmp/user1 is the encryption zone and the user has permission to that zone:
[root@test ~]# su - user
Last login: Thu Oct 13 13:03:24 EDT 2016 on pts/57
-bash-4.2$ id
uid=11516(user) gid=5000(bns) groups=5000(bns),1520(cmtsuser),1800(admin),4534(edgegrp),4535(edgedgrp),4536(k2tstgrp),8242(ftallocctxd),8243(ftallocctxu),15113(hdpadm)
-bash-4.2$ kinit
Password for user@123.EXAMPLE.COM:
-bash-4.2$ klist
Ticket cache: FILE:/tmp/krb5cc_11516
Default principal: user@123.EXAMPLE.COM
Valid starting Expires Service principal
10/14/2016 07:23:51 10/14/2016 17:23:51 krbtgt/123.EXAMPLE.COM@EXAMPLE.COM
renew until 10/21/2016 07:23:48
-bash-4.2$ hadoop fs -put file1 /tmp/user1/file_1
put: java.util.concurrent.ExecutionException: java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
-bash-4.2$
-bash-4.2$ hadoop fs -put file1 /tmp/file_1
-bash-4.2$ hadoop fs -cat /tmp/file_1
diana
-bash-4.2$
ROOT CAUSE: Service Delegation Token (DT) renewal was not working because the customer's code is missing the token renewer class for KMS. After enabling Hadoop KMS, the cluster works normally only until the interval configured in the property hadoop.kms.authentication.delegation-token.renew-interval.sec elapses. The property is not set in the customer's configuration, and the default is 86400 seconds, i.e. 1 day.
SOLUTION: The following options are available:
If the customer plans to upgrade to a newer version (e.g. HDP 2.5), the problem will not occur, as all fixes are included there.
Otherwise, a hotfix containing those fixes can be provided. Please raise a support case for the same.
REFERENCE:
https://issues.apache.org/jira/browse/HADOOP-13155
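For reference, the renew interval discussed above is configured in kms-site.xml. A sketch of pinning it explicitly to its default of one day (the value shown is the documented default, not a recommendation from the original post):

```xml
<property>
  <name>hadoop.kms.authentication.delegation-token.renew-interval.sec</name>
  <value>86400</value>
  <description>KMS delegation token renew interval in seconds; the default is 86400 (1 day).</description>
</property>
```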
12-24-2016
09:40 PM
2 Kudos
SYMPTOMS: When local disk utilization on multiple NodeManagers goes beyond the configured limit, the nodes turn "unhealthy" and are blacklisted for container/task allocation, reducing the effective cluster capacity.
ROOT CAUSE: A burst or rapid rate of submitted jobs with a substantial NM usercache resource-localization footprint can rapidly fill up the NM local temporary filesystem, with negative consequences for stability. The core issue is that the NM continues to localize resources beyond the maximum local cache size (yarn.nodemanager.localizer.cache.target-size-mb, default 10 GB). Since the maximum local cache size is effectively not taken into account when localizing new resources (note that the default cache cleanup interval is 10 minutes, controlled by yarn.nodemanager.localizer.cache.cleanup.interval-ms), this leads to a self-destruction scenario: once filesystem utilization reaches the 90% threshold, the NM automatically de-registers from the RM, effectively taking the NM offline. This issue can take many NMs offline simultaneously and is therefore quite critical for platform stability.
SOLUTION: Keep larger and/or multiple mount points for these local directories. No consensus has been reached yet in the discussion on whether HDFS could be used for these directories.
REFERENCE: https://issues.apache.org/jira/browse/YARN-5140
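The two NodeManager properties mentioned above live in yarn-site.xml; shown here with their documented default values for reference:

```xml
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>10240</value>
  <!-- default: 10 GB target size for the NM local resource cache -->
</property>
<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>
  <!-- default: cache cleanup runs every 10 minutes -->
</property>
```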
12-24-2016
05:23 PM
2 Kudos
ENVIRONMENT: HDP 2.3.4, Ambari 2.2.1
SYMPTOMS: After creating an encryption zone and attempting to move data into this zone, "Authentication Required" errors are reported in kms-audit.log:
2016-11-15 09:06:40,561 UNAUTHENTICATED RemoteHost:W.X.Y.Z
Method:OPTIONS URL:http://hdp02.example.com:9292/kms/v1/keyversion/e1dw_dev_enc_key%400/_eek?eek_op=decrypt&doAs=test ErrorMsg:'Authentication required'
The issue reproduces only in the Ambari view, not via HDFS commands. The following error is reported in the browser:
500 org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
ROOT CAUSE: For webhdfs to work with TDE, Ranger KMS must be configured to allow the hdfs user to access all keys. This is a configuration issue in Ranger KMS: Ambari deploys a default configuration in which the hdfs user is not allowed. This is a known behavior reported in BUG-45012.
<property>
<name>hadoop.kms.blacklist.DECRYPT_EEK</name>
<value>hdfs</value>
<description>
Blacklist for decrypt EncryptedKey
CryptoExtension operations
</description>
</property>
SOLUTION: Upgrade to HDP 2.3.6.
WORKAROUND: Since it is a security issue, please log a case with the HWX support team for suggestions about possible workarounds.
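To gauge how often the failure occurs before engaging support, the kms-audit.log can be scanned for UNAUTHENTICATED records like the one shown above. A minimal, hypothetical sketch (illustrative only, not a supported tool):

```python
def unauthenticated_entries(audit_lines):
    """Return the raw kms-audit.log lines for UNAUTHENTICATED records,
    e.g. the OPTIONS request to the /_eek?eek_op=decrypt endpoint
    shown in the symptoms above."""
    return [line for line in audit_lines if "UNAUTHENTICATED" in line]

sample = [
    "2016-11-15 09:06:40,561 UNAUTHENTICATED RemoteHost:W.X.Y.Z Method:OPTIONS ErrorMsg:'Authentication required'",
    "2016-11-15 09:06:41,002 OK[op=DECRYPT_EEK] accessing key ...",
]
assert len(unauthenticated_entries(sample)) == 1
```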
12-24-2016
04:31 PM
ROOT CAUSE: The YARN UI shows total memory vs. used memory incorrectly when there are reserved resources. The total shown when there are no reserved resources is the correct one. It can also be cross-checked against the sum of memory across all NodeManagers. This has no impact on YARN scheduler logic. The behavior was a bug in the YARN UI and was fixed by https://issues.apache.org/jira/browse/YARN-3432 and https://issues.apache.org/jira/browse/YARN-3243
REFERENCES: https://issues.apache.org/jira/browse/YARN-3432 https://issues.apache.org/jira/browse/YARN-3243
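One way to do the cross-check mentioned above is to sum per-node memory from the ResourceManager REST endpoint /ws/v1/cluster/nodes. A minimal sketch over an already-fetched payload (the node IDs and values below are made up for illustration):

```python
def total_nm_memory_mb(nodes_payload):
    """Sum usedMemoryMB + availMemoryMB over every NodeManager in a
    /ws/v1/cluster/nodes response. With no reserved resources, this
    should match the 'Memory Total' shown in the YARN UI."""
    nodes = nodes_payload["nodes"]["node"]
    return sum(n["usedMemoryMB"] + n["availMemoryMB"] for n in nodes)

payload = {"nodes": {"node": [
    {"id": "nm1.example.com:45454", "usedMemoryMB": 4096, "availMemoryMB": 4096},
    {"id": "nm2.example.com:45454", "usedMemoryMB": 0, "availMemoryMB": 8192},
]}}
assert total_nm_memory_mb(payload) == 16384
```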
07-27-2018
09:41 AM
Here is my solution https://community.hortonworks.com/questions/208928/increase-open-file-limit-of-the-user-to-scale-for.html
05-13-2018
06:46 PM
Resolved. For me it was a problem with one of the JournalNodes.
12-24-2016
02:36 PM
SYMPTOMS: /tmp filling up causes multiple services to stop functioning.
ROOT CAUSE: The issue happens due to internal SmartSense bug ST-2551.
SOLUTION: Upgrade to SmartSense 1.3.1.
WORKAROUND: To work around this issue, manually modify two files related to SmartSense so that the tmp files are no longer generated in the /tmp directory.
1. File : /usr/hdp/share/hst/hst-agent/lib/hst_agent/anonymize.py
Change from :
ANONYMIZATION_JAVA_COMMAND = "{0}" + os.sep + "bin" + os.sep + "java" +\
 " -Dlog.file.name="+ ANONYMIZATION_LOG_FILE_NAME +\
 " -cp {1} {2} {3}"
Change to :
ANONYMIZATION_JAVA_COMMAND = "{0}" + os.sep + "bin" + os.sep + "java" +\
 " -Djava.io.tmpdir=/grid/02/smartsense/hst-agent/data/tmp" +\
 " -Dlog.file.name="+ ANONYMIZATION_LOG_FILE_NAME +\
 " -cp {1} {2} {3}"
Make sure the tmp dir value is the same as the property agent.tmp_dir in hst-agent-conf.
2. File : /usr/sbin/hst-server.py
Change from :
SERVER_START_CMD = "{0}" + os.sep + "bin" + os.sep +\
 "java -server -XX:NewRatio=3 "\
"-XX:+UseConcMarkSweepGC " +\
"-XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 " +\
debug_options +\
" -Dlog.file.name="+ SERVER_LOG_FILE_NAME +" -Xms512m -Xmx2048m -cp {1}" + os.pathsep + "{2}" +\
" com.hortonworks.support.tools.server.SupportToolServer "\
">" + SERVER_OUT_FILE + " 2>&1 &"
Change to :
SERVER_START_CMD = "{0}" + os.sep + "bin" + os.sep +\
 "java -server -XX:NewRatio=3 "\
 "-XX:+UseConcMarkSweepGC " +\
"-XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 " +\
"-Djava.io.tmpdir=/var/lib/smartsense/hst-server/tmp " +\
debug_options +\
" -Dlog.file.name="+ SERVER_LOG_FILE_NAME +" -Xms512m -Xmx2048m -cp {1}" + os.pathsep + "{2}" +\
" com.hortonworks.support.tools.server.SupportToolServer "\
">" + SERVER_OUT_FILE + " 2>&1 &"
Make sure the tmp dir value is the same as the property server.tmp.dir in hst-server-conf.
3. After the above changes, please clean up the existing .pyc files from both of the above directories, and restart the SmartSense server and agents for the changes to take effect.
12-24-2016
02:00 PM
ENVIRONMENT: All Ambari versions prior to 2.4.x
SYMPTOMS: Intermittent loss of heartbeat to cluster nodes, freezes of the ambari-agent service, and intermittent issues with Ambari alerts and service status updates in the Ambari dashboard.
Ambari-agent logs:
INFO 2016-08-21 19:10:20,080 Heartbeat.py:78 - Building Heartbeat: {responseId = 139566, timestamp = 1471821020080, commandsInProgress = False, componentsMapped = True}
ERROR 2016-08-21 19:10:20,102 HostInfo.py:228 - Checking java processes failed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ambari_agent/HostInfo.py", line 211, in javaProcs
    cmd = open(os.path.join('/proc', pid, 'cmdline'), 'rb').read()
IOError: [Errno 2] No such file or directory: '/proc/24270/cmdline'
Top command output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP TIME DATA COMMAND
10098 root 20 0 54.4g 53g 4540 S 54.5 14.0 18000:11 224 300,00 54g /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/main.py start --expected-hostname=123.example.com
ROOT CAUSE: A race condition in the Python subprocess module. Due to this race condition, in some unlucky cases Python garbage collection was disabled. This usually happened when running alerts, as a number of alerts run shell commands and do so in different threads. This is a known issue reported in AMBARI-17539.
SOLUTION: Upgrade to Ambari 2.4.x.
WORKAROUND: Restart ambari-agent, which fixes the issue temporarily. Log a case with HWX support to get a patch for the bug fix.
12-24-2016
01:24 PM
ENVIRONMENT: HDP 2.3.2, Ambari 2.2.0, JDK 1.7.0_67-b01, Kernel: 3.13.0-48-generic
ERRORS: Last few lines in the NM log before it hit SIGSEGV shows that there was Container Localizer running for a new container:
2016-10-20 01:29:05,810 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(711)) - Created localizer for container_e14_1475595980406_28807_01_000021
[...]
2016-10-20 01:29:08,308 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://user/tmp/hive/xxx/5b0f04c6-ba2d-47dc-85c2-88179a1db407/hive_2016-10-20_01-28-15_091_3309851709548218363-3928/-mr-10007/df6632b4-ec58-4cdf-8ffb-c81460abc266/reduce.xml(->/hadoop/yarn/local/usercache/xxx/filecache/150663/reduce.xml) transitioned from DOWNLOADING to LOCALIZED
- The exception says:
Current thread (0x00007f2c66cc7000): JavaThread "ContainerLocalizer Downloader" [_thread_in_Java, id=14260, stack(0x00007f2c740a3000,0x00007f2c741a4000)]
siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x00000000801f0ffb
- And the stack trace for '14260' shows:
Stack: [0x00007f2c740a3000,0x00007f2c741a4000], sp=0x00007f2c741a0fc8, free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$ClientOperationHeaderProto.getClientNameBytes()Lcom/google/protobuf/ByteString;+0
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$ClientOperationHeaderProto.getSerializedSize()I+48 J 915 C2 com.google.protobuf.CodedOutputStream.computeMessageSize(ILcom/google/protobuf/MessageLite;)I (10 bytes) @ 0x00007f2cad207530 [0x00007f2cad207500+0x30]
j org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$OpReadBlockProto.getSerializedSize()I+30 J 975 C2 com.google.protobuf.AbstractMessageLite.writeDelimitedTo(Ljava/io/OutputStream;)V (40 bytes) @ 0x00007f2cad254124 [0x00007f2cad2540e0+0x44]
j org.apache.hadoop.hdfs.protocol.datatransfer.Sender.send(Ljava/io/DataOutputStream;Lorg/apache/hadoop/hdfs/protocol/datatransfer/Op;Lcom/google/protobuf/Message;)V+60
j org.apache.hadoop.hdfs.protocol.datatransfer.Sender.readBlock(Lorg/apache/hadoop/hdfs/protocol/ExtendedBlock;Lorg/apache/hadoop/security/token/Token;Ljava/lang/String;
JJZLorg/apache/hadoop/hdfs/server/datanode/CachingStrategy;)V+49
j org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(Ljava/lang/String;Lorg/apache/hadoop/hdfs/protocol/ExtendedBlock;Lorg/apache/hadoop/security/token/Token;
JJZLjava/lang/String;Lorg/apache/hadoop/hdfs/net/Peer;Lorg/apache/hadoop/hdfs/protocol/DatanodeID;Lorg/apache/hadoop/hdfs/PeerCache;Lorg/apache/hadoop/hdfs/server/datanode/CachingStrategy;)Lorg/apache/hadoop/hdfs/BlockReader;+43
j org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(Lorg/apache/hadoop/hdfs/net/Peer;)Lorg/apache/hadoop/hdfs/BlockReader;+109
j org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp()Lorg/apache/hadoop/hdfs/BlockReader;+78
[...]
ROOT CAUSE: A segmentation fault in a Java process is usually due to a JVM bug. In this case, the user is on an older JDK version (1.7.0_67-b01). Updating to a more recent 1.7 release should be attempted to see if it resolves the SIGSEGV.