Member since: 03-01-2016
Posts: 104
Kudos Received: 97
Solutions: 3
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2204 | 06-03-2018 09:22 PM |
| | 33750 | 05-21-2018 10:31 PM |
| | 2902 | 10-19-2016 07:13 AM |
12-25-2016 11:03 AM
SYMPTOM: In a cluster that enforces authorization through Ranger, a CREATE TABLE statement fails, and shortly afterwards the HiveServer2 process crashes. The table creation fails as follows:
0: jdbc:hive2://xxxx.hk.example.com> CREATE EXTERNAL TABLE TMP_HIVE2PHOENIX_E32E8 (CUSTOMER_ID STRING, ACCOUNT_ID STRING, ROLE_ID STRING, ROLE_NAME STRING, START_DATE STRING, END_DATE STRING, PRIORITY STRING, ACTIVE_ACCOUNT_ROLE STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/tmp/example'
TBLPROPERTIES ('serialization.null.format'='');
Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0)
The following errors are observed in hiveserver2.log:
2016-11-15 11:42:06,721 WARN [HiveServer2-Handler-Pool: Thread-32350]: thrift.ThriftCLIService (ThriftCLIService.java:ExecuteStatement(492)) - Error executing statement:
org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: HiveAccessControlException Permission denied: user [1503524] does not have [READ] privilege on [hdfs://hadooprad/tmp/hive2phoenix_e32e8]
...
Caused by: org.apache.hadoop.hive.ql.security.authorization.plugin.HiveAccessControlException: Permission denied: user [xxxx] does not have [READ] privilege on [hdfs://hadooprad/tmp/example]
at org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizer.checkPrivileges(RangerHiveAuthorizer.java:253)
Along with the above errors, hiveserver2.log also shows repeated GC pauses, after which the HiveServer2 service crashes:
2016-11-15 12:39:54,428 WARN [org.apache.hadoop.util.JvmPauseMonitor$Monitor@24197b13]: util.JvmPauseMonitor (JvmPauseMonitor.java:run(192)) - Detected pause in JVM or host machine (eg GC): pause of approximately 24000ms
GC pool 'PS MarkSweep' had collection(s): count=6 time=26445ms
ROOT CAUSE: HIVE-10022 / Hortonworks internal BUG-42569/BUG-67204. To check a permission (read or write) on a path referenced by a query, Ranger checks permissions on the given directory and all of its children. However, if the directory does not exist, it checks the parent directory, then that directory's parent, and so on. Eventually the table creation fails; at the same time, this operation uses too much memory and causes GC pauses.
In this case, Ranger checks for permission on /tmp/<databasename>; since that path does not exist, it starts checking /tmp/ and all of its child directories, causing the GC pauses and the HiveServer2 service crash.
RESOLUTION: The fix is not part of any current HDP release. Contact Hortonworks Technical Support to check whether a hotfix is possible for your version.
WORKAROUND: Ensure that the storage location specified in the CREATE TABLE statement already exists on the file system.
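For example, a minimal sketch of the workaround for the statement above, assuming the /tmp/example location and a hypothetical table owner user1:
hadoop fs -mkdir -p /tmp/example            # pre-create the location so Ranger never walks up to /tmp
hadoop fs -chown user1:hadoop /tmp/example  # hand it to the table owner (user1 is a hypothetical user)
hadoop fs -ls -d /tmp/example               # confirm it exists before running CREATE TABLE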
12-25-2016 10:49 AM
SYMPTOM: All services in the cluster are down, and restarting them fails with the following error:
2016-11-17 21:42:18,235 ERROR namenode.NameNode (NameNode.java:main(1712)) - Failed to start namenode.
java.io.IOException: Login failure for nn/lxxx.examplet.ex.com@EXAMPLE.AD.EX.COM from keytab /etc/security/keytabs/nn.service.keytab: javax.security.auth.login.LoginException: Client not found in Kerberos database (6)
...
Caused by: KrbException: Client not found in Kerberos database (6)
...
Caused by: KrbException: Identifier doesn't match expected value (906)
Regenerating the keytabs through Ambari also failed, as follows:
17 Nov 2016 23:58:59,136 WARN [Server Action Executor Worker 12702] CreatePrincipalsServerAction:233 - Principal, HTTP/xxx.examplet.ex.com@EXAMPLE.AD.EX.COM, does not exist, creating new principal
17 Nov 2016 23:58:59,151 ERROR [Server Action Executor Worker 12702] CreatePrincipalsServerAction:284 - Failed to create or update principal, HTTP/xxx.examplet.ex.com@EXAMPLE.AD.EX.COM - Can not create principal : HTTP/xxx.examplet.ex.com@EXAMPLE.AD.EX.COM
org.apache.ambari.server.serveraction.kerberos.KerberosOperationException: Can not create principal : HTTP/xxx.examplet.ex.com@EXAMPLE.AD.EX.COM
Caused by: javax.naming.NameAlreadyBoundException: [LDAP: error code 68 - 00002071: UpdErr: DSID-0305038D, problem 6005 (ENTRY_EXISTS), data 0
]; remaining name '"cn=HTTP/lxxx.examplet.ex.com,OU=Hadoop,OU=EXAMPLE_Users,DC=examplet,DC=ad,DC=ex,DC=com"'
ROOT CAUSE: Incorrect entries for the service accounts in AD: a faulty script had replaced the character '/' with '_' in the account names.
RESOLUTION: Fix the entries in the AD service accounts. In the case above, every '_' was replaced back with '/' in the service accounts in AD.
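A quick check, sketched with the NameNode principal and keytab path taken from the error message above, that a keytab matches what the KDC/AD expects:
klist -kt /etc/security/keytabs/nn.service.keytab    # list the principals stored in the keytab
kinit -kt /etc/security/keytabs/nn.service.keytab nn/lxxx.examplet.ex.com@EXAMPLE.AD.EX.COM
# kinit keeps failing with "Client not found in Kerberos database" while the AD entry is still wrong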
12-25-2016 10:30 AM
1 Kudo
SYMPTOMS: Both ResourceManagers become active simultaneously, and all NodeManagers crash. The following errors are visible in the RM logs:
2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService (AbstractService.java:enterState(452)) - Service: Dispatcher entered state STOPPED
2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher thread interrupted
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
While the AsyncDispatcher is in this hung state, the following errors keep repeating:
2015-06-27 20:08:35,926 INFO [main] event.AsyncDispatcher (AsyncDispatcher.java:serviceStop(140)) - AsyncDispatcher is draining to stop, igonring any new events.
2015-06-27 20:08:36,926 INFO [main] event.AsyncDispatcher (AsyncDispatcher.java:serviceStop(144)) - Waiting for AsyncDispatcher to drain. Thread state is :WAITING
2015-06-27 20:08:37,927 INFO [main] event.AsyncDispatcher (AsyncDispatcher.java:serviceStop(144)) - Waiting for AsyncDispatcher to drain. Thread state is :WAITING
ROOT CAUSE: This is a known issue, reported in YARN-3878.
WORKAROUND: Stop one ResourceManager and start the other manually to resume services.
REFERENCE: https://issues.apache.org/jira/browse/YARN-3878
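A sketch of the workaround using the yarn CLI, assuming the HA ids rm1 and rm2 from yarn-site.xml and standard HDP daemon script locations:
yarn rmadmin -getServiceState rm1    # check what each ResourceManager reports
yarn rmadmin -getServiceState rm2
# If both claim to be active, restart one of them, e.g. on the rm2 host:
/usr/hdp/current/hadoop-yarn-resourcemanager/sbin/yarn-daemon.sh stop resourcemanager
/usr/hdp/current/hadoop-yarn-resourcemanager/sbin/yarn-daemon.sh start resourcemanager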
12-24-2016 10:21 PM
1 Kudo
SYMPTOMS: Although a valid Kerberos ticket is available, we are unable to put files into an HDFS encryption zone. If the NameNode is restarted or failed over, we can put files with the same ticket/credentials without obtaining a new ticket. Below is a demonstration of the issue, where /tmp/user1 is the encryption zone and the user has permission to that zone:
[root@test ~]# su - user
Last login: Thu Oct 13 13:03:24 EDT 2016 on pts/57
-bash-4.2$ id
uid=11516(user) gid=5000(bns) groups=5000(bns),1520(cmtsuser),1800(admin),4534(edgegrp),4535(edgedgrp),4536(k2tstgrp),8242(ftallocctxd),8243(ftallocctxu),15113(hdpadm)
-bash-4.2$ kinit
Password for user@123.EXAMPLE.COM:
-bash-4.2$ klist
Ticket cache: FILE:/tmp/krb5cc_11516
Default principal: user@123.EXAMPLE.COM
Valid starting Expires Service principal
10/14/2016 07:23:51 10/14/2016 17:23:51 krbtgt/123.EXAMPLE.COM@EXAMPLE.COM
renew until 10/21/2016 07:23:48
-bash-4.2$ hadoop fs -put file1 /tmp/user1/file_1
put: java.util.concurrent.ExecutionException: java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
-bash-4.2$
-bash-4.2$ hadoop fs -put file1 /tmp/file_1
-bash-4.2$ hadoop fs -cat /tmp/file_1
diana
-bash-4.2$
ROOT CAUSE: Delegation token (DT) renewal was not working because the customer's build is missing the token renewer class for KMS. After Hadoop KMS is enabled, the cluster works normally only until the interval configured in the property hadoop.kms.authentication.delegation-token.renew-interval.sec is reached. The property is not set in the customer's configuration, so the default of 86400 seconds (essentially one day) applies.
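For reference, a minimal kms-site.xml sketch of the interval mentioned above (86400 seconds is the default; the value here is only illustrative):
<property>
  <name>hadoop.kms.authentication.delegation-token.renew-interval.sec</name>
  <value>86400</value>
  <!-- KMS delegation tokens must be renewed within this interval; with the
       renewer missing, writes to encryption zones start failing once it elapses -->
</property>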
SOLUTION: The following options are available:
If the customer plans to upgrade to a newer version (e.g. HDP 2.5), the problem does not exist there, as all the fixes are included.
Otherwise, a hotfix that includes those fixes can be provided; please raise a support case for the same.
REFERENCE:
https://issues.apache.org/jira/browse/HADOOP-13155
12-24-2016 09:40 PM
2 Kudos
SYMPTOMS: When local disk utilization on multiple NodeManagers rises beyond a limit, the nodes turn "unhealthy" and are blacklisted, so they are no longer used for container/task allocation, reducing the effective cluster capacity.
ROOT CAUSE: A burst or rapid rate of submitted jobs with a substantial NM usercache localization footprint can rapidly fill up the NM local temporary file system, with negative consequences for stability. The core issue is that the NM continues to localize resources beyond the maximum local cache size (yarn.nodemanager.localizer.cache.target-size-mb, default 10 GB). Since the maximum local cache size is effectively not taken into account when localizing new resources (note that the default cache cleanup interval is 10 minutes, controlled by yarn.nodemanager.localizer.cache.cleanup.interval-ms), this leads to a self-destruction scenario: once file system utilization reaches the 90% threshold, the NM automatically de-registers from the RM, effectively taking the NM offline. This can take many NMs offline simultaneously and is therefore quite critical for platform stability. A configuration sketch follows below.
SOLUTION: Use larger and/or multiple mount points for these local directories. No consensus has been reached yet on whether an HDFS file system could be used for these directories.
REFERENCE: https://issues.apache.org/jira/browse/YARN-5140
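A yarn-site.xml sketch of the properties discussed above; the /grid/* mount points are hypothetical examples:
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/grid/01/yarn/local,/grid/02/yarn/local,/grid/03/yarn/local</value>
</property>
<property>
  <!-- target size of the localized resource cache (default 10 GB) -->
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>10240</value>
</property>
<property>
  <!-- cache cleanup interval (default 10 minutes) -->
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>
</property>
<property>
  <!-- utilization at which a local dir is marked bad (default 90%) -->
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>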
12-24-2016 05:23 PM
2 Kudos
ENVIRONMENT: HDP 2.3.4, Ambari 2.2.1
SYMPTOMS: After creating an encryption zone and attempting to move data into that zone, "Authentication required" errors are reported in kms-audit.log:
2016-11-15 09:06:40,561 UNAUTHENTICATED RemoteHost:W.X.Y.Z Method:OPTIONS URL:http://hdp02.example.com:9292/kms/v1/keyversion/e1dw_dev_enc_key%400/_eek?eek_op=decrypt&doAs=test ErrorMsg:'Authentication required'
The issue reproduces only through the Ambari view and not via HDFS commands. The following error is reported in the browser:
500 org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
ROOT CAUSE: For WebHDFS to work with TDE, Ranger KMS must be configured to allow the hdfs user to access all keys. This is a configuration issue in Ranger KMS: Ambari deploys a default configuration in which the hdfs user is not allowed. This is a known behavior, reported in BUG-45012:
<property>
  <name>hadoop.kms.blacklist.DECRYPT_EEK</name>
  <value>hdfs</value>
  <description>
    Blacklist for decrypt EncryptedKey CryptoExtension operations
  </description>
</property>
SOLUTION: Upgrade to HDP 2.3.6.
WORKAROUND: Since this is a security issue, please log a case with the HWX support team for suggestions about possible workarounds.
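To confirm whether a cluster is still running the problematic default, the active KMS configuration can be inspected; a sketch, assuming the usual Ranger KMS configuration directory on HDP:
# show the DECRYPT_EEK blacklist currently in effect
grep -A 2 'hadoop.kms.blacklist.DECRYPT_EEK' /etc/ranger/kms/conf/kms-site.xml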
12-24-2016 04:31 PM
ROOT CAUSE: The YARN UI shows total memory vs. used memory incorrectly when there are reserved resources. The total shown when there are no reserved resources is the correct one; it can also be cross-checked against the sum of all NodeManager memory resources. This has no impact on YARN scheduler logic. The behavior was a bug in the YARN UI and was fixed by https://issues.apache.org/jira/browse/YARN-3432 and https://issues.apache.org/jira/browse/YARN-3243
REFERENCES: https://issues.apache.org/jira/browse/YARN-3432 https://issues.apache.org/jira/browse/YARN-3243
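One way to do that cross-check is the ResourceManager REST API, sketched here with a placeholder host; totalMB, allocatedMB, reservedMB and availableMB in the response come straight from the scheduler:
# dump the cluster-level memory metrics as JSON
curl -s 'http://rm-host.example.com:8088/ws/v1/cluster/metrics'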
07-27-2018 09:41 AM
Here is my solution: https://community.hortonworks.com/questions/208928/increase-open-file-limit-of-the-user-to-scale-for.html
05-13-2018 06:46 PM
Resolved. For me, it was a problem with one of the JournalNodes (JN).
12-24-2016 02:36 PM
SYMPTOMS: /tmp filling up causes multiple services to stop functioning.
ROOT CAUSE: The issue happens due to internal SmartSense bug ST-2551.
SOLUTION: Upgrade to SmartSense 1.3.1.
WORKAROUND: To work around this issue, manually modify two SmartSense files so that the tmp files are no longer generated in the /tmp directory.
1. File: /usr/hdp/share/hst/hst-agent/lib/hst_agent/anonymize.py
Change from:
ANONYMIZATION_JAVA_COMMAND = "{0}" + os.sep + "bin" + os.sep + "java" +\
    " -Dlog.file.name="+ ANONYMIZATION_LOG_FILE_NAME +\
    " -cp {1} {2} {3}"
Change to:
ANONYMIZATION_JAVA_COMMAND = "{0}" + os.sep + "bin" + os.sep + "java" +\
    " -Djava.io.tmpdir=/grid/02/smartsense/hst-agent/data/tmp" +\
    " -Dlog.file.name="+ ANONYMIZATION_LOG_FILE_NAME +\
    " -cp {1} {2} {3}"
Make sure the tmp dir value is the same as the agent.tmp_dir property in hst-agent-conf.
2. File: /usr/sbin/hst-server.py
Change from:
SERVER_START_CMD = "{0}" + os.sep + "bin" + os.sep +\
    "java -server -XX:NewRatio=3 "\
    "-XX:+UseConcMarkSweepGC " +\
    "-XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 " +\
    debug_options +\
    " -Dlog.file.name="+ SERVER_LOG_FILE_NAME +" -Xms512m -Xmx2048m -cp {1}" + os.pathsep + "{2}" +\
    " com.hortonworks.support.tools.server.SupportToolServer "\
    ">" + SERVER_OUT_FILE + " 2>&1 &"
Change to:
SERVER_START_CMD = "{0}" + os.sep + "bin" + os.sep +\
    "java -server -XX:NewRatio=3 "\
    "-XX:+UseConcMarkSweepGC " +\
    "-XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 " +\
    "-Djava.io.tmpdir=/var/lib/smartsense/hst-server/tmp " +\
    debug_options +\
    " -Dlog.file.name="+ SERVER_LOG_FILE_NAME +" -Xms512m -Xmx2048m -cp {1}" + os.pathsep + "{2}" +\
    " com.hortonworks.support.tools.server.SupportToolServer "\
    ">" + SERVER_OUT_FILE + " 2>&1 &"
Make sure the tmp dir value is the same as the server.tmp.dir property in hst-server-conf.
3. After the above changes, please clean up the existing .pyc files from both of the above directories (see the sketch below), and restart the SmartSense server and agents for the changes to take effect.