Member since: 02-08-2016
Posts: 12
Kudos Received: 10
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2147 | 02-09-2016 09:57 AM
10-13-2017
08:36 AM
By mistake /var/hadoop/hdfs/data was in my list of DataNode directories. I have removed this in Ambari and restarted all services successfully. I was under the impression that /var/hadoop/hdfs/data/current/BP-* would then be deleted but it is still there and taking up space. Is it safe to delete it by hand?
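Before deleting anything by hand I'd at least confirm that the path is really gone from dfs.datanode.data.dir; a minimal check (just a sketch, run on the DataNode, assuming the hdfs client is on the PATH) could be:

import subprocess

# Ask HDFS which DataNode data directories are currently configured.
out = subprocess.check_output(
    ["hdfs", "getconf", "-confKey", "dfs.datanode.data.dir"]).decode()
configured = [d.strip() for d in out.split(",")]
print("Configured DataNode dirs:", configured)

# Only treat the old block-pool directory as removable if its parent is
# genuinely absent from the configuration (entries may carry a [DISK] prefix).
if not any(d.rstrip("/").endswith("/var/hadoop/hdfs/data") for d in configured):
    print("/var/hadoop/hdfs/data is no longer configured; its current/BP-* "
          "contents look like leftovers.")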
Labels:
- Apache Hadoop
09-19-2017
11:40 AM
Way back when, someone set up our Ambari truststore (ambari-server-truststore.jks) and the Ranger keystore (ranger-admin-keystore.jks). These files contain certificates for the domain controllers we use. Lately I've been noticing some LDAP queries getting SSL handshake errors, and I've tracked this down to the fact that some of the certificates for the DCs in the truststore/keystore have expired. This means I need to update the jks files with the new keys. Is there a script to do this for all of my domain controllers? Something using openssl and then importing with keytool? I've searched long and hard but to no avail. This must be a common task; someone must have a script. Any suggestions?
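For the record, the kind of thing I have in mind is roughly the sketch below (the DC list, LDAPS port, store paths, aliases and passwords are all placeholders, it assumes each DC serves its certificate on the LDAPS port, and it needs Python 3.7+ for subprocess.run with capture_output):

import subprocess

DCS = ["dc1.xxx.yyy.com", "dc2.xxx.yyy.com"]          # placeholder DC names
LDAPS_PORT = 636
STORES = ["/etc/ambari-server/conf/ambari-server-truststore.jks",
          "/etc/ranger/admin/conf/ranger-admin-keystore.jks"]  # placeholder paths
STOREPASS = "changeit"                                 # placeholder password

for dc in DCS:
    # Fetch the certificate the DC currently presents and convert it to PEM.
    s_client = subprocess.run(
        ["openssl", "s_client", "-connect", "%s:%d" % (dc, LDAPS_PORT)],
        input=b"", capture_output=True, check=True)
    pem = subprocess.run(["openssl", "x509", "-outform", "PEM"],
                         input=s_client.stdout, capture_output=True,
                         check=True).stdout
    cert_file = "/tmp/%s.pem" % dc
    with open(cert_file, "wb") as f:
        f.write(pem)

    for store in STORES:
        # Drop the expired entry (ignore the error if the alias is absent),
        # then import the freshly fetched certificate under the same alias.
        subprocess.run(["keytool", "-delete", "-alias", dc,
                        "-keystore", store, "-storepass", STOREPASS])
        subprocess.run(["keytool", "-importcert", "-noprompt", "-alias", dc,
                        "-file", cert_file, "-keystore", store,
                        "-storepass", STOREPASS], check=True)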
Labels:
- Apache Ambari
- Apache Ranger
06-23-2017
05:05 AM
After testing, it has turned out that this is indeed what works:

Client {
com.sun.security.auth.module.Krb5LoginModule required
debug=false
renewTGT=false
useKeyTab=true
keyTab="/etc/security/keytabs/opentsdb.service.keytab"
principal="opentsdb/host.cluster@XXX.YYY.COM"
useTicketCache=false;
};

When starting OpenTSDB I see this:

2017-06-22 13:52:57,366 INFO [Thread-1] Login: TGT refresh thread started.
2017-06-22 13:52:57,376 INFO [Thread-1] Login: TGT valid starting at: Thu Jun 22 13:52:57 CEST 2017
2017-06-22 13:52:57,376 INFO [Thread-1] Login: TGT expires: Thu Jun 22 23:52:57 CEST 2017
2017-06-22 13:52:57,376 INFO [Thread-1] Login: TGT refresh sleeping until: Thu Jun 22 22:06:24 CEST 2017

And then it refreshes the TGT when it said it would (22:06):

2017-06-22 22:06:24,667 INFO [Thread-1] Login: Initiating logout for opentsdb/host.cluster@XXX.YYY.COM
2017-06-22 22:06:24,668 INFO [Thread-1] Login: Initiating re-login for opentsdb/host.cluster@XXX.YYY.COM
2017-06-22 22:06:24,677 INFO [Thread-1] Login: TGT valid starting at: Thu Jun 22 22:06:24 CEST 2017
2017-06-22 22:06:24,677 INFO [Thread-1] Login: TGT expires: Fri Jun 23 08:06:24 CEST 2017
2017-06-22 22:06:24,677 INFO [Thread-1] Login: TGT refresh sleeping until: Fri Jun 23 06:19:34 CEST 2017

The other two variations of the JAAS file that I tried ended up complaining because of various kinit errors.
06-22-2017
07:22 AM
But this is the whole problem, isn't it? OpenTSDB is running as root and hence tries to renew from the /tmp/krb5cc_0 file:

Caused by: org.apache.zookeeper.Shell$ExitCodeException: kinit: No credentials cache found (filename: /tmp/krb5cc_0) while renewing credentials

But there is no /tmp/krb5cc_0 file, since when starting, OpenTSDB just reads from the keytab and never saves the TGT info into a krb5cc_uid file. Because of this the renewal won't work at all. I'm beginning to think I need to use:

Client {
com.sun.security.auth.module.Krb5LoginModule required debug=false
renewTGT=false
useKeyTab=true
keyTab="/etc/security/keytabs/opentsdb.service.keytab"
principal="opentsdb/host.cluster@XXX.YYY.COM"
useTicketCache=false;
};

But then I'm not sure what will happen when the TGT expires; is the system clever enough to get the TGT from the keytab again? What does storeKey=true do in your example file?
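The keytab side of that is at least easy to sanity-check from the shell; a small sketch (using a throwaway credential cache so root's /tmp/krb5cc_0 is left alone) to confirm that a fresh TGT can be obtained from the keytab on demand:

import subprocess

KEYTAB = "/etc/security/keytabs/opentsdb.service.keytab"
PRINCIPAL = "opentsdb/host.cluster@XXX.YYY.COM"
CACHE = "/tmp/krb5cc_opentsdb_check"  # throwaway cache just for this test

# Obtain a brand-new TGT directly from the keytab, bypassing any ticket cache.
subprocess.check_call(["kinit", "-kt", KEYTAB, "-c", CACHE, PRINCIPAL])

# Show the start and expiry times of the freshly obtained TGT.
subprocess.check_call(["klist", "-c", CACHE])

# Remove the throwaway cache again.
subprocess.check_call(["kdestroy", "-c", CACHE])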
06-20-2017
05:55 AM
I was thinking of trying:

Client {
com.sun.security.auth.module.Krb5LoginModule required
debug=false
renewTGT=true
useKeyTab=true
storeKey=true
keyTab="/etc/security/keytabs/opentsdb.service.keytab"
principal="opentsdb/host.cluster@XXX.YYY.COM"
useTicketCache=false;
};
Wouldn't this then try to renew the TGT, but use the keytab file instead of the ticket cache?
06-16-2017
11:22 AM
I'm using OpenTSDB in a kerberized cluster. I start OpenTSDB as root using:

CLASSPATH=$CLASSPATH:/home/applications/opentsdb/conf/ JVMARGS="${JVMARGS} -enableassertions -enablesystemassertions -Djava.security.auth.login.config=/home/applications/opentsdb/conf/opentsdb_client_jaas.conf" /home/applications/opentsdb/opentsdb-2.3.0/build/tsdb tsd --config /home/applications/opentsdb/conf/opentsdb.conf

The jaas config file looks like this:

Client {
com.sun.security.auth.module.Krb5LoginModule required debug=false
renewTGT=true
useKeyTab=true
keyTab="/etc/security/keytabs/opentsdb.service.keytab"
principal="opentsdb/host.cluster@XXX.YYY.COM"
useTicketCache=true;
};

Everything starts just fine and in the OpenTSDB log file I see:

Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #3 async.auth.KerberosClientAuthProvider
Client will use GSSAPI as SASL mechanism.
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #3 async.auth.KerberosClientAuthProvider
Connecting to hbase/host.cluster@XXX.YYY.COM
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #1 async.HBaseClient
Added client for region RegionInfo(table="tsdb", region_name="tsdb,,1497401874292.983451b817366a624c42c20e7c91af67.", stop_key="\x0B\x00\t\xD7S2Q"), which was added to the regions cache. Now we know that RegionClient@785572588(chan=null, #pending_rpcs=0, #batched=0, #rpcs_inflight=0) is hosting 1 region.
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #2 async.auth.KerberosClientAuthProvider
Client will use GSSAPI as SASL mechanism.
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #2 async.auth.KerberosClientAuthProvider
Connecting to hbase/host.cluster@XXX.YYY.COM
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #1 async.HBaseClient
Added client for region RegionInfo(table="tsdb-uid", region_name="tsdb-uid,,1482497591937.0049eec9a851bc64e12ed2a0540192eb.", stop_key=""), which was added to the regions cache. Now we know that RegionClient@599240979(chan=null, #pending_rpcs=0, #batched=0, #rpcs_inflight=0) is hosting 1 region.
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #1 async.SecureRpcHelper96
SASL client context established. Negotiated QoP: auth on for: RegionClient@159145664(chan=null, #pending_rpcs=2, #batched=0, #rpcs_inflight=0)
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #1 async.RegionClient
Initialized security helper: org.hbase.async.SecureRpcHelper96@4ce85bd8 for region client: RegionClient@159145664(chan=null, #pending_rpcs=2, #batched=0, #rpcs_inflight=0)
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #1 async.auth.KerberosClientAuthProvider
Client will use GSSAPI as SASL mechanism.
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #1 async.auth.KerberosClientAuthProvider
Connecting to hbase/host.cluster@XXX.YYY.COM
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #1 async.auth.Login
Initialized kerberos login context
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #1 async.auth.Login
Scheduled ticket renewal in 29266667 ms
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #1 async.auth.Login
TGT expires: Fri Jun 16 09:55:12 CEST 2017
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #1 async.auth.Login
TGT valid starting at: Thu Jun 15 23:55:12 CEST 2017
Thu Jun 15 23:55:12 GMT+200 2017 INFO AsyncHBase I/O Worker #1 async.auth.Login
Successfully logged in

The TGT is granted for 10 hours. OpenTSDB says that it will try and renew the TGT in a little over 8 hours. When it does try and renew the TGT I see the following:

Thu Jun 15 06:26:39 GMT+200 2017 ERROR AsyncHBase Timer HBaseClient #1 async.auth.Login
Failed to renew ticket
java.lang.RuntimeException: Could not renew TGT due to problem running shell command: '/usr/bin/kinit -R';
at org.hbase.async.auth.Login.refreshTicketCache(Login.java:340) ~[asynchbase-1.7.2.jar:na]
at org.hbase.async.auth.Login.access$100(Login.java:61) ~[asynchbase-1.7.2.jar:na]
at org.hbase.async.auth.Login$TicketRenewalTask.run(Login.java:386) ~[asynchbase-1.7.2.jar:na]
at org.jboss.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:556) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:632) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:369) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.9.4.Final.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]
Caused by: org.apache.zookeeper.Shell$ExitCodeException: kinit: No credentials cache found (filename: /tmp/krb5cc_0) while renewing credentials
at org.apache.zookeeper.Shell.runCommand(Shell.java:225) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
at org.apache.zookeeper.Shell.run(Shell.java:152) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
at org.apache.zookeeper.Shell$ShellCommandExecutor.execute(Shell.java:345) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
at org.apache.zookeeper.Shell.execCommand(Shell.java:431) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
at org.apache.zookeeper.Shell.execCommand(Shell.java:414) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
at org.hbase.async.auth.Login.refreshTicketCache(Login.java:338) ~[asynchbase-1.7.2.jar:na]
... 7 common frames omitted

This part:

Caused by: org.apache.zookeeper.Shell$ExitCodeException: kinit: No credentials cache found (filename: /tmp/krb5cc_0) while renewing credentials

leads me to think that it's trying to renew the TGT for the root user and isn't using the OpenTSDB keytab file. How can I get this to work properly? Can I set useTicketCache=false; in the jaas file? I do not have an OpenTSDB user on the cluster; only the service principals exist in the AD.
Labels:
- Apache HBase
- Apache Zookeeper
- Kerberos
- Security
06-02-2017
11:14 AM
We've done this and changed the HDFS configuration in Ambari to have:

net.topology.script.file.name=/etc/hadoop/conf/topology_script.py

The only problem is that when we restart HDFS this file gets overwritten. How do I stop this behaviour?
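One workaround that seems plausible (as far as I can tell, Ambari regenerates the files it manages under /etc/hadoop/conf on every restart) is to keep the custom script and its mapping data somewhere Ambari does not touch, e.g. /etc/hadoop/topology/, and point net.topology.script.file.name there; the paths below are just examples. The script itself is tiny, along these lines:

#!/usr/bin/env python
# Minimal rack-topology script sketch. Hadoop invokes it with one or more
# host names / IP addresses as arguments and expects one rack path per
# argument on stdout. Paths below are examples, not fixed locations.
import sys

MAPPING_FILE = "/etc/hadoop/topology/topology.data"  # "host rack" per line
DEFAULT_RACK = "/default-rack"

rack_of = {}
with open(MAPPING_FILE) as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            rack_of[parts[0]] = parts[1]

for host in sys.argv[1:]:
    # Unknown hosts fall back to the default rack.
    print(rack_of.get(host, DEFAULT_RACK))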
02-09-2016
12:05 PM
5 Kudos
The cluster has 1 management node (Bright Cluster Manager and Ambari server), 2 NameNodes (1 active, 1 passive) and 17 DataNodes, and is running Hortonworks HDP 2.3.2 and Ambari 2.1.2. Each node has 2 10GbE NICs which are bonded together, and jumbo frames (MTU=9000) are enabled on the interfaces. There are sporadic NodeManager Web UI alerts in Ambari: for all 17 DataNodes we get connection timeouts throughout the day. These timeouts are not correlated with any sort of load on the system; they happen no matter what. When the connection to port 8042 is successful the connect time is around 5-7 ms, but when the connection fails I get response times of 5 seconds. Never 3 seconds or 6 seconds, always 5 seconds. For example...

[root@XXXX ~]# python2.7 YARN_response.py
Testing response time at http://XXXX:8042
Output is written if http response is > 1 second.
Press Ctrl-C to exit!
2016-02-08 07:19:17.877947 Host: XX23:8042 conntime - 5.0073 seconds, HTTP response - 200
2016-02-08 07:19:22.889430 Host: XX25:8042 conntime - 5.0078 seconds, HTTP response - 200
2016-02-08 07:19:48.466520 Host: XX15:8042 conntime - 5.0071 seconds, HTTP response - 200
2016-02-08 07:20:24.423817 Host: XX15:8042 conntime - 5.0073 seconds, HTTP response - 200
2016-02-08 07:20:29.449196 Host: XX23:8042 conntime - 5.0073 seconds, HTTP response - 200
2016-02-08 07:21:00.190991 Host: XX19:8042 conntime - 5.0077 seconds, HTTP response - 200
2016-02-08 07:21:05.210073 Host: XX24:8042 conntime - 5.0073 seconds, HTTP response - 200
2016-02-08 07:21:28.738996 Host: XX17:8042 conntime - 5.0078 seconds, HTTP response - 200
2016-02-08 07:21:33.747728 Host: XX18:8042 conntime - 5.0086 seconds, HTTP response - 200
2016-02-08 07:21:38.764546 Host: XX22:8042 conntime - 5.0075 seconds, HTTP response - 200

If I let the script run long enough then every DataNode will eventually turn up. It turns out that this is a DNS issue and the solution is to put options single-request in /etc/resolv.conf on all nodes. This option is described in the man page as such:

single-request (since glibc 2.10)
Sets RES_SNGLKUP in _res.options. By default, glibc performs IPv4 and IPv6 lookups in parallel since version 2.9. Some appliance DNS servers cannot handle these queries properly and make the requests time out. This option disables the behavior and makes glibc perform the IPv6 and IPv4 requests sequentially (at the cost of some slowdown of the resolving process).

Cluster performance is now as expected.
02-09-2016
09:57 AM
1 Kudo
For once I can solve my own problem. 🙂 It turns out that this is a DNS issue and the solution is to put options single-request in /etc/resolv.conf on all nodes. This option is described in the man page as such:

single-request (since glibc 2.10)
Sets RES_SNGLKUP in _res.options. By default, glibc performs IPv4 and IPv6 lookups in parallel since version 2.9. Some appliance DNS servers cannot handle these queries properly and make the requests time out. This option disables the behavior and makes glibc perform the IPv6 and IPv4 requests sequentially (at the cost of some slowdown of the resolving process).

Cluster performance is now as expected.
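For anyone applying the same fix, the change is a single extra options line in /etc/resolv.conf; something along these lines (the search domain and nameserver entries below are placeholders):

# /etc/resolv.conf (illustrative only)
search cluster.example.com
options single-request
nameserver 10.0.0.10
nameserver 10.0.0.11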
02-08-2016
12:01 PM
2 Kudos
I have a fairly new cluster with 1 management node (Bright Cluster Manager and Ambari server), 2 NameNodes (1 active, 1 passive) and 17 DataNodes. We're running Hortonworks HDP 2.3.2 and Ambari 2.1.2. Each node has 2 10GbE NICs which are bonded together, and jumbo frames (MTU=9000) are enabled on the interfaces. From the very beginning of the cluster we have been receiving sporadic NodeManager Web UI alerts in Ambari. For all 17 DataNodes we get connection timeouts throughout the day. These timeouts are not correlated with any sort of load on the system; they happen no matter what. When the connection to port 8042 is successful the connect time is around 5-7 ms, but when the connection fails I get response times of 5 seconds. Never 3 seconds or 6 seconds, always 5 seconds. For example...

[root@XXXX ~]# python2.7 YARN_response.py
Testing response time at http://XXXX:8042
Output is written if http response is > 1 second.
Press Ctrl-C to exit!
2016-02-08 07:19:17.877947 Host: XX23:8042 conntime - 5.0073 seconds, HTTP response - 200
2016-02-08 07:19:22.889430 Host: XX25:8042 conntime - 5.0078 seconds, HTTP response - 200
2016-02-08 07:19:48.466520 Host: XX15:8042 conntime - 5.0071 seconds, HTTP response - 200
2016-02-08 07:20:24.423817 Host: XX15:8042 conntime - 5.0073 seconds, HTTP response - 200
2016-02-08 07:20:29.449196 Host: XX23:8042 conntime - 5.0073 seconds, HTTP response - 200
2016-02-08 07:21:00.190991 Host: XX19:8042 conntime - 5.0077 seconds, HTTP response - 200
2016-02-08 07:21:05.210073 Host: XX24:8042 conntime - 5.0073 seconds, HTTP response - 200
2016-02-08 07:21:28.738996 Host: XX17:8042 conntime - 5.0078 seconds, HTTP response - 200
2016-02-08 07:21:33.747728 Host: XX18:8042 conntime - 5.0086 seconds, HTTP response - 200
2016-02-08 07:21:38.764546 Host: XX22:8042 conntime - 5.0075 seconds, HTTP response - 200
If I let the script run long enough then every DataNode will eventually turn up. Has anyone out there ever seen something like this? Because of the discrete connection time I'm thinking it must be some kind of timeout that is happening. My network team says that the top-of-rack switches all look good. I'm running out of ideas. Any suggestions?
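For what it's worth, the probe is nothing fancy; a rough sketch of the idea (an illustration only, not the actual YARN_response.py; the host list is a placeholder and it only times the name lookup plus TCP connect rather than a full HTTP request) looks like this:

import datetime
import socket
import time

HOSTS = ["XX15", "XX17", "XX18", "XX19", "XX22", "XX23", "XX24", "XX25"]  # placeholders
PORT = 8042
THRESHOLD = 1.0  # only report connections slower than one second

while True:
    for host in HOSTS:
        start = time.time()
        try:
            # create_connection() resolves the name and opens the TCP
            # connection, so a slow DNS lookup shows up in this timing.
            sock = socket.create_connection((host, PORT), timeout=10)
            sock.close()
            elapsed = time.time() - start
            if elapsed > THRESHOLD:
                print("%s Host: %s:%d conntime - %.4f seconds"
                      % (datetime.datetime.now(), host, PORT, elapsed))
        except socket.error as exc:
            print("%s Host: %s:%d failed: %s"
                  % (datetime.datetime.now(), host, PORT, exc))
        time.sleep(5)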
Labels:
- Apache Hadoop
- Apache YARN