Created 03-29-2016 02:56 PM
The thrift server stops responding and Accumulo crashes. The log shows a lot of these error messages but doesn't really point to what the issue is. Anyone familiar with this?
ERROR: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(UGIAssumingTransportFactory.java:48)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:208)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException
Created 03-29-2016 03:02 PM
That exception isn't a direct cause of the server failing -- it's just saying that an RPC failed (it should be suppressed and not logged out). By "thrift server" do you mean TabletServer?
If so, also check the .out/.err files for the process. There may have been an out-of-memory error that didn't get printed to the log4j file.
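One way to run that check is a small script that scans the process's .out/.err files for an out-of-memory marker. This is only a sketch: the log directory is whatever your installation uses, the function name find_oom_lines is made up for illustration, and the "OutOfMemoryError" marker assumes the JVM wrote its usual error text to stdout/stderr.

```python
from pathlib import Path


def find_oom_lines(log_dir, marker="OutOfMemoryError"):
    """Return (file name, line number, text) for each marker hit in .out/.err files.

    log_dir is an assumption: point it at the directory where your
    TabletServer writes its .out and .err files.
    """
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        # Only the stdout/stderr captures are of interest here,
        # not the log4j .log files.
        if path.suffix not in (".out", ".err"):
            continue
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if marker in line:
                    hits.append((path.name, number, line.rstrip("\n")))
    return hits
```

An empty result does not rule out a memory problem; it only means the marker text was not found in those files.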
Created 03-29-2016 03:21 PM
Created 03-29-2016 04:02 PM
Thanks @Josh Elser
Created 04-01-2016 03:16 PM
@Josh elser
awesome thanks!
Created 03-31-2016 06:29 PM
Is there any timeline for when this bug, https://issues.apache.org/jira/browse/ACCUMULO-4059, will be fixed? We are also seeing the same error (TServers crashing often).
[server.TThreadPoolServer] ERROR: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(UGIAssumingTransportFactory.java:48)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:208)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
    at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:190)
    at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
    at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
    ... 11 more
Created 03-31-2016 06:38 PM
Like I said in my other comment, this exception does not cause the tabletserver to fail. Please collect all logs and .out/.err for the tabletserver after a failure *but before* restarting the process.
Created 03-31-2016 06:57 PM
The errors below were logged on all TServers 22 days ago, and after 22 days all of the TServers went down:
ERROR: Lost tablet server lock (reason = LOCK_DELETED), exiting
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
    at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:190)
    at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
    at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
    ... 11 more
2016-03-09 20:35:36,971 [tserver.TabletServer] INFO : Master requested tablet server halt
Created 03-31-2016 07:22 PM
Great, that's a very helpful message. The Accumulo Master asked this tabletserver to stop. This happens for one of two reasons:
1. The "hold time" for this server (the amount of time that mutations are being held because a flush/minor compaction is in progress) exceeds a threshold. By default, this threshold is 5 minutes; it is set by tserver.hold.time.max in accumulo-site.xml.
2. The Master periodically asks every TabletServer for a status report. If the Master fails to receive a status report from a TabletServer 3 times in a row, it asks that TabletServer to shut down, since the missed reports imply that the TabletServer is in a bad state.
Both of these cases will result in a log message written to the Master log file. Please check the Master log file shortly before 2016-03-09 20:35:36,971 to understand which reason it was.
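To narrow the Master log down to the minutes before the halt, something like the following can help. It is a rough sketch: it assumes each log entry starts with a log4j timestamp in the same "yyyy-MM-dd HH:mm:ss,SSS" form as the excerpts in this thread, and the helper name lines_before and the 10-minute default window are arbitrary choices, not anything defined by Accumulo.

```python
import re
from datetime import datetime, timedelta

# Matches a leading timestamp such as "2016-03-09 20:35:36,971".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def parse_ts(text):
    """Parse a log4j-style timestamp with comma-separated milliseconds."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def lines_before(log_lines, halt_time, window_minutes=10):
    """Return timestamped lines logged within window_minutes before halt_time.

    Lines without a leading timestamp (for example stack-trace frames)
    are skipped, so read the full entry in the log once you have a hit.
    """
    halt = parse_ts(halt_time)
    start = halt - timedelta(minutes=window_minutes)
    selected = []
    for line in log_lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue
        if start <= parse_ts(match.group(1)) <= halt:
            selected.append(line.rstrip("\n"))
    return selected
```

For the halt above you would pass the lines of the Master log file together with "2016-03-09 20:35:36,971" as halt_time.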
Created 04-01-2016 01:50 AM
Is this related to bug https://issues.apache.org/jira/browse/ACCUMULO-4069 ?
This is pulled from another environment where we have the same issue.
It looks like the Master was unable to get a tablet server status report from the TServer 3 times; before that, it failed to find any Kerberos ticket.
From TServer:
2016-03-29 22:48:53,052 [tserver.TabletServer] [server.TThreadPoolServer] ERROR: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
...skipping...
2016-03-30 21:56:49,881 [tserver.TabletServer] INFO : Master requested tablet server halt
From Master server:
unable to get tablet server status XXXYYYY XXX.com:9997[352d68b0c3801b6] org.apache.thrift.transport.TTransportException: GSS initiate failed
2016-03-30 21:56:17,937 [master.Master] ERROR: master:XXXYYYY.XXX.com unable to get tablet server status
From Monitor log:
XXXYYYY1213.fg.XXX.com:9997[152d68b041401b8] org.apache.thrift.transport.TTransportException: GSS initiate failed
2016-03-30 21:56:17,938 [master.Master] ERROR: master:XXXYYYY1 unable to get tablet server status
016-03-30 21:56:47,403 [transport.TSaslTransport] ERROR: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
    at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
    at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
    at org.apache.accumulo.core.rpc.UGIAssumingTransport$1.run(UGIAssumingTransport.java:53)
    at org.apache.accumulo.core.rpc.UGIAssumingTransport$1.run(UGIAssumingTransport.java:49)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.accumulo.core.rpc.UGIAssumingTransport.open(UGIAssumingTransport.java:49)
    at org.apache.accumulo.core.rpc.ThriftUtil.createClientTransport(ThriftUtil.java:298)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:478)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:410)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:388)
    at org.apache.accumulo.core.rpc.ThriftUtil.getClient(ThriftUtil.java:135)
    at org.apache.accumulo.core.rpc.ThriftUtil.getClientNoTimeout(ThriftUtil.java:102)
    at org.apache.accumulo.core.client.impl.MasterClient.getConnection(MasterClient.java:69)
    at org.apache.accumulo.monitor.Monitor.fetchData(Monitor.java:252)
    at org.apache.accumulo.monitor.Monitor$1.run(Monitor.java:486)
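For anyone trying to reason about how the "3 missed status reports in a row" rule from earlier in the thread would play out against a sequence of failures like the ones above, here is a toy model. It is not Accumulo's code: the function name servers_to_halt, the (server, succeeded) input shape, and the server names in any example are all made up, and the only thing taken from the thread is the threshold of 3 consecutive failures.

```python
from collections import defaultdict


def servers_to_halt(poll_results, max_failures=3):
    """Return servers that reach max_failures consecutive failed status polls.

    poll_results is an iterable of (server, succeeded) pairs in time order.
    This is an illustrative model of the rule described in the thread,
    not the Master's actual implementation.
    """
    consecutive = defaultdict(int)
    halted = []
    for server, succeeded in poll_results:
        if succeeded:
            # A successful report resets that server's failure streak.
            consecutive[server] = 0
            continue
        consecutive[server] += 1
        if consecutive[server] == max_failures and server not in halted:
            halted.append(server)
    return halted
```

In this model a single successful status report in between failures keeps a server alive, which is why intermittent "GSS initiate failed" errors may go unnoticed for a long time before a run of three finally triggers the halt.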