Accumulo keeps crashing with error

Expert Contributor

The thrift server stops responding and Accumulo crashes. The log shows a lot of these error messages but doesn't really point to what the issue is. Anyone familiar with this?

ERROR: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(UGIAssumingTransportFactory.java:48)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:208)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException

1 ACCEPTED SOLUTION

Super Guru

That exception isn't a direct cause of the server failing; it only says that an RPC failed (it should be suppressed rather than logged). By "thrift server" do you mean TabletServer?

If so, also check the .out/.err files for the process. There may have been an out-of-memory issue that didn't get printed to the log4j file.
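
A quick way to run that check is to scan the .out/.err files for the JVM's out-of-memory message. The following is only a minimal sketch: the function name `find_oom_lines` is made up for illustration, and it assumes the files sit directly in one log directory and that an out-of-memory event shows up as the standard `java.lang.OutOfMemoryError` text.

```python
from pathlib import Path

# Assumption: the JVM reports memory exhaustion with this standard class name.
OOM_MARKER = "java.lang.OutOfMemoryError"


def find_oom_lines(log_dir, suffixes=(".out", ".err")):
    """Return (file name, line number, text) for each line that mentions an OOM.

    Only files directly inside log_dir whose suffix is in `suffixes` are read.
    """
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # errors="replace" keeps the scan going if a file has stray bytes.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if OOM_MARKER in line:
                    hits.append((path.name, number, line.rstrip()))
    return hits
```

Call it with the directory that holds the TabletServer's .out/.err files, for example `find_oom_lines("/var/log/accumulo")` (that path is a placeholder, use your own). An empty result only means the marker was not found; it does not rule out other causes.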

10 REPLIES

Expert Contributor

Thanks @Josh Elser

Expert Contributor

@Josh Elser

Awesome, thanks!

Expert Contributor

@Josh Elser

@Artem Ervits

Is there any timeline for when this bug, https://issues.apache.org/jira/browse/ACCUMULO-4059, will be fixed? We are also seeing the same error (TServers crashing often).

[server.TThreadPoolServer] ERROR: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(UGIAssumingTransportFactory.java:48)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:208)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
    at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:190)
    at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
    at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
    ... 11 more

Super Guru
@AR

Like I said in my other comment, this exception does not cause the tabletserver to fail. Please collect all logs and .out/.err for the tabletserver after a failure *but before* restarting the process.
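
One way to make sure nothing is lost is to snapshot the log directory into an archive before the process is restarted. This is a rough sketch only: `snapshot_logs` and the archive name are made up for illustration, and it assumes the log4j files and the .out/.err files all live directly in one directory and that the destination directory is somewhere else.

```python
import tarfile
import time
from pathlib import Path


def snapshot_logs(log_dir, dest_dir):
    """Pack every regular file in log_dir into a timestamped .tar.gz in dest_dir.

    Assumption: dest_dir is not log_dir itself, otherwise the archive being
    written would be picked up as one of the files to pack.
    """
    log_dir, dest_dir = Path(log_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # Hypothetical naming scheme; any unique name works.
    archive = dest_dir / f"tserver-logs-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in sorted(log_dir.iterdir()):
            if path.is_file():
                tar.add(path, arcname=path.name)
    return archive
```

Run it on the failed node first, then restart the TabletServer; the returned path is the archive to attach or inspect.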

Expert Contributor

@Josh Elser

The errors below were logged on all TServers 22 days ago, and after 22 days all of the TServers went down.

ERROR: Lost tablet server lock (reason = LOCK_DELETED), exiting

    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
    at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:190)
    at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
    at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
    ... 11 more
2016-03-09 20:35:36,971 [tserver.TabletServer] INFO : Master requested tablet server halt

Super Guru

Great, that's a very helpful message. The Accumulo Master asked this tabletserver to stop. This happens for one of two reasons:

1. The "hold time" for this server (the amount of time that mutations are being held because of a flush/minor-compaction that is in progress) exceeds a given threshold. By default, this value is 5 minutes and is defined by tserver.hold.time.max in accumulo-site.xml.

2. The Master periodically asks every tabletserver for a status report. If the Master fails to receive a status report from a TabletServer 3 times in a row, it will request that the TabletServer shut down (as this implies that the tabletserver is in a bad state).

Both of these cases will result in a log message written to the Master log file. Please check the Master log file shortly before 2016-03-09 20:35:36,971 to understand which reason it was.
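
If the Master log is large, the lines shortly before the halt can be pulled out with a small script. This is a sketch under assumptions: `lines_before` is an illustrative helper, the 10-minute window is an arbitrary choice, and it assumes each log entry starts with a timestamp in the same `YYYY-MM-DD HH:MM:SS,mmm` form as the lines quoted in this thread.

```python
import re
from datetime import datetime, timedelta

# Assumption: log entries start with a timestamp like "2016-03-09 20:35:36,971".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")
FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def lines_before(log_lines, halt_time, minutes=10, keyword=None):
    """Return timestamped lines logged in the `minutes` leading up to halt_time.

    If keyword is given, only lines containing it are kept.
    """
    halt = datetime.strptime(halt_time, FORMAT)
    start = halt - timedelta(minutes=minutes)
    selected = []
    for line in log_lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # stack-trace continuation lines carry no timestamp
        when = datetime.strptime(match.group(1), FORMAT)
        if start <= when <= halt and (keyword is None or keyword in line):
            selected.append(line.rstrip("\n"))
    return selected
```

For example, open the Master log file and call `lines_before(handle, "2016-03-09 20:35:36,971", keyword="ERROR")`, then read the result to see which of the two reasons applies.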

Expert Contributor

@Josh Elser

Is this related to bug https://issues.apache.org/jira/browse/ACCUMULO-4069?

This is pulled from another environment where we have the same issue.

It looks like the Master was unable to get a status report from the TServer 3 times; before that, it failed to find any Kerberos ticket.

From TServer:

2016-03-29 22:48:53,052 [tserver.TabletServer] [server.TThreadPoolServer] ERROR: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
...skipping...

2016-03-30 21:56:49,881 [tserver.TabletServer] INFO : Master requested tablet server halt

From Master server:

unable to get tablet server status XXXYYYY XXX.com:9997[352d68b0c3801b6] org.apache.thrift.transport.TTransportException: GSS initiate failed
2016-03-30 21:56:17,937 [master.Master] ERROR: master:XXXYYYY.XXX.com unable to get tablet server status

From Monitor log:

XXXYYYY1213.fg.XXX.com:9997[152d68b041401b8] org.apache.thrift.transport.TTransportException: GSS initiate failed
2016-03-30 21:56:17,938 [master.Master] ERROR: master:XXXYYYY1 unable to get tablet server status

2016-03-30 21:56:47,403 [transport.TSaslTransport] ERROR: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
    at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
    at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
    at org.apache.accumulo.core.rpc.UGIAssumingTransport$1.run(UGIAssumingTransport.java:53)
    at org.apache.accumulo.core.rpc.UGIAssumingTransport$1.run(UGIAssumingTransport.java:49)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.accumulo.core.rpc.UGIAssumingTransport.open(UGIAssumingTransport.java:49)
    at org.apache.accumulo.core.rpc.ThriftUtil.createClientTransport(ThriftUtil.java:298)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:478)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:410)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:388)
    at org.apache.accumulo.core.rpc.ThriftUtil.getClient(ThriftUtil.java:135)
    at org.apache.accumulo.core.rpc.ThriftUtil.getClientNoTimeout(ThriftUtil.java:102)
    at org.apache.accumulo.core.client.impl.MasterClient.getConnection(MasterClient.java:69)
    at org.apache.accumulo.monitor.Monitor.fetchData(Monitor.java:252)
    at org.apache.accumulo.monitor.Monitor$1.run(Monitor.java:486)