Accumulo keeps crashing with error

The thrift server stops responding and Accumulo crashes. The log shows a lot of these error messages but doesn't really point to what the issue is. Anyone familiar with this?

ERROR: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(UGIAssumingTransportFactory.java:48)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:208)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException

Accepted Solutions

Re: Accumulo keeps crashing with error

That exception isn't a direct cause of the server failing -- it's just saying that an RPC failed (it should really be suppressed rather than logged). By "thrift server" do you mean the TabletServer?

If so, also check the .out/.err files for the process. There may have been an out-of-memory error that didn't get printed to the log4j file.
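
As a rough illustration of that check, here is a minimal Python sketch that scans the .out/.err files in a log directory for OutOfMemoryError lines. The function name and the assumption that the files end in exactly .out or .err are mine, not something Accumulo defines; adjust both to your install.

```python
from pathlib import Path

# Assumption: the process's stdout/stderr files end in ".out" or ".err"
# and sit together in one log directory. Rotated files such as
# "name.out.1" are not matched by this sketch.
OOM_MARKER = "java.lang.OutOfMemoryError"


def find_oom_lines(log_dir):
    """Return (file name, line number, text) for each OOM line found."""
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if path.suffix not in (".out", ".err"):
            continue
        # errors="replace" keeps the scan going if a file has odd bytes.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if OOM_MARKER in line:
                    hits.append((path.name, number, line.rstrip()))
    return hits
```

Calling `find_oom_lines(...)` with the tabletserver's log directory (the location depends on your installation) returns an empty list if nothing was found, which would point away from a heap problem.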

Re: Accumulo keeps crashing with error

Thanks @Josh Elser

Re: Accumulo keeps crashing with error

@Josh Elser

Awesome, thanks!

Re: Accumulo keeps crashing with error

@Josh Elser

@Artem Ervits

Is there any timeline for when this bug, https://issues.apache.org/jira/browse/ACCUMULO-4059, will be fixed? We are also seeing the same error (our TServers crash often).

[server.TThreadPoolServer] ERROR: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(UGIAssumingTransportFactory.java:48)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:208)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
    at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:190)
    at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
    at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
    ... 11 more

Re: Accumulo keeps crashing with error

@AR

Like I said in my other comment, this exception does not cause the tabletserver to fail. Please collect all logs and .out/.err for the tabletserver after a failure *but before* restarting the process.

Re: Accumulo keeps crashing with error

@Josh Elser

The errors below were logged on all TServers 22 days ago, and after those 22 days all of the TServers went down.

ERROR: Lost tablet server lock (reason = LOCK_DELETED), exiting

    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: Peer indicated failure: GSS initiate failed
    at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:190)
    at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
    at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
    ... 11 more
2016-03-09 20:35:36,971 [tserver.TabletServer] INFO : Master requested tablet server halt

Re: Accumulo keeps crashing with error

Great, that's a very helpful message. The Accumulo Master asked this tabletserver to stop. This happens for one of two reasons:

1. The "hold time" for this server (the amount of time that mutations are being held because a flush/minor compaction is in progress) exceeds a given threshold. By default, this value is 5 minutes and is defined by tserver.hold.time.max in accumulo-site.xml.

2. The Master periodically asks every tabletserver for a status report. If the Master fails to receive a status report from a TabletServer 3 times in a row, it will request that the TabletServer shut down (as this implies that the tabletserver is in a bad state).

Both of these cases will result in a log message written to the Master log file. Please check the Master log file shortly before 2016-03-09 20:35:36,971 to understand which reason it was.
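
To narrow the Master log down to that window, something like the following Python sketch may help. It assumes only the timestamp layout seen in the log lines quoted in this thread ("2016-03-09 20:35:36,971"); the function name and the 10-minute default are illustrative choices, not anything Accumulo provides.

```python
from datetime import datetime, timedelta

# Timestamp layout copied from the log excerpts in this thread.
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
STAMP_LENGTH = 23  # len("2016-03-09 20:35:36,971")


def lines_before(log_lines, halt_stamp, minutes=10):
    """Yield log lines stamped within `minutes` before halt_stamp (inclusive)."""
    halt = datetime.strptime(halt_stamp, STAMP_FORMAT)
    start = halt - timedelta(minutes=minutes)
    in_window = False
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            # No leading timestamp: a continuation line such as a stack
            # frame. Keep it only if the entry it belongs to was kept.
            if in_window:
                yield line
            continue
        in_window = start <= stamp <= halt
        if in_window:
            yield line
```

For example, open the Master log file (its path depends on your install) and iterate over `lines_before(handle, "2016-03-09 20:35:36,971")` to print only the entries leading up to the halt request.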

Re: Accumulo keeps crashing with error

@Josh Elser

Is this related to bug https://issues.apache.org/jira/browse/ACCUMULO-4069?

This is pulled from another environment where we have same issue.

It looks like the Master was unable to get a tablet server status report from the TServer three times; before that, it failed to find any Kerberos ticket.

From TServer:

2016-03-29 22:48:53,052 [tserver.TabletServer] [server.TThreadPoolServer] ERROR: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
    at org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
...skipping...

2016-03-30 21:56:49,881 [tserver.TabletServer] INFO : Master requested tablet server halt

From Master server:

unable to get tablet server status XXXYYYY XXX.com:9997[352d68b0c3801b6] org.apache.thrift.transport.TTransportException: GSS initiate failed
2016-03-30 21:56:17,937 [master.Master] ERROR: master:XXXYYYY.XXX.com unable to get tablet server status

From Monitor log:

XXXYYYY1213.fg.XXX.com:9997[152d68b041401b8] org.apache.thrift.transport.TTransportException: GSS initiate failed
2016-03-30 21:56:17,938 [master.Master] ERROR: master:XXXYYYY1 unable to get tablet server status

2016-03-30 21:56:47,403 [transport.TSaslTransport] ERROR: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
    at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
    at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
    at org.apache.accumulo.core.rpc.UGIAssumingTransport$1.run(UGIAssumingTransport.java:53)
    at org.apache.accumulo.core.rpc.UGIAssumingTransport$1.run(UGIAssumingTransport.java:49)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.accumulo.core.rpc.UGIAssumingTransport.open(UGIAssumingTransport.java:49)
    at org.apache.accumulo.core.rpc.ThriftUtil.createClientTransport(ThriftUtil.java:298)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:478)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:410)
    at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:388)
    at org.apache.accumulo.core.rpc.ThriftUtil.getClient(ThriftUtil.java:135)
    at org.apache.accumulo.core.rpc.ThriftUtil.getClientNoTimeout(ThriftUtil.java:102)
    at org.apache.accumulo.core.client.impl.MasterClient.getConnection(MasterClient.java:69)
    at org.apache.accumulo.monitor.Monitor.fetchData(Monitor.java:252)
    at org.apache.accumulo.monitor.Monitor$1.run(Monitor.java:486)
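
To see how often each of these messages shows up across the TServer, Master, and Monitor logs, a small tally like the Python sketch below could be used. The marker strings are copied from the excerpts above; the function name is made up for illustration, and a count by itself does not establish which failure came first or caused the others.

```python
from collections import Counter

# Marker strings copied verbatim from the log excerpts in this thread.
MARKERS = (
    "unable to get tablet server status",
    "GSS initiate failed",
    "Failed to find any Kerberos tgt",
    "Master requested tablet server halt",
)


def tally_markers(log_lines):
    """Count how many lines contain each marker string."""
    counts = Counter()
    for line in log_lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts
```

Running it over each log separately makes it easier to compare, for instance, how many Kerberos failures were logged before the halt request appears.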