Failed to locate tablet for table : !0 row : ~err_

Expert Contributor

Hello,

I receive the following messages from Accumulo every 10 seconds:

monitor_de-hd-cluster.name-node.com.debug.log:

2016-08-22 07:43:14,841 [impl.ThriftScanner] DEBUG:  Failed to locate tablet for table : !0 row : ~err_ 
2016-08-22 07:43:23,167 [monitor.Monitor] INFO :  Failed to obtain problem reports 
java.lang.RuntimeException: org.apache.accumulo.core.client.impl.ThriftScanner$ScanTimedOutException
    at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:161)
    at org.apache.accumulo.server.problems.ProblemReports$3.hasNext(ProblemReports.java:252)
    at org.apache.accumulo.server.problems.ProblemReports.summarize(ProblemReports.java:310)
    at org.apache.accumulo.monitor.Monitor.fetchData(Monitor.java:346)
    at org.apache.accumulo.monitor.Monitor$1.run(Monitor.java:486)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.accumulo.core.client.impl.ThriftScanner$ScanTimedOutException
    at org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:230)
    at org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:80)
    at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:151)
    ... 6 more
2016-08-22 07:43:23,510 [impl.ThriftScanner] DEBUG:  Failed to locate tablet for table : !0 row : ~err_ 
2016-08-22 07:43:26,533 [monitor.Monitor] INFO :  Failed to obtain problem reports 
java.lang.RuntimeException: org.apache.accumulo.core.client.impl.ThriftScanner$ScanTimedOutException
    at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:161)
    at org.apache.accumulo.server.problems.ProblemReports$3.hasNext(ProblemReports.java:252)
    at org.apache.accumulo.server.problems.ProblemReports.summarize(ProblemReports.java:310)
    at org.apache.accumulo.monitor.Monitor.fetchData(Monitor.java:346)
    at org.apache.accumulo.monitor.Monitor$1.run(Monitor.java:486)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.accumulo.core.client.impl.ThriftScanner$ScanTimedOutException
    at org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:230)
    at org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:80)
    at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:151)
    ... 6 more

[Screenshot of cluster memory usage attached]

After stopping Accumulo, the alternating memory usage was gone.

The cluster is not used by anyone and has nothing to do.

Attached are all debug log files taken after a restart of Accumulo.

Could anyone assist?

🙂 Klaus

5 REPLIES

Super Guru

Hi Klaus,

2016-08-22 07:43:14,841 [impl.ThriftScanner] DEBUG: Failed to locate tablet for table : !0 row : ~err_

This exception is telling you that the Monitor is trying to read the row "~err_" from the accumulo.metadata table, but it is failing to find where the tablet containing that row is hosted. This likely means that the accumulo.metadata tablet which contains this row is not assigned to any TabletServer.
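
As a quick sanity check, you could try reading that row directly from the Accumulo shell (just a rough sketch; -b only tells the scan to begin at the ~err_ row). If the hosting metadata tablet really is unassigned, this scan will hang or time out just like the Monitor's does:

scan -np -t accumulo.metadata -b ~err_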

2016-08-22 07:43:26,533 [monitor.Monitor] INFO :  Failed to obtain problem reports 
java.lang.RuntimeException: org.apache.accumulo.core.client.impl.ThriftScanner$ScanTimedOutException
    at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:161)
    at org.apache.accumulo.server.problems.ProblemReports$3.hasNext(ProblemReports.java:252)
    at org.apache.accumulo.server.problems.ProblemReports.summarize(ProblemReports.java:310)
    at org.apache.accumulo.monitor.Monitor.fetchData(Monitor.java:346)
    at org.apache.accumulo.monitor.Monitor$1.run(Monitor.java:486)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.accumulo.core.client.impl.ThriftScanner$ScanTimedOutException
    at org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:230)
    at org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:80)
    at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:151)
    ... 6 more

I believe this timeout is happening because of the previously mentioned failure to locate the tablet.

I would check the Accumulo monitor page and see if you have any unassigned tablets (you should be able to see a red number). There will likely be accompanying messages in the "Recent Logs" page of the monitor which explain why the tablet is not being assigned. If you see no obvious errors, I would recommend restarting the Accumulo Master.
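
If you prefer the command line to the monitor UI, the admin tooling can show the same information. For instance (assuming the usual HDP layout with the launcher at /usr/bin/accumulo), checkTablets lists offline/unassigned tablets and GetMasterStats prints the Master's view of the tablet servers and the unassigned-tablet count:

/usr/bin/accumulo admin checkTablets
/usr/bin/accumulo org.apache.accumulo.test.GetMasterStats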

Regarding the "alternating memory usage", I don't have any explanation for that off the top of my head.

Expert Contributor

Hello Josh,

Thanks for your quick reply. I thought that the peaks in memory usage had something to do with the table issue.

On the Accumulo monitor page I now see:

[Two screenshots of the Accumulo monitor page attached]

In recent logs I see only this warning:

[fs.VolumeManagerImpl] WARN : dfs.datanode.synconclose set to false in hdfs-site.xml: data loss is possible on hard system reset or power loss

After a restart I see:

2016-08-23 09:19:30,318 [replication.WorkDriver] DEBUG: Sleeping 30000 ms before next work assignment
2016-08-23 09:19:36,776 [master.Master] DEBUG: Finished gathering information from 1 servers in 0.00 seconds
2016-08-23 09:19:36,776 [master.Master] DEBUG: not balancing because there are unhosted tablets: 1
2016-08-23 09:19:43,087 [recovery.RecoveryManager] DEBUG: Unable to initate log sort for hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9: java.io.FileNotFoundException: File does not exist: /apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2835)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:733)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:663)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)

2016-08-23 09:19:43,611 [state.ZooTabletStateStore] DEBUG: root tablet logSet [hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9]
2016-08-23 09:19:43,611 [state.ZooTabletStateStore] DEBUG: Returning root tablet state: +r<<@(null,de-hd-cluster.data-node3.com:9997[25669407cc8000b],de-hd-cluster.data-node3.com:9997[25669407cc8000b])
2016-08-23 09:19:43,611 [recovery.RecoveryManager] DEBUG: Recovering hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9 to hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/recovery/91ece971-7485-4acf-aa7f-dcde00fafce9
2016-08-23 09:19:43,614 [conf.AccumuloConfiguration] INFO : Loaded class : org.apache.accumulo.server.master.recovery.HadoopLogCloser
2016-08-23 09:19:43,615 [recovery.RecoveryManager] INFO : Starting recovery of hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9 (in : 300s), tablet +r<< holds a reference
2016-08-23 09:19:43,615 [master.Master] DEBUG: [Root Table]: scan time 0.00 seconds
2016-08-23 09:19:43,615 [master.Master] DEBUG: [Root Table] sleeping for 60.00 seconds
2016-08-23 09:19:46,779 [master.Master] DEBUG: Finished gathering information from 1 servers in 0.00 seconds
2016-08-23 09:19:46,779 [master.Master] DEBUG: not balancing because there are unhosted tablets: 1
2016-08-23 09:19:56,782 [master.Master] DEBUG: Finished gathering information from 1 servers in 0.00 seconds
2016-08-23 09:19:56,782 [master.Master] DEBUG: not balancing because there are unhosted tablets: 1
2016-08-23 09:20:00,318 [replication.WorkDriver] DEBUG: Sleeping 30000 ms before next work assignment
2016-08-23 09:20:06,785 [master.Master] DEBUG: Finished gathering information from 1 servers in 0.00 seconds
2016-08-23 09:20:06,785 [master.Master] DEBUG: not balancing because there are unhosted tablets: 1
2016-08-23 09:20:16,788 [master.Master] DEBUG: Finished gathering information from 1 servers in 0.00 seconds
2016-08-23 09:20:16,788 [master.Master] DEBUG: not balancing because there are unhosted tablets: 1
2016-08-23 09:24:44,144 [conf.AccumuloConfiguration] INFO : Loaded class : org.apache.accumulo.server.master.recovery.HadoopLogCloser
2016-08-23 09:24:44,144 [recovery.RecoveryManager] INFO : Starting recovery of hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9 (in : 300s), tablet +r<< holds a reference

Here the tables in Hadoop:

root@NameNode:~# hadoop fs -ls -R /apps/accumulo/data/tables/
drwxr-xr-x   - accumulo hdfs          0 2016-04-19 14:16 /apps/accumulo/data/tables/!0
drwxr-xr-x   - accumulo hdfs          0 2016-08-08 13:33 /apps/accumulo/data/tables/!0/default_tablet
-rw-r--r--   3 accumulo hdfs        871 2016-08-08 13:33 /apps/accumulo/data/tables/!0/default_tablet/F0002flt.rf
drwxr-xr-x   - accumulo hdfs          0 2016-08-10 10:57 /apps/accumulo/data/tables/!0/table_info
-rw-r--r--   3 accumulo hdfs        933 2016-08-08 10:14 /apps/accumulo/data/tables/!0/table_info/A0002bqu.rf
-rw-r--r--   3 accumulo hdfs        933 2016-08-08 10:19 /apps/accumulo/data/tables/!0/table_info/A0002bqx.rf
-rw-r--r--   3 accumulo hdfs        122 2016-08-10 10:57 /apps/accumulo/data/tables/!0/table_info/A004gpfm.rf_tmp
-rw-r--r--   3 accumulo hdfs        688 2016-08-08 13:33 /apps/accumulo/data/tables/!0/table_info/F0002fl0.rf
drwxr-xr-x   - accumulo hdfs          0 2016-04-19 14:16 /apps/accumulo/data/tables/+r
drwxr-xr-x   - accumulo hdfs          0 2016-08-10 10:57 /apps/accumulo/data/tables/+r/root_tablet
-rw-r--r--   3 accumulo hdfs        974 2016-08-08 10:19 /apps/accumulo/data/tables/+r/root_tablet/A0002bqz.rf
-rw-r--r--   3 accumulo hdfs         16 2016-08-10 10:57 /apps/accumulo/data/tables/+r/root_tablet/A004gpfl.rf_tmp
-rw-r--r--   3 accumulo hdfs        754 2016-08-10 10:13 /apps/accumulo/data/tables/+r/root_tablet/C004eodm.rf
-rw-r--r--   3 accumulo hdfs        364 2016-08-10 10:18 /apps/accumulo/data/tables/+r/root_tablet/F004ew4v.rf
-rw-r--r--   3 accumulo hdfs        364 2016-08-10 10:29 /apps/accumulo/data/tables/+r/root_tablet/F004fdch.rf
-rw-r--r--   3 accumulo hdfs        364 2016-08-10 10:34 /apps/accumulo/data/tables/+r/root_tablet/F004fn1f.rf
-rw-r--r--   3 accumulo hdfs        364 2016-08-10 10:39 /apps/accumulo/data/tables/+r/root_tablet/F004ftix.rf
-rw-r--r--   3 accumulo hdfs        364 2016-08-10 10:44 /apps/accumulo/data/tables/+r/root_tablet/F004g3af.rf
-rw-r--r--   3 accumulo hdfs        364 2016-08-10 10:54 /apps/accumulo/data/tables/+r/root_tablet/F004glat.rf
drwxr-xr-x   - accumulo hdfs          0 2016-04-19 14:16 /apps/accumulo/data/tables/+rep
drwxr-xr-x   - accumulo hdfs          0 2016-04-19 14:16 /apps/accumulo/data/tables/+rep/default_tablet
drwxr-xr-x   - accumulo hdfs          0 2016-04-19 14:18 /apps/accumulo/data/tables/1
drwxr-xr-x   - accumulo hdfs          0 2016-08-10 10:57 /apps/accumulo/data/tables/1/default_tablet
-rw-r--r--   3 accumulo hdfs    2524936 2016-07-23 23:11 /apps/accumulo/data/tables/1/default_tablet/A0002041.rf
-rw-r--r--   3 accumulo hdfs    1502864 2016-07-29 11:17 /apps/accumulo/data/tables/1/default_tablet/C00024ci.rf
-rw-r--r--   3 accumulo hdfs     899175 2016-08-03 18:50 /apps/accumulo/data/tables/1/default_tablet/C00028be.rf
-rw-r--r--   3 accumulo hdfs    1428721 2016-08-07 13:21 /apps/accumulo/data/tables/1/default_tablet/C0002av5.rf
-rw-r--r--   3 accumulo hdfs     211245 2016-08-08 05:11 /apps/accumulo/data/tables/1/default_tablet/C0002bj6.rf
-rw-r--r--   3 accumulo hdfs      30474 2016-08-08 07:42 /apps/accumulo/data/tables/1/default_tablet/C0002bn1.rf
-rw-r--r--   3 accumulo hdfs      50286 2016-08-08 10:03 /apps/accumulo/data/tables/1/default_tablet/C0002bqh.rf
-rw-r--r--   3 accumulo hdfs        122 2016-08-10 10:57 /apps/accumulo/data/tables/1/default_tablet/C004gpfk.rf_tmp
-rw-r--r--   3 accumulo hdfs        905 2016-08-08 13:28 /apps/accumulo/data/tables/1/default_tablet/F0002byb.rf

The command:

root@hdp-accumulo-instance> scan -np -t accumulo.root

hangs.

Do you know how I can get rid of this table?

🙂 Klaus

Expert Contributor

Additionally, here is what I have done:

tables -l
accumulo.metadata    =>        !0
accumulo.replication =>      +rep
accumulo.root        =>        +r
trace                =>         1

checkTablets: the scan gets stuck.

/usr/bin/accumulo admin checkTablets
2016-08-23 12:19:18,521 [fs.VolumeManagerImpl] WARN : dfs.datanode.synconclose set to false in hdfs-site.xml: data loss is possible on hard system reset or power loss

*** Looking for offline tablets ***

Scanning zookeeper
+r<<@(null,de-hd-cluster.data-node3.com:9997[25669407cc8000b],de-hd-cluster.data-node3.com:9997[25669407cc8000b]) is ASSIGNED_TO_DEAD_SERVER  #walogs:1

*** Looking for missing files ***

Scanning : accumulo.root (-inf,~ : [] 9223372036854775807 false)

GetMasterStats told me:

/usr/bin/accumulo org.apache.accumulo.test.GetMasterStats
2016-08-23 11:15:21,623 [fs.VolumeManagerImpl] WARN : dfs.datanode.synconclose set to false in hdfs-site.xml: data loss is possible on hard system reset or power loss
State: NORMAL
Goal State: NORMAL
Unassigned tablets: 1
Dead tablet servers count: 0
Tablet Servers
 Name: de-hd-cluster.data-node3.com:9997
  Ingest: 0.00
  Last Contact: 1471943720583
  OS Load Average: 0.12
  Queries: 0.00
  Time Difference: 1.3
  Total Records: 0
  Lookups: 0
  Recoveries: 0

🙂 Klaus

Super Guru (accepted solution)
2016-08-23 09:19:43,087 [recovery.RecoveryManager] DEBUG: Unable to initate log sort for hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9: java.io.FileNotFoundException: File does not exist: /apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9

This error is the reason all of your tables are offline. You are missing a WAL file. Did you delete this file by hand? Is HDFS healthy (not missing any blocks)? If you did not do anything, you can grep for the file name "91ece971-7485-4acf-aa7f-dcde00fafce9" over the Accumulo Master and GarbageCollector log files to see if either of these services reports (incorrectly) deleting this file.
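For example (hedging on the exact log locations; on a typical HDP install the service logs land under /var/log/accumulo, but adjust the paths to your setup), something like this would find any mention of that WAL in the logs and also have HDFS report missing or corrupt blocks under the Accumulo volume:

grep -r "91ece971-7485-4acf-aa7f-dcde00fafce9" /var/log/accumulo/
hdfs fsck /apps/accumulo/data/wal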

Because this WAL is missing, the accumulo.root table cannot be brought online (Accumulo knows it would be missing data). These are system tables; you cannot simply delete them.

Your only option, as I presently see it, is to re-initialize. It appears that you have no other data in the system, which I assume makes this a reasonable approach.

First, stop Accumulo via Ambari, and then, in a shell as root on the node where the Accumulo Master is installed:

# su - accumulo
$ hdfs dfs -rmr /apps/accumulo/data/*
$ ACCUMULO_CONF_DIR=/etc/accumulo/conf/secure accumulo init

This will completely remove all Accumulo data and then re-initialize the system. Then, restart Accumulo via Ambari. The system should be in the same state as after your original installation.
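Once Accumulo is back up, a quick way to verify the re-initialization (just a sketch; the instance name and table IDs will be whatever the new init produces, and the shell will prompt for the root password) is to confirm the data directory was recreated in HDFS and that the accumulo.* system tables are listed again:

hdfs dfs -ls /apps/accumulo/data
accumulo shell -u root -e "tables -l"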

Expert Contributor

At my site, this is what worked:

ACCUMULO_CONF_DIR=/etc/accumulo/conf/server accumulo init

After the init, no further issues were found.

Many thanks for your detailed help!

🙂 Klaus