Created on 08-22-2016 07:14 AM - edited 08-19-2019 04:10 AM
Hello,
I receive the following messages from Accumulo every 10 seconds:
monitor_de-hd-cluster.name-node.com.debug.log:
2016-08-22 07:43:14,841 [impl.ThriftScanner] DEBUG: Failed to locate tablet for table : !0 row : ~err_
2016-08-22 07:43:23,167 [monitor.Monitor] INFO : Failed to obtain problem reports
java.lang.RuntimeException: org.apache.accumulo.core.client.impl.ThriftScanner$ScanTimedOutException
    at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:161)
    at org.apache.accumulo.server.problems.ProblemReports$3.hasNext(ProblemReports.java:252)
    at org.apache.accumulo.server.problems.ProblemReports.summarize(ProblemReports.java:310)
    at org.apache.accumulo.monitor.Monitor.fetchData(Monitor.java:346)
    at org.apache.accumulo.monitor.Monitor$1.run(Monitor.java:486)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.accumulo.core.client.impl.ThriftScanner$ScanTimedOutException
    at org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:230)
    at org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:80)
    at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:151)
    ... 6 more
2016-08-22 07:43:23,510 [impl.ThriftScanner] DEBUG: Failed to locate tablet for table : !0 row : ~err_
2016-08-22 07:43:26,533 [monitor.Monitor] INFO : Failed to obtain problem reports
java.lang.RuntimeException: org.apache.accumulo.core.client.impl.ThriftScanner$ScanTimedOutException
    at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:161)
    at org.apache.accumulo.server.problems.ProblemReports$3.hasNext(ProblemReports.java:252)
    at org.apache.accumulo.server.problems.ProblemReports.summarize(ProblemReports.java:310)
    at org.apache.accumulo.monitor.Monitor.fetchData(Monitor.java:346)
    at org.apache.accumulo.monitor.Monitor$1.run(Monitor.java:486)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.accumulo.core.client.impl.ThriftScanner$ScanTimedOutException
    at org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:230)
    at org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:80)
    at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:151)
    ... 6 more
After stopping Accumulo, the alternating memory usage was gone.
The cluster is idle; nobody is using it.
I have attached all debug log files from after a restart of Accumulo.
Could anyone assist?
🙂 Klaus
Created 08-22-2016 01:53 PM
Hi Klaus,
2016-08-22 07:43:14,841 [impl.ThriftScanner] DEBUG: Failed to locate tablet for table : !0 row : ~err_
This message is telling you that the Monitor is trying to read the row "~err_" from the accumulo.metadata table but cannot find where the tablet containing that row is hosted. This likely means that the accumulo.metadata tablet containing this row is not assigned to any TabletServer.
2016-08-22 07:43:26,533 [monitor.Monitor] INFO : Failed to obtain problem reports
java.lang.RuntimeException: org.apache.accumulo.core.client.impl.ThriftScanner$ScanTimedOutException
    at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:161)
    at org.apache.accumulo.server.problems.ProblemReports$3.hasNext(ProblemReports.java:252)
    at org.apache.accumulo.server.problems.ProblemReports.summarize(ProblemReports.java:310)
    at org.apache.accumulo.monitor.Monitor.fetchData(Monitor.java:346)
    at org.apache.accumulo.monitor.Monitor$1.run(Monitor.java:486)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.accumulo.core.client.impl.ThriftScanner$ScanTimedOutException
    at org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:230)
    at org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:80)
    at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:151)
    ... 6 more
I believe this timeout is happening because of the previously mentioned failure to locate the tablet.
I would check the Accumulo monitor page and see if you have any unassigned tablets (you should be able to see a red number). There would likely be accompanying log entries on the "Recent Logs" page of the monitor that tell you why the tablet is not being assigned. If you see no obvious errors, I would recommend restarting the Accumulo Master.
Regarding the "alternating memory usage", I don't have any explanation for that off the top of my head.
Created on 08-23-2016 08:18 AM - edited 08-19-2019 04:10 AM
Hello Josh,
Thanks for your quick reply. I thought the peaks in memory usage had something to do with the table issue.
On the Accumulo monitor page I see now:
In recent logs I see only this warning:
[fs.VolumeManagerImpl] WARN : dfs.datanode.synconclose set to false in hdfs-site.xml: data loss is possible on hard system reset or power loss
After a restart I see:
2016-08-23 09:19:30,318 [replication.WorkDriver] DEBUG: Sleeping 30000 ms before next work assignment
2016-08-23 09:19:36,776 [master.Master] DEBUG: Finished gathering information from 1 servers in 0.00 seconds
2016-08-23 09:19:36,776 [master.Master] DEBUG: not balancing because there are unhosted tablets: 1
2016-08-23 09:19:43,087 [recovery.RecoveryManager] DEBUG: Unable to initate log sort for hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9: java.io.FileNotFoundException: File does not exist: /apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2835)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:733)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:663)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
2016-08-23 09:19:43,611 [state.ZooTabletStateStore] DEBUG: root tablet logSet [hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9]
2016-08-23 09:19:43,611 [state.ZooTabletStateStore] DEBUG: Returning root tablet state: +r<<@(null,de-hd-cluster.data-node3.com:9997[25669407cc8000b],de-hd-cluster.data-node3.com:9997[25669407cc8000b])
2016-08-23 09:19:43,611 [recovery.RecoveryManager] DEBUG: Recovering hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9 to hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/recovery/91ece971-7485-4acf-aa7f-dcde00fafce9
2016-08-23 09:19:43,614 [conf.AccumuloConfiguration] INFO : Loaded class : org.apache.accumulo.server.master.recovery.HadoopLogCloser
2016-08-23 09:19:43,615 [recovery.RecoveryManager] INFO : Starting recovery of hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9 (in : 300s), tablet +r<< holds a reference
2016-08-23 09:19:43,615 [master.Master] DEBUG: [Root Table]: scan time 0.00 seconds
2016-08-23 09:19:43,615 [master.Master] DEBUG: [Root Table] sleeping for 60.00 seconds
2016-08-23 09:19:46,779 [master.Master] DEBUG: Finished gathering information from 1 servers in 0.00 seconds
2016-08-23 09:19:46,779 [master.Master] DEBUG: not balancing because there are unhosted tablets: 1
2016-08-23 09:19:56,782 [master.Master] DEBUG: Finished gathering information from 1 servers in 0.00 seconds
2016-08-23 09:19:56,782 [master.Master] DEBUG: not balancing because there are unhosted tablets: 1
2016-08-23 09:20:00,318 [replication.WorkDriver] DEBUG: Sleeping 30000 ms before next work assignment
2016-08-23 09:20:06,785 [master.Master] DEBUG: Finished gathering information from 1 servers in 0.00 seconds
2016-08-23 09:20:06,785 [master.Master] DEBUG: not balancing because there are unhosted tablets: 1
2016-08-23 09:20:16,788 [master.Master] DEBUG: Finished gathering information from 1 servers in 0.00 seconds
2016-08-23 09:20:16,788 [master.Master] DEBUG: not balancing because there are unhosted tablets: 1

2016-08-23 09:24:44,144 [conf.AccumuloConfiguration] INFO : Loaded class : org.apache.accumulo.server.master.recovery.HadoopLogCloser
2016-08-23 09:24:44,144 [recovery.RecoveryManager] INFO : Starting recovery of hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9 (in : 300s), tablet +r<< holds a reference
Here are the table directories in HDFS:
root@NameNode:~# hadoop fs -ls -R /apps/accumulo/data/tables/
drwxr-xr-x   - accumulo hdfs       0 2016-04-19 14:16 /apps/accumulo/data/tables/!0
drwxr-xr-x   - accumulo hdfs       0 2016-08-08 13:33 /apps/accumulo/data/tables/!0/default_tablet
-rw-r--r--   3 accumulo hdfs     871 2016-08-08 13:33 /apps/accumulo/data/tables/!0/default_tablet/F0002flt.rf
drwxr-xr-x   - accumulo hdfs       0 2016-08-10 10:57 /apps/accumulo/data/tables/!0/table_info
-rw-r--r--   3 accumulo hdfs     933 2016-08-08 10:14 /apps/accumulo/data/tables/!0/table_info/A0002bqu.rf
-rw-r--r--   3 accumulo hdfs     933 2016-08-08 10:19 /apps/accumulo/data/tables/!0/table_info/A0002bqx.rf
-rw-r--r--   3 accumulo hdfs     122 2016-08-10 10:57 /apps/accumulo/data/tables/!0/table_info/A004gpfm.rf_tmp
-rw-r--r--   3 accumulo hdfs     688 2016-08-08 13:33 /apps/accumulo/data/tables/!0/table_info/F0002fl0.rf
drwxr-xr-x   - accumulo hdfs       0 2016-04-19 14:16 /apps/accumulo/data/tables/+r
drwxr-xr-x   - accumulo hdfs       0 2016-08-10 10:57 /apps/accumulo/data/tables/+r/root_tablet
-rw-r--r--   3 accumulo hdfs     974 2016-08-08 10:19 /apps/accumulo/data/tables/+r/root_tablet/A0002bqz.rf
-rw-r--r--   3 accumulo hdfs      16 2016-08-10 10:57 /apps/accumulo/data/tables/+r/root_tablet/A004gpfl.rf_tmp
-rw-r--r--   3 accumulo hdfs     754 2016-08-10 10:13 /apps/accumulo/data/tables/+r/root_tablet/C004eodm.rf
-rw-r--r--   3 accumulo hdfs     364 2016-08-10 10:18 /apps/accumulo/data/tables/+r/root_tablet/F004ew4v.rf
-rw-r--r--   3 accumulo hdfs     364 2016-08-10 10:29 /apps/accumulo/data/tables/+r/root_tablet/F004fdch.rf
-rw-r--r--   3 accumulo hdfs     364 2016-08-10 10:34 /apps/accumulo/data/tables/+r/root_tablet/F004fn1f.rf
-rw-r--r--   3 accumulo hdfs     364 2016-08-10 10:39 /apps/accumulo/data/tables/+r/root_tablet/F004ftix.rf
-rw-r--r--   3 accumulo hdfs     364 2016-08-10 10:44 /apps/accumulo/data/tables/+r/root_tablet/F004g3af.rf
-rw-r--r--   3 accumulo hdfs     364 2016-08-10 10:54 /apps/accumulo/data/tables/+r/root_tablet/F004glat.rf
drwxr-xr-x   - accumulo hdfs       0 2016-04-19 14:16 /apps/accumulo/data/tables/+rep
drwxr-xr-x   - accumulo hdfs       0 2016-04-19 14:16 /apps/accumulo/data/tables/+rep/default_tablet
drwxr-xr-x   - accumulo hdfs       0 2016-04-19 14:18 /apps/accumulo/data/tables/1
drwxr-xr-x   - accumulo hdfs       0 2016-08-10 10:57 /apps/accumulo/data/tables/1/default_tablet
-rw-r--r--   3 accumulo hdfs 2524936 2016-07-23 23:11 /apps/accumulo/data/tables/1/default_tablet/A0002041.rf
-rw-r--r--   3 accumulo hdfs 1502864 2016-07-29 11:17 /apps/accumulo/data/tables/1/default_tablet/C00024ci.rf
-rw-r--r--   3 accumulo hdfs  899175 2016-08-03 18:50 /apps/accumulo/data/tables/1/default_tablet/C00028be.rf
-rw-r--r--   3 accumulo hdfs 1428721 2016-08-07 13:21 /apps/accumulo/data/tables/1/default_tablet/C0002av5.rf
-rw-r--r--   3 accumulo hdfs  211245 2016-08-08 05:11 /apps/accumulo/data/tables/1/default_tablet/C0002bj6.rf
-rw-r--r--   3 accumulo hdfs   30474 2016-08-08 07:42 /apps/accumulo/data/tables/1/default_tablet/C0002bn1.rf
-rw-r--r--   3 accumulo hdfs   50286 2016-08-08 10:03 /apps/accumulo/data/tables/1/default_tablet/C0002bqh.rf
-rw-r--r--   3 accumulo hdfs     122 2016-08-10 10:57 /apps/accumulo/data/tables/1/default_tablet/C004gpfk.rf_tmp
-rw-r--r--   3 accumulo hdfs     905 2016-08-08 13:28 /apps/accumulo/data/tables/1/default_tablet/F0002byb.rf
The command:
root@hdp-accumulo-instance> scan -np -t accumulo.root
hangs.
Do you know how I can get rid of this table?
🙂 Klaus
Created 08-23-2016 11:18 AM
Additionally, I ran the following.
The table IDs:
tables -l
accumulo.metadata    => !0
accumulo.replication => +rep
accumulo.root        => +r
trace                => 1
checkTablets, where the scan gets stuck:
/usr/bin/accumulo admin checkTablets
2016-08-23 12:19:18,521 [fs.VolumeManagerImpl] WARN : dfs.datanode.synconclose set to false in hdfs-site.xml: data loss is possible on hard system reset or power loss
*** Looking for offline tablets ***
Scanning zookeeper
+r<<@(null,de-hd-cluster.data-node3.com:9997[25669407cc8000b],de-hd-cluster.data-node3.com:9997[25669407cc8000b]) is ASSIGNED_TO_DEAD_SERVER #walogs:1
*** Looking for missing files ***
Scanning : accumulo.root (-inf,~ : [] 9223372036854775807 false)
The master stats report:
/usr/bin/accumulo org.apache.accumulo.test.GetMasterStats
2016-08-23 11:15:21,623 [fs.VolumeManagerImpl] WARN : dfs.datanode.synconclose set to false in hdfs-site.xml: data loss is possible on hard system reset or power loss
State: NORMAL
Goal State: NORMAL
Unassigned tablets: 1
Dead tablet servers count: 0
Tablet Servers
    Name: de-hd-cluster.data-node3.com:9997
    Ingest: 0.00
    Last Contact: 1471943720583
    OS Load Average: 0.12
    Queries: 0.00
    Time Difference: 1.3
    Total Records: 0
    Lookups: 0
    Recoveries: 0
🙂 Klaus
Created 08-23-2016 02:30 PM
2016-08-23 09:19:43,087 [recovery.RecoveryManager] DEBUG: Unable to initate log sort for hdfs://de-hd-cluster.name-node.com:8020/apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9: java.io.FileNotFoundException: File does not exist: /apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9
This error is the reason all of your tables are offline. You are missing a WAL file. Did you delete this file by hand? Is HDFS healthy (not missing any blocks)? If you did not do anything, you can grep for the file name "91ece971-7485-4acf-aa7f-dcde00fafce9" over the Accumulo Master and GarbageCollector log files to see if either of these services reports (incorrectly) deleting this file.
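The two checks above (is the WAL really gone, and did the Master or GarbageCollector log its deletion) could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, the default log directory /var/log/accumulo is an assumption about your installation, and the master_* and gc_* file name patterns are guessed from the monitor_* log name shown earlier in this thread.

```shell
#!/usr/bin/env bash
# Sketch only. The function names and the default log directory are
# assumptions for illustration; they are not part of Accumulo or HDFS.

# Succeeds (exit code 0) if the given path exists in HDFS.
wal_exists() {
  hdfs dfs -test -e "$1"
}

# Prints every line in the Master and GarbageCollector logs that mentions
# the given WAL file name. Assumes the log files are named master_* and gc_*.
find_wal_mentions() {
  local wal_name="$1"
  local log_dir="${2:-/var/log/accumulo}"
  # -r recurse, -n show line numbers, -F treat the name as a fixed string
  grep -rnF --include='master_*' --include='gc_*' -- "$wal_name" "$log_dir"
}
```

For example, "wal_exists /apps/accumulo/data/wal/de-hd-cluster.data-node3.com+9997/91ece971-7485-4acf-aa7f-dcde00fafce9 || echo missing" confirms the file is absent, and "find_wal_mentions 91ece971-7485-4acf-aa7f-dcde00fafce9" lists any log lines that name it. Running "hdfs fsck /apps/accumulo/data" separately should report missing or corrupt blocks under the Accumulo directory.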
Because this WAL is missing, the accumulo.root table cannot be brought online (Accumulo knows that it would be missing data). These are system tables; you cannot just delete them.
Your only option, as I presently see it, is to re-initialize. It appears that you have no other data in the system, which I assume makes this a reasonable approach.
First, stop Accumulo via Ambari, and then in a shell as root from the node where the Accumulo Master is installed:
# su - accumulo
$ hdfs dfs -rmr /apps/accumulo/data/*
$ ACCUMULO_CONF_DIR=/etc/accumulo/conf/secure accumulo init
This will completely remove all Accumulo data and then re-initialize the system. Then, restart Accumulo via Ambari. The system should be in the same state as after your original installation.
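After the restart, one way to confirm that nothing is left unassigned is to pull the "Unassigned tablets" count out of the GetMasterStats output posted earlier in this thread. A minimal sketch, assuming the output keeps that "Unassigned tablets: N" wording; the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Sketch only: unassigned_tablets is a hypothetical helper name.
# Reads GetMasterStats output on stdin and prints the number that follows
# "Unassigned tablets:" (first occurrence only).
unassigned_tablets() {
  sed -n 's/^.*Unassigned tablets:[[:space:]]*\([0-9][0-9]*\).*$/\1/p' | head -n 1
}
```

It would be used as "/usr/bin/accumulo org.apache.accumulo.test.GetMasterStats | unassigned_tablets". On a healthy, freshly initialized system I would expect this to print 0.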
Created 08-24-2016 09:39 AM
On my cluster, this is the command that works:
ACCUMULO_CONF_DIR=/etc/accumulo/conf/server accumulo init
After the init, no further issues were found.
Many thanks for your detailed help.
🙂 Klaus