Created 07-11-2019 09:24 AM
We are unable to start ambari-metrics-collector.
The following error appears in hbase-ams-master.log.
I cannot find any other ERROR; what should I check?
----------------
/var/log/ambari-metrics-collector/hbase-ams-master-host.log
2019-07-11 15:34:41,040 ERROR [main] master.HMasterCommandLine: Master exiting
java.lang.RuntimeException: Master not initialized after 200000ms
at org.apache.hadoop.hbase.util.JVMClusterUtil.waitForEvent(JVMClusterUtil.java:229)
at org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:197)
at org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:413)
at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:232)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:140)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:3100)
2019-07-11 15:34:41,043 INFO [shutdown-hook-0] regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@4a29f290
2019-07-11 15:34:41,044 INFO [shutdown-hook-0] regionserver.HRegionServer: ***** STOPPING region server 'areaportal-kvm07,61320,1562826676313' *****
2019-07-11 15:34:41,044 INFO [shutdown-hook-0] regionserver.HRegionServer: STOPPED: Shutdown hook
Created 07-11-2019 09:31 AM
The error snippet you posted is just an after-effect of the actual root cause and is a very generic message.
Can you please share the following logs for initial review?
/var/log/ambari-metrics-collector/ambari-metrics-collector.log
/var/log/ambari-metrics-collector/hbase-ams-master-xxxxxxxx.log
/var/log/ambari-metrics-collector/gc.log
/var/log/ambari-metrics-collector/collector-gc.log
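If it helps, something like this should bundle them up (the archive path /tmp/ams-logs.tar.gz is just an example, and the hbase-ams-master file name will include your own hostname):
# cd /var/log/ambari-metrics-collector
# tar czf /tmp/ams-logs.tar.gz ambari-metrics-collector.log hbase-ams-master-*.log gc.log* collector-gc.log*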
Also, AMS failures like this are most often caused by incorrect tuning or heavy load, so can you please let us know the following:
1. How many nodes are there in your cluster?
2. How much memory have you allocated to the AMS collector and the HMaster?
3. I guess you might be using the default embedded-mode AMS (not distributed)? The two modes require slightly different tuning.
Created 07-11-2019 11:38 AM
I have attached a log file.
The cluster has four nodes. Each node has 32GB of memory.
The memory is specified as follows.
Metrics Collector Heap Size: 512 MB
HBase Master Maximum Memory: 1408 MB
hbase_master_maxperm_size: 128 MB
HBase Master maximum value for Xmn: 1024 MB
HBase RegionServer Maximum Memory: 768 MB
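For reference, I also checked the heap flags of the running AMS processes on the collector host like this (the grep patterns are just a guess at how the processes are named on my host):
# ps -ef | egrep 'ams-hbase|ambari-metrics-collector' | grep -o 'Xmx[^ ]*'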
Created 07-11-2019 01:06 PM
In your "hbase-ams-master-kvm07log.txt" log we see the following message.
2019-07-11 19:11:58,731 INFO [Thread-23] wal.ProcedureWALFile: Opening file:/var/lib/ambari-metrics-collector/hbase/MasterProcWALs/pv2-00000000000000000001.log length=45336
2019-07-11 19:11:58,743 WARN [Thread-23] wal.WALProcedureStore: Unable to read tracker for file:/var/lib/ambari-metrics-collector/hbase/MasterProcWALs/pv2-00000000000000000001.log
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat$InvalidWALDataException: Invalid Trailer version. got 48 expected 1
at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readTrailer(ProcedureWALFormat.java:189)
It looks like the WAL data under "/var/lib/ambari-metrics-collector/hbase/MasterProcWALs/" got corrupted. You can list it with:
# ls -lart /var/lib/ambari-metrics-collector/hbase/MasterProcWALs/*
You can take a backup of the directory "/var/lib/ambari-metrics-collector/hbase/"
and then clean out the files inside "/var/lib/ambari-metrics-collector/hbase/MasterProcWALs/".
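For example, with the Metrics Collector stopped (the backup path /var/lib/ambari-metrics-collector/hbase-backup is only an example location):
# cp -rp /var/lib/ambari-metrics-collector/hbase /var/lib/ambari-metrics-collector/hbase-backup
# rm -f /var/lib/ambari-metrics-collector/hbase/MasterProcWALs/*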
Then perform a tmp dir cleanup. After taking a backup of "/var/lib/ambari-metrics-collector/hbase-tmp/",
remove the AMS ZooKeeper data by clearing the contents of 'hbase.tmp.dir'/zookeeper, and also remove any Phoenix spool files from the 'hbase.tmp.dir'/phoenix-spool folder.
"hbase.tmp.dir" (default value: /var/lib/ambari-metrics-collector/hbase-tmp) is on the local filesystem in both embedded and distributed mode:
# rm -fr /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/*
# rm -fr /var/lib/ambari-metrics-collector/hbase-tmp/phoenix-spool/*
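For the hbase-tmp backup mentioned above, something along these lines works (again, the backup path is only an example):
# cp -rp /var/lib/ambari-metrics-collector/hbase-tmp /var/lib/ambari-metrics-collector/hbase-tmp-backup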
Then try to restart the AMS.
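If you prefer the command line over the Ambari UI, the collector can usually be restarted on its host roughly like this (the script name and location may vary slightly between AMS versions):
# ambari-metrics-collector stop
# ambari-metrics-collector start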
It would also be better to increase the Metrics Collector Heap Size to 1024 MB and the HBase Master Maximum Memory to 2048 MB (or 4096 MB) if you repeatedly see similar issues.
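Assuming the standard AMS configuration names (worth verifying in your Ambari version), those two settings map roughly to the following properties in the Ambari UI under Ambari Metrics -> Configs:
metrics_collector_heapsize = 1024   (in ams-env)
hbase_master_heapsize = 2048        (in ams-hbase-env)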
Created 07-11-2019 01:28 PM
Created 07-11-2019 01:46 PM
Good to know that your issue is resolved. It would be great if you could mark this thread as answered by clicking the "Accept" button on the helpful answer.
Created 07-14-2019 05:06 PM
The above question was originally posted in the Community Help track. On Sun Jul 14 17:04 UTC 2019, a member of the HCC moderation staff moved it to the Cloud & Operations track. The Community Help Track is intended for questions about using the HCC site itself.