Created on 01-24-2018 08:06 PM - edited 08-17-2019 09:46 PM
In our Ambari cluster we have a very strange problem.
We restarted all the master servers - master01 to master03 -
and on each master server we started the services from the beginning, in the correct order:
first, on all masters, we started the ZooKeeper server,
then, on all masters, we started the JournalNode.
But we noticed that on the last master machine the JournalNode keeps restarting every 10-20 seconds,
while on all the other machines the JournalNode is stable.
Please advise - why does this happen?
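A quick way to confirm the restart loop on the affected master is to watch the JournalNode process id - if the PID keeps changing every few seconds, the process really is being stopped and started. This is only a sketch; the pgrep pattern assumes the JVM command line contains the JournalNode class name:
watch -n 5 "pgrep -f 'qjournal.server.JournalNode' | xargs -r ps -o pid,etime,cmd"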
Created 01-24-2018 09:58 PM
As you are getting the error:
ERROR namenode.NameNode (NameNode.java:main(1774)) - Failed to start namenode. java.io.FileNotFoundException: No valid image files found
.
So can you please check whether the following directory has any fsimage in it or not? Also check whether the fsimage files have the proper read permissions, as shown below.
Example:
# ls -l /hadoop/hdfs/namenode/current/fsimage*
-rw-r--r--. 1 hdfs hadoop 195873 Jan 22 20:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002711213
-rw-r--r--. 1 hdfs hadoop 62 Jan 22 20:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002711213.md5
-rw-r--r--. 1 hdfs hadoop 195873 Jan 23 02:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002718519
-rw-r--r--. 1 hdfs hadoop 62 Jan 23 02:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002718519.md5
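If the fsimage files are present, an additional sanity check is to verify that the newest image matches its checksum companion. This is only a sketch and assumes the .md5 file is in the usual md5sum-compatible format written by HDFS:
# cd /hadoop/hdfs/namenode/current
# md5sum -c fsimage_0000000000002718519.md5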
.
Created 01-24-2018 08:21 PM
@Michael Bronson can you share the log file? It may tell you the root cause.
Created 01-24-2018 10:40 PM
Anyway, something here isn't logical - how can we accept that if fsimage_0000000000000000000.md5 is deleted, the whole cluster will be lost?
Created 01-24-2018 10:43 PM
@JAY, what do we need to verify as the hdfs user if we want to find out who removed the files?
Created 01-24-2018 09:18 PM
Can you please share the content of the following log files so that we can see the cause of the failure? The latest log of the JournalNode process that keeps restarting.
# ls -lart /var/log/hadoop/hdfs/hadoop-hdfs-journalnode*.log
# ls -lart /var/log/hadoop/hdfs/hadoop-hdfs-journalnode*.out
.
Created 01-24-2018 09:22 PM
grep -i error /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-master03.sys57.com.log
2018-01-24 19:05:38,295 ERROR server.JournalNode (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
2018-01-24 19:53:09,054 ERROR server.JournalNode (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
2018-01-24 21:03:05,331 WARN ipc.Server (Server.java:processResponse(1273)) - IPC Server handler 1 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.format from :49400 Call#4 Retry#0: output error
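The repeated "RECEIVED SIGNAL 15: SIGTERM" entries indicate the JournalNode is being stopped by something external rather than crashing on its own. One hedged way to check whether the Ambari agent on that host is the one issuing the stop/start commands around those timestamps (the log path assumes a default ambari-agent installation):
grep -i 'journalnode' /var/log/ambari-agent/ambari-agent.log | tail -n 20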
Created 01-24-2018 09:25 PM
grep -i war /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-master03.sys573.com.log
2018-01-24 19:03:41,115 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://master02.sys573.com:6188/ws/v1/timeline/metrics
2018-01-24 19:28:08,819 WARN mortbay.log (Slf4jLog.java:warn(76)) - Can't reuse /tmp/Jetty_0_0_0_0_8480_journal____.8g4awa, using /tmp/Jetty_0_0_0_0_8480_journal____.8g4awa_661614759039131704
2018-01-24 19:29:18,310 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 19:30:28,393 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 19:56:39,690 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 19:59:38,233 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 20:29:58,228 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 20:37:29,599 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 21:00:28,236 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 21:03:05,331 WARN ipc.Server (Server.java:processResponse(1273)) - IPC Server handler 1 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.format from :49400 Call#4 Retry#0: output error
2018-01-24 21:16:59,654 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 21:21:08,278 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
Created 01-24-2018 09:34 PM
Try restarting the Ambari Metrics service and check whether the error goes away.
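The restart is easiest from the Ambari UI (Ambari Metrics --> Service Actions --> Restart All). If you prefer the command line, here is a rough sketch for a standard HDP layout - the script path, config directory, and "ams" service user are assumptions, so adjust them to your installation:
# on the AMS collector host
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf stop'
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf start'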
Created 01-24-2018 09:42 PM
We see a "null:6188" hostname for the AMS collector in the JournalNode output:
WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
So I guess the following file somehow does not have the correct AMS collector host reflected in it.
# grep 'journal' /etc/hadoop/conf/hadoop-metrics2.properties
journalnode.sink.timeline.collector.hosts=amshost.example.co
.
Do you have any issue starting the AMS collector as well?
Can you please check the following path in the Ambari UI?
Ambari UI --> HDFS --> Configs --> Advanced --> "Advanced hadoop-metrics2.properties" --> "hadoop-metrics2.properties template"
Does it have the following section, or is it missing somehow?
datanode.sink.timeline.collector.hosts={{ams_collector_hosts}}
resourcemanager.sink.timeline.collector.hosts={{ams_collector_hosts}}
nodemanager.sink.timeline.collector.hosts={{ams_collector_hosts}}
jobhistoryserver.sink.timeline.collector.hosts={{ams_collector_hosts}}
journalnode.sink.timeline.collector.hosts={{ams_collector_hosts}}
maptask.sink.timeline.collector.hosts={{ams_collector_hosts}}
reducetask.sink.timeline.collector.hosts={{ams_collector_hosts}}
applicationhistoryserver.sink.timeline.collector.hosts={{ams_collector_hosts}}
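After the template is saved and the affected services restarted, the {{ams_collector_hosts}} placeholder should render to a real hostname on each master. A hedged way to verify the rendered file and to confirm the collector endpoint is reachable - replace <ams-collector-host> with your actual AMS collector host - is:
# grep 'collector.hosts' /etc/hadoop/conf/hadoop-metrics2.properties
# curl -s -o /dev/null -w '%{http_code}\n' http://<ams-collector-host>:6188/ws/v1/timeline/metrics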
.
Created 01-24-2018 09:38 PM
OK, I will. I need to wait a couple of minutes to see whether it restarts as in the previous case.