JournalNode ( HDFS ) restarting all the time


In our Ambari cluster we have a very strange problem.

We restarted all the master servers (master01-03), and on each master we started the services from the beginning in the correct order:

First, on all masters, we started the ZooKeeper server.

Then, on all masters, we started the JournalNode.

But we noticed that on the last master machine the JournalNode restarts every 10-20 seconds, while on all the other machines the JournalNode is stable.

Please advise: why does this happen?

[Attached screenshots: 58382-capture.png, 58381-capture.png]

Michael-Bronson
1 ACCEPTED SOLUTION

Master Mentor

@Michael Bronson

You are getting the following error:

ERROR namenode.NameNode (NameNode.java:main(1774)) - Failed to start namenode.
java.io.FileNotFoundException: No valid image files found

Can you please check whether the following directory contains any fsimage files, and whether those files have the proper read permissions, as in the listing below?

Example:

# ls -l /hadoop/hdfs/namenode/current/fsimage*

-rw-r--r--. 1 hdfs hadoop 195873 Jan 22 20:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002711213
-rw-r--r--. 1 hdfs hadoop     62 Jan 22 20:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002711213.md5
-rw-r--r--. 1 hdfs hadoop 195873 Jan 23 02:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002718519
-rw-r--r--. 1 hdfs hadoop     62 Jan 23 02:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002718519.md5
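
The check above can also be scripted. The following is only a minimal sketch: the function name `check_fsimage` is made up for illustration, and the directory is taken from the example listing rather than from any real configuration. The readability test applies to whichever user runs the script, so it is only meaningful when run as the user that starts the NameNode (`hdfs` in the listing).

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a NameNode "current" directory
# contains fsimage files and whether the current user can read them.
check_fsimage() {
    local dir="$1"
    local found=0
    local f
    for f in "$dir"/fsimage_*; do
        # Skip the unexpanded glob (no match) and the .md5 checksum files.
        [ -e "$f" ] || continue
        case "$f" in *.md5) continue ;; esac
        found=1
        if [ -r "$f" ]; then
            echo "OK readable: $f"
        else
            echo "NOT readable: $f"
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "No fsimage files found in $dir"
        return 1
    fi
    return 0
}

# Example call (path assumed from the listing above):
#   check_fsimage /hadoop/hdfs/namenode/current
```

A non-zero return code means no fsimage file was found at all, which would be consistent with the "No valid image files found" error.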


24 REPLIES

Contributor

@Michael Bronson can you share the log file? It may tell you the root cause.


Anyway, something here isn't logical: how can it be acceptable that if fsimage_0000000000000000000.md5 is deleted, the whole cluster is lost?

Michael-Bronson


@JAY what do we need to verify for the HDFS user if we want to find out who removed the files?

Michael-Bronson

Master Mentor

@Michael Bronson

Can you please share the content of the following log files so that we can see the cause of the failure? We need the latest log of the JournalNode process on the host where it keeps restarting.

# ls -lart /var/log/hadoop/hdfs/hadoop-hdfs-journalnode*.log
# ls -lart /var/log/hadoop/hdfs/hadoop-hdfs-journalnode*.out
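
To narrow this down before sharing, something like the sketch below may help. The helper names `newest_log` and `count_sigterm` are made up for illustration; the search string is the SIGTERM message format that a JournalNode writes to its log. SIGTERM is the default signal sent by `kill`, so a high count would suggest the process is being stopped from outside rather than crashing on its own, but that is only a hint.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the most recently modified file among the given paths.
newest_log() {
    ls -1t "$@" 2>/dev/null | head -n 1
}

# Hypothetical helper: count SIGTERM entries in one log file.
# "|| true" keeps the exit status at 0 when grep finds no match (it still prints 0).
count_sigterm() {
    grep -c 'RECEIVED SIGNAL 15: SIGTERM' "$1" || true
}

# Example usage (log location taken from the commands above):
#   log=$(newest_log /var/log/hadoop/hdfs/hadoop-hdfs-journalnode*.log)
#   count_sigterm "$log"
#   tail -n 100 "$log"
```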

 grep -i error /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-master03.sys57.com.log
2018-01-24 19:05:38,295 ERROR server.JournalNode (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
2018-01-24 19:53:09,054 ERROR server.JournalNode (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
2018-01-24 21:03:05,331 WARN  ipc.Server (Server.java:processResponse(1273)) - IPC Server handler 1 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.format from :49400 Call#4 Retry#0: output erro
Michael-Bronson

grep -i war  /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-master03.sys573.com.log
2018-01-24 19:03:41,115 WARN  timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://master02.sys573.com:6188/ws/v1/timeline/metrics
2018-01-24 19:28:08,819 WARN  mortbay.log (Slf4jLog.java:warn(76)) - Can't reuse /tmp/Jetty_0_0_0_0_8480_journal____.8g4awa, using /tmp/Jetty_0_0_0_0_8480_journal____.8g4awa_661614759039131704
2018-01-24 19:29:18,310 WARN  timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 19:30:28,393 WARN  timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 19:56:39,690 WARN  timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 19:59:38,233 WARN  timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 20:29:58,228 WARN  timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 20:37:29,599 WARN  timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 21:00:28,236 WARN  timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 21:03:05,331 WARN  ipc.Server (Server.java:processResponse(1273)) - IPC Server handler 1 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.format from :49400 Call#4 Retry#0: output error
2018-01-24 21:16:59,654 WARN  timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 21:21:08,278 WARN  timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
Michael-Bronson

Contributor

Try restarting the Ambari Metrics service and check whether the error goes away.

Master Mentor
@Michael Bronson

We can see the hostname "null:6188" for the AMS collector in the JournalNode output:

WARN  timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics

So I guess the following file somehow does not reflect the correct AMS collector host.

# grep 'journal' /etc/hadoop/conf/hadoop-metrics2.properties
journalnode.sink.timeline.collector.hosts=amshost.example.co
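
The same check can be extended to every component in that file. This is only a rough sketch: the function name `check_collector_hosts` is made up, and treating an empty value, the literal `null`, or an unrendered `{{...}}` placeholder as suspect is an assumption about what a bad value might look like, not a documented rule.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every "*.sink.timeline.collector.hosts" entry in a
# properties file and flag values that look wrong (assumption: empty, "null",
# or a template placeholder that was never rendered).
check_collector_hosts() {
    local file="$1"
    local key value
    grep -E '^[^#]*\.sink\.timeline\.collector\.hosts=' "$file" |
    while IFS='=' read -r key value; do
        case "$value" in
            ''|null|*'{{'*) echo "SUSPECT $key=$value" ;;
            *)              echo "OK $key=$value" ;;
        esac
    done
}

# Example call (path taken from the grep command above):
#   check_collector_hosts /etc/hadoop/conf/hadoop-metrics2.properties
```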

Do you also have any issue starting the AMS collector?

Can you please check the following path in the Ambari UI:

Ambari UI --> HDFS --> Configs --> Advanced --> "Advanced hadoop-metrics2.properties"  --> "hadoop-metrics2.properties template"


Does it have the following section, or is it missing somehow?

datanode.sink.timeline.collector.hosts={{ams_collector_hosts}}
resourcemanager.sink.timeline.collector.hosts={{ams_collector_hosts}}
nodemanager.sink.timeline.collector.hosts={{ams_collector_hosts}}
jobhistoryserver.sink.timeline.collector.hosts={{ams_collector_hosts}}
journalnode.sink.timeline.collector.hosts={{ams_collector_hosts}}
maptask.sink.timeline.collector.hosts={{ams_collector_hosts}}
reducetask.sink.timeline.collector.hosts={{ams_collector_hosts}}
applicationhistoryserver.sink.timeline.collector.hosts={{ams_collector_hosts}}
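
If you copy the template text from the Ambari UI into a local file, a sketch like the one below can report which of the entries above are missing. The function name `missing_collector_keys` and the idea of saving the template to a file are assumptions for illustration; the component list is simply the one shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the components whose
# "<component>.sink.timeline.collector.hosts" line is absent from the given file.
missing_collector_keys() {
    local file="$1"
    local component
    for component in datanode resourcemanager nodemanager jobhistoryserver \
                     journalnode maptask reducetask applicationhistoryserver; do
        if ! grep -q "^${component}\.sink\.timeline\.collector\.hosts=" "$file"; then
            echo "$component"
        fi
    done
}

# Example call (the file name is a placeholder for wherever you saved the template):
#   missing_collector_keys ./hadoop-metrics2.properties.template
```

No output means every listed entry is present in the saved template.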


OK, I will. I need to wait a couple of minutes to see whether it restarts as it did before.

Michael-Bronson