Created on 01-24-2018 08:06 PM - edited 08-17-2019 09:46 PM
In our Ambari cluster we have a very strange problem.
We restarted all the master servers - master01 to master03 -
and on each master server we started the services from the beginning, in the correct order:
first, on all masters, we started the ZooKeeper server,
then, on all masters, we started the JournalNode.
But we noticed that on the last master machine the JournalNode keeps restarting every 10-20 seconds,
while on all the other machines the JournalNode is stable.
Please advise - why does this happen?
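A quick way to confirm the restart loop on the affected master is to watch the JournalNode process id - if the PID keeps changing every few seconds, the process really is being stopped and started. This is only a sketch; the pgrep pattern assumes the JVM command line contains the JournalNode class name:
watch -n 5 "pgrep -f 'qjournal.server.JournalNode' | xargs -r ps -o pid,etime,cmd"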
Created 01-24-2018 09:58 PM
As you are getting the error:
ERROR namenode.NameNode (NameNode.java:main(1774)) - Failed to start namenode. java.io.FileNotFoundException: No valid image files found
.
So can you please check whether the following directory has any fsimage in it or not? Also check whether the fsimage files have the proper read permissions, as shown below.
Example:
# ls -l /hadoop/hdfs/namenode/current/fsimage*
-rw-r--r--. 1 hdfs hadoop 195873 Jan 22 20:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002711213
-rw-r--r--. 1 hdfs hadoop 62 Jan 22 20:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002711213.md5
-rw-r--r--. 1 hdfs hadoop 195873 Jan 23 02:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002718519
-rw-r--r--. 1 hdfs hadoop 62 Jan 23 02:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002718519.md5
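If the fsimage files are present, an additional sanity check is to verify that the newest image matches its checksum companion. This is only a sketch and assumes the .md5 file is in the usual md5sum-compatible format written by HDFS:
# cd /hadoop/hdfs/namenode/current
# md5sum -c fsimage_0000000000002718519.md5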
.
Created 01-24-2018 08:21 PM
@Michael Bronson can you share the log file? It may tell you the root cause.
Created 01-24-2018 10:40 PM
Anyway, something here isn't logical - how can we accept that if fsimage_0000000000000000000.md5 is deleted, the whole cluster will be lost?
Created 01-24-2018 10:43 PM
@JAY, what do we need to verify as the hdfs user if we want to find out who removed the files?
Created 01-24-2018 09:18 PM
Can you please share the content of the following log files so that we can see the cause of the failure? The latest log of the JournalNode process that keeps restarting.
# ls -lart /var/log/hadoop/hdfs/hadoop-hdfs-journalnode*.log
# ls -lart /var/log/hadoop/hdfs/hadoop-hdfs-journalnode*.out
.
Created 01-24-2018 09:22 PM
grep -i error /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-master03.sys57.com.log
2018-01-24 19:05:38,295 ERROR server.JournalNode (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
2018-01-24 19:53:09,054 ERROR server.JournalNode (LogAdapter.java:error(69)) - RECEIVED SIGNAL 15: SIGTERM
2018-01-24 21:03:05,331 WARN ipc.Server (Server.java:processResponse(1273)) - IPC Server handler 1 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.format from :49400 Call#4 Retry#0: output error
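The repeated "RECEIVED SIGNAL 15: SIGTERM" entries indicate the JournalNode is being stopped by something external rather than crashing on its own. One hedged way to check whether the Ambari agent on that host is the one issuing the stop/start commands around those timestamps (the log path assumes a default ambari-agent installation):
grep -i 'journalnode' /var/log/ambari-agent/ambari-agent.log | tail -n 20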
Created 01-24-2018 09:25 PM
grep -i war /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-master03.sys573.com.log
2018-01-24 19:03:41,115 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://master02.sys573.com:6188/ws/v1/timeline/metrics
2018-01-24 19:28:08,819 WARN mortbay.log (Slf4jLog.java:warn(76)) - Can't reuse /tmp/Jetty_0_0_0_0_8480_journal____.8g4awa, using /tmp/Jetty_0_0_0_0_8480_journal____.8g4awa_661614759039131704
2018-01-24 19:29:18,310 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 19:30:28,393 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 19:56:39,690 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 19:59:38,233 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 20:29:58,228 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 20:37:29,599 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 21:00:28,236 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 21:03:05,331 WARN ipc.Server (Server.java:processResponse(1273)) - IPC Server handler 1 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.format from :49400 Call#4 Retry#0: output error
2018-01-24 21:16:59,654 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
2018-01-24 21:21:08,278 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
Created 01-24-2018 09:34 PM
Try restarting the Ambari Metrics service and check whether the error goes away.
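The restart is easiest from the Ambari UI (Ambari Metrics --> Service Actions --> Restart All). If you prefer the command line, here is a rough sketch for a standard HDP layout - the script path, config directory, and "ams" service user are assumptions, so adjust them to your installation:
# on the AMS collector host
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf stop'
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf start'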
Created 01-24-2018 09:42 PM
We see a "null:6188" hostname for the AMS collector in the JournalNode output:
WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(349)) - Unable to send metrics to collector by address:http://null:6188/ws/v1/timeline/metrics
So I guess the following file somehow does not have the correct AMS collector host reflected in it.
# grep 'journal' /etc/hadoop/conf/hadoop-metrics2.properties
journalnode.sink.timeline.collector.hosts=amshost.example.co
.
Do you have any issue starting the AMS collector as well?
Can you please check the following path in the Ambari UI?
Ambari UI --> HDFS --> Configs --> Advanced --> "Advanced hadoop-metrics2.properties" --> "hadoop-metrics2.properties template"
Does it have the following section, or is it missing somehow?
datanode.sink.timeline.collector.hosts={{ams_collector_hosts}}
resourcemanager.sink.timeline.collector.hosts={{ams_collector_hosts}}
nodemanager.sink.timeline.collector.hosts={{ams_collector_hosts}}
jobhistoryserver.sink.timeline.collector.hosts={{ams_collector_hosts}}
journalnode.sink.timeline.collector.hosts={{ams_collector_hosts}}
maptask.sink.timeline.collector.hosts={{ams_collector_hosts}}
reducetask.sink.timeline.collector.hosts={{ams_collector_hosts}}
applicationhistoryserver.sink.timeline.collector.hosts={{ams_collector_hosts}}
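After the template is saved and the affected services restarted, the {{ams_collector_hosts}} placeholder should render to a real hostname on each master. A hedged way to verify the rendered file and to confirm the collector endpoint is reachable - replace <ams-collector-host> with your actual AMS collector host - is:
# grep 'collector.hosts' /etc/hadoop/conf/hadoop-metrics2.properties
# curl -s -o /dev/null -w '%{http_code}\n' http://<ams-collector-host>:6188/ws/v1/timeline/metrics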
.
Created 01-24-2018 09:38 PM
OK, I will. I need to wait a couple of minutes to see whether it restarts as in the previous case.