JournalNode ( HDFS ) restarting all the time

In our Ambari cluster we have a very strange problem.

We restarted all the master servers (master01-03), and on each master we started the services from scratch in the correct order:

First, we started the ZooKeeper server on all masters.

Then we started the JournalNode on all masters.

But we noticed that on the last master machine the JournalNode restarts every 10-20 seconds, while on all the other machines the JournalNode is stable.

Please advise: why does this happen?

58382-capture.png

58381-capture.png

Michael-Bronson
1 ACCEPTED SOLUTION

Master Mentor

@Michael Bronson

As you are getting the error:

ERROR namenode.NameNode (NameNode.java:main(1774)) - Failed to start namenode.
java.io.FileNotFoundException: No valid image files found

So can you please check whether the following directory contains any fsimage files, and whether those files have proper read permissions, as in the example below?

Example:

# ls -l /hadoop/hdfs/namenode/current/fsimage*

-rw-r--r--. 1 hdfs hadoop 195873 Jan 22 20:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002711213
-rw-r--r--. 1 hdfs hadoop     62 Jan 22 20:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002711213.md5
-rw-r--r--. 1 hdfs hadoop 195873 Jan 23 02:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002718519
-rw-r--r--. 1 hdfs hadoop     62 Jan 23 02:05 /hadoop/hdfs/namenode/current/fsimage_0000000000002718519.md5
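
The same check can be scripted. This is only a minimal sketch: it assumes the NameNode metadata directory is /hadoop/hdfs/namenode as in the listing above (on your cluster, use whatever dfs.namenode.name.dir points to), and the function name check_fsimage is hypothetical.

```shell
# Hypothetical helper: report whether a NameNode metadata directory
# contains fsimage files and whether each one is readable and non-empty.
check_fsimage() {
    local name_dir
    local found
    local f
    name_dir="$1"
    found=0
    for f in "$name_dir"/current/fsimage_*; do
        # If the glob matches nothing, the literal pattern is returned; skip it.
        [ -e "$f" ] || continue
        # Skip the .md5 checksum files, only the images themselves matter here.
        case "$f" in *.md5) continue ;; esac
        found=1
        if [ -r "$f" ] && [ -s "$f" ]; then
            echo "OK: $f"
        else
            echo "PROBLEM (unreadable or empty): $f"
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "No fsimage files found in $name_dir/current"
        return 1
    fi
}
```

Call it as `check_fsimage /hadoop/hdfs/namenode`, ideally as the hdfs user, since the readability test reflects the permissions of whoever runs it (root can read everything).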

24 REPLIES

Master Mentor

@Michael Bronson

You can grep for the word "fsimage" or "current"; if there is any entry for a file deletion, it might be logged there.
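
A minimal sketch of that search, wrapped in a hypothetical helper named search_fsimage_mentions. Which log files to pass in is up to you; nothing here assumes a particular log location.

```shell
# Hypothetical helper: search the given log files for lines mentioning
# "fsimage" or "current", printing the file name and line number of each hit.
search_fsimage_mentions() {
    # -H: always print the file name, -n: print line numbers,
    # -E: extended regex, so that | means "or".
    grep -HnE 'fsimage|current' "$@"
}
```

For example, `search_fsimage_mentions /path/to/some.log /path/to/other.log`. The word "current" is common, so expect noise; narrowing the pattern to the full directory path may help.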

Also, it looks like your "fsimage" file is missing, so I do not have any way to fix this issue right now unless there is an fsimage backup stored somewhere that we can restore. (Otherwise we are in a data loss situation.)

Is there any mount point or filesystem issue that could have caused the partition holding the fsimage to go away (disappear)? Maybe you can check with your storage team to find out what happened to that file.
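
A quick way to start on the mount point question is to see which filesystem currently backs the NameNode directory. This is a minimal sketch with a hypothetical helper named show_backing_fs; the path /hadoop/hdfs/namenode in the usage note is taken from the earlier example and may differ on your cluster.

```shell
# Hypothetical helper: show which filesystem backs a directory,
# or report that the directory is missing.
show_backing_fs() {
    local dir
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "Directory does not exist: $dir"
        return 1
    fi
    # df -P prints a header plus one POSIX-format line with the device,
    # sizes, and mount point of the filesystem that holds the directory.
    df -P "$dir"
}
```

Run `show_backing_fs /hadoop/hdfs/namenode`. If the last column shows `/` where you expected a dedicated disk, the partition may not have been mounted after the reboot, and the directory you are looking at would then be an empty one on the root filesystem.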

@Jay is it possible to take an fsimage file from another cluster and use it on the problematic cluster?

Michael-Bronson

Master Mentor

@Michael Bronson

The fsimage file of one cluster can not be used on another cluster.

So I really do not understand who could have deleted them. Only root can delete them, or maybe the hdfs user? Is there any way to recover them?

Michael-Bronson

As root, I ran this:

grep fsimage /var/log/audit/audit.log

but there were no results!
Michael-Bronson