NameNode keeps going down

Hi all,

I am having a problem with the NameNode status that Ambari shows. The following points are verifiable in the system:

- The NameNode keeps going down a few seconds after I start it through Ambari (it looks like it never really comes up, but the start process runs successfully);

- Despite being DOWN according to Ambari, if I run jps on the server that hosts the NameNode, it shows that the service is running:

[hdfs@RHTPINEC008 ~]$ jps
39395 NameNode
4463 Jps

and I can access the NameNode UI properly;

- I already restarted both the NameNode and the ambari-agent manually, but the behavior stays the same;

- This problem started after some heavy HBase/Phoenix queries that caused the NameNode to go down (I am not sure this is actually related, but the exact same configuration was working well before this episode);

- I have been digging for some hours and cannot find error details in the NameNode logs or in the ambari-agent logs that would let me understand the problem.

I am using HDP 2.4.0 with no HA options.

Can someone help with this?

Thanks in advance

28 REPLIES 28

Can you please run

ps -ef | grep namenode

on the cluster and see which processes come back? It looks like there is a NameNode process already running, and when you try to start it again, it fails to start another one (which is the correct behavior).

I would recommend stopping all the processes returned by the above command and then starting the NameNode again.

Hi Namit,

Thank you for your answer.

Yes, I can run the command:

[nosuser@RHTPINEC008 ~]$ ps -ef | grep namenode
nosuser   7201  6867  0 16:01 pts/0    00:00:00 grep --color=auto namenode
hdfs     39395     1  5 May31 ?        04:01:49 /usr/jdk64/jdk1.8.0_60/bin/java -Dproc_namenode -Xmx1024m -Dhdp.version=2.4.0.0-169 -Djava.net.preferIPv4Stack=true -Dhdp.version= -Djava.net.preferIPv4Stack=true -Dhdp.version= -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop/hdfs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/hdp/2.4.0.0-169/hadoop -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/hdp/2.4.0.0-169/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.0.0-169/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhdp.version=2.4.0.0-169 -Dhadoop.log.dir=/var/log/hadoop/hdfs -Dhadoop.log.file=hadoop-hdfs-namenode-RHTPINEC008.corporativo.pt.log -Dhadoop.home.dir=/usr/hdp/2.4.0.0-169/hadoop -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=:/usr/hdp/2.4.0.0-169/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.0.0-169/hadoop/lib/native:/usr/hdp/2.4.0.0-169/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.4.0.0-169/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log -XX:NewSize=512m -XX:MaxNewSize=512m -Xloggc:/var/log/hadoop/hdfs/gc.log-201705311529 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms4096m -Xmx4096m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node" -Dorg.mortbay.jetty.Request.maxFormContentSize=-1 -server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log -XX:NewSize=512m -XX:MaxNewSize=512m -Xloggc:/var/log/hadoop/hdfs/gc.log-201705311529 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms4096m -Xmx4096m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT 
-XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node" -Dorg.mortbay.jetty.Request.maxFormContentSize=-1 -server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log -XX:NewSize=512m -XX:MaxNewSize=512m -Xloggc:/var/log/hadoop/hdfs/gc.log-201705311529 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms4096m -Xmx4096m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node" -Dorg.mortbay.jetty.Request.maxFormContentSize=-1 -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.NameNode

As you suggested, I killed the process:

[nosuser@RHTPINEC008 ~]$ sudo kill -9 39395

and started it again through Ambari, which took a while but ended successfully:

15970-namenode-start-issue.png

A few seconds later the NameNode went down again in the Ambari interface, however I am still able to run:

[hdfs@RHTPINEC008 ~]$ jps
13494 Jps
9832 NameNode

Any ideas?

Could it be the ambari server or agent having problems collecting namenode status?

Thanks
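
The question above, whether Ambari is misreading the NameNode status, can be narrowed down with a small check. This is only a sketch, and it rests on an assumption that is not confirmed in this thread: that the agent decides a component's state from a PID file rather than from jps output. The PID file path below is hypothetical and should be replaced with the one configured on the cluster.

```python
import os


def read_pid(pid_file):
    """Return the integer PID stored in pid_file, or None if it is missing or unreadable."""
    try:
        with open(pid_file) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def process_is_alive(pid):
    """Check for a process with this PID; signal 0 tests existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def pid_file_status(pid_file):
    """Describe whether the PID file points at a running process."""
    pid = read_pid(pid_file)
    if pid is None:
        return "pid file missing or unreadable"
    if not process_is_alive(pid):
        return "stale pid file: no process with PID %d" % pid
    return "pid file points at live process %d" % pid


if __name__ == "__main__":
    # Hypothetical path, used for illustration only.
    print(pid_file_status("/var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid"))
```

If the PID reported here differs from the one jps prints, or the file is missing, the process and the status check are looking at different things, which would match the symptom of a running NameNode that is displayed as down.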



@Geoffrey Shelton Okot I am getting the same error too. Could you please suggest a fix?

Master Mentor

@Subramanian Govindasamy

Can you share the NameNode error log?

@Geoffrey Shelton Okot

The services are running on the server, but Ambari does not show them as running, and the GC logs show the following entries. Could you please check?

2018-05-04T06:39:20.038-0400: 130.745: [GC (Allocation Failure) 2018-05-04T06:39:20.038-0400: 130.745: [ParNew: 152348K->17472K(157248K), 0.0294015 secs] 152348K->30350K(506816K), 0.0294737 secs] [Times: user=0.15 sys=0.03, real=0.03 secs]
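
For readers of this thread: the GC line above records a young-generation (ParNew) collection, and "Allocation Failure" is the JVM's label for the normal trigger of such a collection, not an error. The sketch below pulls the numbers out of that exact line so they are easier to read. The function name and the returned field names are made up for illustration.

```python
import re

# The GC log line quoted in this thread, split only for readability.
GC_LINE = (
    "2018-05-04T06:39:20.038-0400: 130.745: [GC (Allocation Failure) "
    "2018-05-04T06:39:20.038-0400: 130.745: [ParNew: 152348K->17472K(157248K), "
    "0.0294015 secs] 152348K->30350K(506816K), 0.0294737 secs] "
    "[Times: user=0.15 sys=0.03, real=0.03 secs]"
)

# Matches "<before>K-><after>K(<capacity>K), <pause> secs".
SIZES = re.compile(r"(\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs")


def parse_parnew_line(line):
    """Return young-generation and whole-heap figures from a ParNew log line."""
    matches = SIZES.findall(line)
    if len(matches) != 2:
        raise ValueError("not a ParNew line in the expected format")
    keys = ("before_kb", "after_kb", "capacity_kb")
    result = {}
    # The first match is the young generation, the second is the whole heap.
    for name, (before, after, capacity, pause) in zip(("young", "heap"), matches):
        entry = dict(zip(keys, (int(before), int(after), int(capacity))))
        entry["pause_secs"] = float(pause)
        result[name] = entry
    return result


if __name__ == "__main__":
    stats = parse_parnew_line(GC_LINE)
    heap = stats["heap"]
    print("heap after GC: %d KB of %d KB" % (heap["after_kb"], heap["capacity_kb"]))
    print("pause: %.4f s" % heap["pause_secs"])
```

In this line the heap holds 30350 KB of a 506816 KB capacity after the collection, with a pause of about 0.03 seconds, so the entry by itself does not point to memory pressure.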

@Geoffrey Shelton Okot

While starting the NameNode with WebHDFS enabled, I get the following errors:

File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 250, in _run_command
raise WebHDFSCallException(err_msg, result_dict)
resource_management.libraries.providers.hdfs_resource.WebHDFSCallException: Execution of 'curl -sS -L -w '%{http_code}' -X GET 'http://node1.test.co:50070/webhdfs/v1/tmp?op=GETFILESTATUS&user.name=thdfs@test.co'' returned status_code=400.
{
"RemoteException": {
"exception": "IllegalArgumentException",
"javaClassName": "java.lang.IllegalArgumentException",
"message": "Invalid value for webhdfs parameter \"user.name\": Invalid value: \"thdfs@test.co\" does not belong to the domain ^[A-Za-z_][A-Za-z0-9._-]*[$]?$"
}
}

Master Mentor

@Subramanian Govindasamy

It seems you have a problem with your auth-to-local rules. Can you please validate them?

"message": "Invalid value for webhdfs parameter"

The username used in the query is checked against a regular expression and, if it does not match, the above exception is returned. The default regular expression is:

^[A-Za-z_][A-Za-z0-9._-]*[$]?$
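
As a quick illustration, the pattern can be tried against candidate usernames. This is a minimal sketch using Python's re module rather than the Java code WebHDFS itself runs; the pattern is copied from the error message above, and the helper name is made up for the example.

```python
import re

# Pattern copied verbatim from the WebHDFS error message in this thread.
USER_NAME_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9._-]*[$]?$")


def is_valid_webhdfs_user(name):
    """Return True if the whole name matches the pattern."""
    return USER_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("hdfs", "thdfs", "thdfs@test.co"):
        print(candidate, is_valid_webhdfs_user(candidate))
```

The "@" in "thdfs@test.co" is not in the allowed character set, so that name is rejected, while a short name such as "thdfs" passes.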

Can you start the NameNode manually?

su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh start namenode"

Please let me know the result.

@Geoffrey Shelton Okot

Thank you. Let me validate the rules.

Please find below the log from starting the NameNode manually:

su thdfs@test.co -l -s /bin/bash -c 'ulimit -c unlimited ; /usr/hdp/2.6.4.0-91/hadoop/sbin/hadoop-daemon.sh --config /usr/hdp/2.6.4.0-91/hadoop/conf start namenode'
starting namenode, logging to /var/log/hadoop/thdfs@test.co/hadoop-thdfs@test.co-namenode-node1.test.co.out
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=256m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=256m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=256m; support was removed in 8.0

ulimit -a for user thdfs@test.co
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127967
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
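
A listing like the one above can be scanned for low limits with a few lines of Python. This is a sketch only: the 10000 minimum for open files is an illustrative assumption, not a value taken from this thread, and the function names are made up. The sample text copies a few lines from the listing above.

```python
import re

# A few lines copied from the ulimit listing in this thread.
SAMPLE = """\
core file size (blocks, -c) unlimited
scheduling priority (-e) 0
open files (-n) 1024
pipe size (512 bytes, -p) 8
max user processes (-u) 4096
"""

# Captures the flag (for example -n) and the value at the end of the line.
LIMIT_LINE = re.compile(r"\((?:.*,\s*)?(-\w)\)\s+(\S+)$")


def parse_ulimit(text):
    """Map each ulimit flag to an int, or to None when the limit is unlimited."""
    limits = {}
    for line in text.splitlines():
        match = LIMIT_LINE.search(line.strip())
        if match:
            flag, value = match.groups()
            limits[flag] = None if value == "unlimited" else int(value)
    return limits


def below_minimum(limits, minimums):
    """Return the flags whose limit is set and lower than the wanted minimum."""
    return sorted(
        flag for flag, wanted in minimums.items()
        if limits.get(flag) is not None and limits[flag] < wanted
    )


if __name__ == "__main__":
    limits = parse_ulimit(SAMPLE)
    # Assumed threshold, for illustration only.
    print(below_minimum(limits, {"-n": 10000}))
```

With that assumed threshold, only the open files limit (-n, 1024) is flagged in this listing.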

@Geoffrey Shelton Okot

Also, the services go to the INSTALLED state automatically after startup. Could you please guide me?

service component DATANODE of service HDFS of cluster TSTHDPCLST has changed from STARTED to INSTALLED at host test.co according to STATUS_COMMAND report