
Adding SecondaryNameNode after 10 months

New Contributor

Hello,

At my new job I inherited a Hadoop (2.4.1) cluster whose Secondary NameNode has been dead since February.

AFAIK I need to resync the image by launching the SecondaryNameNode manually:

hadoop-daemon.sh start secondarynamenode

But I am wondering whether such a long gap will impact the image resync process. Will the primary NameNode stay responsive during it?
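
(For completeness, this is roughly what I plan to run, as the hdfs user; the log path is only my guess for an HDP-style layout:)

sudo -u hdfs hadoop-daemon.sh start secondarynamenode
tail -f /var/log/hadoop/hdfs/hadoop-hdfs-secondarynamenode-*.log   # watch the edits/fsimage transfer and checkpoint progress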

Regards,

JK



4 REPLIES

Super Guru
@Jan K

A couple of things here (make sure you read my last paragraph).

1. The purpose of the Secondary NameNode is to keep the edit log from growing too large: it periodically merges the edit log into the fsimage and keeps a checkpoint of the last merge (every hour or every 1 million transactions by default; see the sketch after this list for how to check your configured values). A small edit log also makes NameNode restarts faster, because the NameNode starts from the fsimage and only has to replay a short edit log.

2. It can reasonably be assumed that you do not have a situation where the NameNode has been up for the whole last 10 months while your Secondary NameNode was dead, which would mean the fsimage and the edit log are far apart.

3. When you start your Secondary NameNode, it will sync the edit file with the fsimage, and after that everything should be back to normal.
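
If you want to confirm the checkpoint settings on your cluster rather than rely on the defaults, you can read them back from the client configs; these are the standard Hadoop 2 property names, your values may differ:

hdfs getconf -confKey dfs.namenode.checkpoint.period   # checkpoint interval in seconds (default 3600)
hdfs getconf -confKey dfs.namenode.checkpoint.txns     # transaction count that also triggers a checkpoint (default 1000000)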

Now, what I don't understand is why you need a Secondary NameNode at all. The Secondary NameNode dates from the days when Hadoop didn't have a Standby NameNode for failover (the NameNode used to be a single point of failure). What you should do instead is run a Standby NameNode along with at least three JournalNodes to sync the edits between the NameNodes. That should be it; you don't need a Secondary NameNode then. A Secondary NameNode means you still have a single point of failure, whereas a Standby NameNode means there is no single point of failure. It also makes it highly unlikely that you lose metadata in the event of a disk failure, because the JournalNodes are spread across three machines. Maybe you already have a Standby NameNode and JournalNodes syncing the active and standby NameNodes, and that is why nobody cared about the Secondary NameNode.
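
For reference, the rough shape of enabling HA with QJM looks like this. It is only a sketch (the real work is the HA section of hdfs-site.xml, and Ambari can drive the whole procedure for you), but the commands themselves are the standard ones:

hdfs namenode -initializeSharedEdits   # on the existing NameNode, once the JournalNodes are up: seeds them with the current edits
hdfs namenode -bootstrapStandby        # on the new standby host: copies the current fsimage from the active NameNode
hdfs zkfc -formatZK                    # only if you also want automatic failover via ZooKeeper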

New Contributor

Thanks a lot for your answer.

Regarding your last paragraph:

Well, basically I need the SecondaryNameNode to be fully operational because, as far as I can see, there is only the PrimaryNameNode (plus DataNodes, of course) running on "my inherited infrastructure", and in case of a crash/restart of this server, according to the documentation, it can take a long time (how long exactly, assuming the tarred size of /hadoop/hdfs/namenode/current/ is 300 MB?) to start the PrimaryNameNode process.

The files currently in the /hadoop/hdfs/namenode/current/ directory show that the SecondaryNameNode was operational until 19.02.2016. After its crash the edits_ files became bigger (expected behavior); a quick way to gauge how much would have to be replayed is sketched after the listing below.

-rw-r--r-- 1 hdfs hadoop  33K 2016-02-18  edits_0000000000004018182-0000000000004018420
-rw-r--r-- 1 hdfs hadoop  33K 2016-02-19  edits_0000000000004018421-0000000000004018656
-rw-r--r-- 1 hdfs hadoop  35K 2016-02-19  edits_0000000000004018657-0000000000004018904
-rw-r--r-- 1 hdfs hadoop  33K 2016-02-19  edits_0000000000004018905-0000000000004019140
-rw-r--r-- 1 hdfs hadoop  33K 2016-02-19  edits_0000000000004019141-0000000000004019376
-rw-r--r-- 1 hdfs hadoop 269M 06-16 11:15 edits_0000000000004019377-0000000000006019392
-rw-r--r-- 1 hdfs hadoop 226M 09-28 08:20 edits_0000000000006019393-0000000000008019731
-rw-r--r-- 1 hdfs hadoop 221M 09-29 22:20 edits_0000000000008019732-0000000000010019773
-rw-r--r-- 1 hdfs hadoop 221M 10-01 12:20 edits_0000000000010019774-0000000000012020578
-rw-r--r-- 1 hdfs hadoop 201M 10-18 23:25 edits_0000000000012020579-0000000000014020981
-rw-r--r-- 1 hdfs hadoop 231M 10-24 09:45 edits_0000000000014020982-0000000000016021135
-rw-r--r-- 1 hdfs hadoop 221M 11-12 05:30 edits_0000000000016021136-0000000000018021904
-rw-r--r-- 1 hdfs hadoop  92M 11-21 16:03 edits_inprogress_0000000000018021905
-rw-r--r-- 1 hdfs hadoop  22M 2016-02-19  fsimage_0000000000004019140
-rw-r--r-- 1 hdfs hadoop   62 2016-02-19  fsimage_0000000000004019140.md5
-rw-r--r-- 1 hdfs hadoop  22M 2016-02-19  fsimage_0000000000004019376
-rw-r--r-- 1 hdfs hadoop   62 2016-02-19  fsimage_0000000000004019376.md5
-rw-r--r-- 1 hdfs hadoop    9 11-12 05:30 seen_txid
-rw-r--r-- 1 hdfs hadoop  205 2014-11-04  VERSION
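
(If it helps, I suppose the amount of pending work could be gauged with the offline edits viewer on one of the segments above; hdfs oev is a standard tool, the file name below is just an example from my listing:)

hdfs oev -i /hadoop/hdfs/namenode/current/edits_0000000000016021136-0000000000018021904 -o /tmp/edits-stats.txt -p stats
cat /tmp/edits-stats.txt   # per-operation counts, i.e. how many transactions a restart or checkpoint must replay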

I would love to implement the Standby NameNode if it can be done without a PrimaryNameNode restart (it is a production HDFS 🙂 and my management did not have a clue about the importance of their own HDFS system 🙂).

I do not know why it was stopped/killed; the person responsible for it is unreachable, so I need to do a small investigation. I see that the system was started via Ambari.

So to summarize it all:

Which is better in this case (the SecondaryNameNode has not been running for 10 months anyway):

1. Start the SecondaryNameNode manually and wait, fingers crossed 🙂, for the edit file to synchronize with the fsimage?

2. Reconfigure the old SecondaryNameNode host and manually start a Standby NameNode on that server instead?

This is Hadoop 2.4.1, so AFAIK only one Standby NameNode is possible; please correct me if I am wrong.

Thank you,

Jan K

New Contributor

[SOLVED]

Hello,

First I read this awesome explanation of HDFS metadata directories.

Then I regained access to the ambari-server (which was switched off for some reason) and launched the SecondaryNameNode start script via ambari-agent.

Watching the log files, I saw it copying the edits and fsimage from the PrimaryNameNode and then purging old edits.

Now the metadata directory structure on my SecondaryNameNode is the same as on the PrimaryNameNode, and it performs a checkpoint every 6 hours (a quick sanity check is sketched after the listing below).

 -rw-r--r-- 1 hdfs hadoop 152M 11-24 12:58 edits_0000000000018021905-0000000000019484203
-rw-r--r-- 1 hdfs hadoop  14M 11-24 18:57 edits_0000000000019484204-0000000000019605416
-rw-r--r-- 1 hdfs hadoop  14M 11-25 00:57 edits_0000000000019605417-0000000000019726471
-rw-r--r-- 1 hdfs hadoop  14M 11-25 06:57 edits_0000000000019726472-0000000000019847549
-rw-r--r-- 1 hdfs hadoop  14M 11-25 12:57 edits_0000000000019847550-0000000000019968584
-rw-r--r-- 1 hdfs hadoop  17M 11-25 06:57 fsimage_0000000000019847549
-rw-r--r-- 1 hdfs hadoop   62 11-25 06:57 fsimage_0000000000019847549.md5
-rw-r--r-- 1 hdfs hadoop  17M 11-25 12:57 fsimage_0000000000019968584
-rw-r--r-- 1 hdfs hadoop   62 11-25 12:57 fsimage_0000000000019968584.md5
-rw-r--r-- 1 hdfs hadoop  205 11-25 12:57 VERSION
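
(As a sanity check, I suppose one can also compare the image the NameNode serves with the newest one on the SecondaryNameNode; dfsadmin -fetchImage is a standard command, while the checkpoint directory path is just my layout:)

hdfs dfsadmin -fetchImage /tmp/                                       # downloads the NameNode's most recent fsimage
md5sum /tmp/fsimage_* /hadoop/hdfs/namesecondary/current/fsimage_*    # the newest pair should be identical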

In the near future I plan to set up a new HDFS cluster with Ambari, using the latest HDP with HA enabled, and copy the data over from the old HDFS.
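
(The copy itself will most likely be a plain distcp run from the new cluster; the hostnames and paths below are made up, and going over webhdfs is my assumption for bridging the version gap:)

hadoop distcp webhdfs://old-nn:50070/data hdfs://new-nn:8020/data   # distcp runs as a MapReduce job on the destination cluster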

Thank you very much for your help,

Regards,

JK

Super Guru

Awesome. Thanks for sharing.