standby NameNode failure to start - root cause analysis

we encountered the following issues on a production site:

  • the standby NameNode crashed
  • the active NameNode came close to failing because the storage used by the JournalNode was almost full

relevant versions: HDP- with HDFS 2.7.3

here's the root cause analysis of this incident along with how it got solved.

the issue was handled and solved by Uri B. 

  1. standby NN on master03 crashed due to small NN heap (only 9GB).
    1. the heap size wasn't sufficient due to the increase in the number of inodes
    2. this increase is a result of the creation of many parquet files due to ETLs that perform coalesce (N) while writing the output.
    3. lessons:
      1. monitor the NN heap usage and increase the NN heap if it starts to suffer from full GCs (FGC)
      2. compact the output to reduce the number of parquet files (this also benefits Presto queries, since fewer files are scanned per query)
  2. edit logs got corrupted after the standby NN's crash
  3. the standby NN didn't start because of the corrupted edit log and also a lack of RAM
  4. as a result of the standby NN being down:
    1. the JournalNode (JN) mount point on the standby master started to fill up with edit logs and reached 99% usage while the standby NN was down
    2. the FSImage wasn't updated for 2 weeks on the master where the standby NN ran
  5. the JN disks on the active NN's master also became almost full (they started filling up when the standby NN crashed)
  6. so the active NN was very close to failing because of the almost full JN mount point. this is how the issue was mitigated:
    1. mounted new, larger disks to the JN mount points
    2. restarted the active NN with much more RAM (50GB instead of 9GB)
    3. the active NN replayed the edit logs for ~1h until it came up
    4. the FSImage was updated to the current time
  7. here's how the standby NN got started:
    1. the FSImage was copied using the bootstrapStandby tool from the active master to the standby master:
      1. on master01 run:
        1. su hdfs
        2. hdfs dfsadmin -safemode enter
        3. hdfs dfsadmin -saveNamespace
        4. hdfs dfsadmin -safemode leave
      2. on master03 run:
        1. su hdfs
        2. hdfs namenode -bootstrapStandby -force
    2. restarted the standby NN with much more RAM (50GB instead of 9GB)
    3. this time the standby NN didn't crash on a corrupted edit log, and it started replaying the edit logs into the FSImage
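
the two monitoring lessons above (watch the NN heap for full GCs, and watch the JN mount point before it fills up) could be sketched roughly as below. this is only a minimal sketch and not what was actually deployed on the cluster: the function names, the 80% default limit and the message format are all assumptions made for illustration. it relies only on `df -P` and on the output of `jstat -gcutil`, which ships with the JDK.

```shell
#!/usr/bin/env bash
# Monitoring sketch for the two lessons above. All names and thresholds
# here are illustrative assumptions, not values taken from the incident.

# Print the percentage used (integer) of the filesystem that holds DIR.
usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when DIR's filesystem is at or above LIMIT percent (default 80).
# Returns 0 when below the limit, 1 when at or above it, 2 if usage
# could not be read.
check_mount() {
  local dir="$1" limit="${2:-80}" pct
  pct="$(usage_pct "$dir")"
  if [ -z "$pct" ]; then
    echo "ERROR: could not read usage for $dir" >&2
    return 2
  fi
  if [ "$pct" -ge "$limit" ]; then
    echo "WARN: $dir is ${pct}% full (limit ${limit}%)"
    return 1
  fi
  echo "OK: $dir is ${pct}% full"
}

# Read `jstat -gcutil <pid>` output on stdin and print the full GC count.
# The FGC column is located by its header name because its position
# differs between JVM versions.
fgc_count() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "FGC") col = i }
       NR == 2 && col { print $col }'
}
```

for example, `check_mount <JN edits directory> 80` could run from cron on each master, and `jstat -gcutil <NameNode pid> | fgc_count` could be sampled periodically as the user that owns the NameNode process. a full GC count that keeps climbing between samples would be a hint that the NN heap needs to grow.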