Why does Standby Namenode read from Journals Nodes when Datanodes already send block report to it?

Rising Star

Hi,


As per my knowledge (please correct me if I am wrong), the DataNodes send block reports to both the Active and Standby NameNodes.

The job of the Active NameNode is to write to the Journal Nodes, and the job of the Standby NameNode is to read from the Journal Nodes.

So why does the Standby NameNode need to read from the Journal Nodes when the DataNodes (slaves) are already sending their block reports to it?

4 Replies


The above was originally posted in the Community Help Track. On Wed May 22 03:20 UTC 2019, a member of the HCC moderation staff moved it to the Hadoop Core track. The Community Help Track is intended for questions about using the HCC site itself.

Bill Brooks, Community Moderator
Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

Master Mentor

@Shesh Kumar

Read the Quorum Journal Nodes section below carefully; it explains why the Standby NameNode reads the edits from the Journal Nodes. The Journal Nodes have nothing to do with block reports: what the Standby reads from them is the edit log (and, via checkpoints, the FSImage), while block reports come to it directly from the DataNodes, not from the Active NameNode.

NameNode

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

The inodes and the list of blocks that define the metadata of the name system are called the image. NameNode keeps the entire namespace image in RAM. The persistent record of the image stored in the NameNode's local native filesystem is called a checkpoint. The NameNode records changes to HDFS in a write-ahead log called the journal in its local native filesystem. The location of block replicas is not part of the persistent checkpoint.

Each client-initiated transaction is recorded in the journal, and the journal file is flushed and synced before the acknowledgment is sent to the client. The checkpoint file is never changed by the NameNode; a new file is written when a checkpoint is created during a restart, when requested by the administrator, or by a CheckpointNode. During startup, the NameNode initializes the namespace image from the checkpoint and then replays changes from the journal. A new checkpoint and an empty journal are written back to the storage directories before the NameNode starts serving clients.

For improved durability, redundant copies of the checkpoint and journal are typically stored on multiple independent local volumes and at remote NFS servers. The first choice prevents loss from a single volume failure, and the second choice protects against the failure of the entire node. If the NameNode encounters an error writing the journal to one of the storage directories it automatically excludes that directory from the list of storage directories. The NameNode automatically shuts itself down if no storage directory is available.
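
For example, that redundancy is usually configured by listing several storage directories (local volumes plus an NFS mount) in a single property. A minimal sketch using the Hadoop Configuration API follows; the property name is the standard one, but the directory paths are purely illustrative:

    import org.apache.hadoop.conf.Configuration;

    public class NameNodeStorageDirs {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // dfs.namenode.name.dir accepts a comma-separated list of directories;
            // the NameNode writes the checkpoint and journal to every one of them.
            // The paths below are placeholders, not defaults.
            conf.set("dfs.namenode.name.dir",
                     "file:///data/1/dfs/nn,file:///data/2/dfs/nn,file:///mnt/nfs/dfs/nn");
            System.out.println(conf.get("dfs.namenode.name.dir"));
        }
    }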

The NameNode is a multithreaded system and processes requests simultaneously from multiple clients. Saving a transaction to disk becomes a bottleneck since all other threads need to wait until the synchronous flush-and-sync procedure initiated by one of them is complete. In order to optimize this process, the NameNode batches multiple transactions. When one of the NameNode's threads initiates a flush-and-sync operation, all the transactions batched at that time are committed together. Remaining threads only need to check that their transactions have been saved and do not need to initiate a flush-and-sync operation.
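
The batching described above is essentially a group commit. The sketch below is not Hadoop's actual FSEditLog code, just a minimal, assumed illustration of the idea: one thread performs the expensive flush-and-sync for every transaction buffered so far, and the other threads only verify that their transaction has already been made durable.

    // Hypothetical group-commit sketch; not Hadoop's FSEditLog implementation.
    import java.util.ArrayList;
    import java.util.List;

    class BatchedJournal {
        private final List<String> buffer = new ArrayList<>(); // edits not yet on disk
        private long lastWrittenTxId = 0;   // highest txid appended to the buffer
        private long lastSyncedTxId = 0;    // highest txid already flushed and synced
        private boolean syncInProgress = false;

        // Record one transaction in the in-memory buffer and hand back its id.
        synchronized long logEdit(String edit) {
            buffer.add(edit);
            return ++lastWrittenTxId;
        }

        // Block until transaction 'txId' is durable; only one thread does the real sync.
        void logSync(long txId) throws InterruptedException {
            long syncUpTo;
            List<String> toFlush;
            synchronized (this) {
                // Wait while another thread's sync is running but does not yet cover us.
                while (syncInProgress && lastSyncedTxId < txId) {
                    wait();
                }
                if (lastSyncedTxId >= txId) {
                    return;                  // a previous batch already covered this txid
                }
                syncInProgress = true;
                syncUpTo = lastWrittenTxId;  // commit everything batched so far
                toFlush = new ArrayList<>(buffer);
                buffer.clear();
            }
            flushAndSync(toFlush);           // expensive disk write, done outside the lock
            synchronized (this) {
                lastSyncedTxId = syncUpTo;
                syncInProgress = false;
                notifyAll();                 // wake threads whose txids are now durable
            }
        }

        private void flushAndSync(List<String> edits) {
            // Stand-in for writing the edits to the journal file and calling fsync().
        }
    }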


DataNodes

During startup, each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode, the DataNode automatically shuts down.

The namespace ID is assigned to the filesystem instance when it is formatted. The namespace ID is persistently stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join the cluster, thus protecting the integrity of the filesystem. A DataNode that is newly initialized and without any namespace ID is permitted to join the cluster and receive the cluster's namespace ID.
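
As a rough sketch of that registration-time check (assumed illustration only, not Hadoop source code): a freshly formatted DataNode adopts the cluster's namespace ID, while a mismatching one shuts down.

    // Hypothetical sketch of the namespace ID handshake; not Hadoop's actual code.
    class HandshakeCheck {
        static final int UNINITIALIZED = 0;   // a newly initialized DataNode has no namespace ID yet

        // Returns the namespace ID the DataNode should run with, or throws if it must shut down.
        static int handshake(int nameNodeNamespaceId, int dataNodeNamespaceId) {
            if (dataNodeNamespaceId == UNINITIALIZED) {
                return nameNodeNamespaceId;   // new DataNode adopts the cluster's namespace ID
            }
            if (dataNodeNamespaceId != nameNodeNamespaceId) {
                throw new IllegalStateException("Namespace ID mismatch: DataNode must shut down");
            }
            return dataNodeNamespaceId;       // IDs match, registration can proceed
        }
    }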

After the handshake, the DataNode registers with the NameNode. DataNodes persistently store their unique storage IDs. The storage ID is an internal identifier of the DataNode, which makes it recognizable even if it is restarted with a different IP address or port. The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.

A DataNode identifies block replicas in its possession to the NameNode by sending a block report. A block report contains the block ID, the generation stamp and the length for each block replica the server hosts. The first block report is sent immediately after the DataNode registration. Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the cluster.
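
To picture what a block report carries, here is a hypothetical sketch (not HDFS's actual wire format): one entry per hosted replica, identified by block ID, generation stamp, and length, tagged with the DataNode's storage ID.

    // Hypothetical model of a block report; not the real HDFS protocol classes.
    import java.util.List;

    class BlockReport {
        // One entry per block replica hosted by the DataNode.
        static class ReplicaEntry {
            final long blockId;          // identifies the block
            final long generationStamp;  // distinguishes stale replicas from current ones
            final long numBytes;         // length of the replica on disk

            ReplicaEntry(long blockId, long generationStamp, long numBytes) {
                this.blockId = blockId;
                this.generationStamp = generationStamp;
                this.numBytes = numBytes;
            }
        }

        final String storageId;          // which DataNode storage the report describes
        final List<ReplicaEntry> replicas;

        BlockReport(String storageId, List<ReplicaEntry> replicas) {
            this.storageId = storageId;
            this.replicas = replicas;
        }
    }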

During normal operation, DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode in ten minutes the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The NameNode then schedules the creation of new replicas of those blocks on other DataNodes.
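
For reference, in recent Hadoop releases the point at which a DataNode is declared dead is commonly derived from two settings; assuming the usual defaults (a 3-second heartbeat and a 5-minute recheck interval), the small calculation below lands close to the ten-minute figure mentioned above. Verify the property names and defaults against your release.

    // Sketch of how the "dead DataNode" timeout is commonly derived in Hadoop.
    public class HeartbeatExpiry {
        public static void main(String[] args) {
            long heartbeatIntervalMs = 3 * 1000L;        // dfs.heartbeat.interval (default 3 s)
            long recheckIntervalMs   = 5 * 60 * 1000L;   // dfs.namenode.heartbeat.recheck-interval (default 5 min)

            // A DataNode is considered dead after roughly:
            long expiryMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;

            System.out.println("DataNode declared dead after " + expiryMs / 1000.0 / 60.0 + " minutes");
            // Prints 10.5 minutes with the defaults, i.e. the ten-minute order of magnitude above.
        }
    }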

Heartbeats from a DataNode also carry information about total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode's block allocation and load balancing decisions.

The NameNode does not directly send requests to DataNodes. It uses replies to heartbeats to send instructions to the DataNodes. The instructions include commands to replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, and shut down the node.
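
Those piggy-backed instructions can be pictured as a small set of command types; the enum below is purely illustrative and does not mirror Hadoop's actual DatanodeCommand classes.

    // Purely illustrative; real heartbeat replies carry DatanodeCommand objects.
    enum NameNodeInstruction {
        REPLICATE_BLOCKS,     // copy the listed block replicas to other DataNodes
        INVALIDATE_BLOCKS,    // remove the listed local block replicas
        RE_REGISTER,          // re-register and send an immediate block report
        SHUTDOWN              // shut the DataNode down
    }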


High Availability

The HDFS NameNode High Availability feature enables you to run redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This eliminates the NameNode as a potential single point of failure (SPOF) in an HDFS cluster.
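
Concretely, an HA pair is declared to HDFS through a logical nameservice and two NameNode IDs. The property names in the sketch below are the standard ones; the nameservice name, NameNode IDs, and hostnames are made-up placeholders, and in practice these values live in hdfs-site.xml rather than being set in code.

    import org.apache.hadoop.conf.Configuration;

    public class HaNameNodeConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // "mycluster", "nn1", "nn2" and the hostnames are placeholders, not defaults.
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
            System.out.println(conf.get("dfs.ha.namenodes.mycluster"));
        }
    }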


Standby NameNode

It does three things:

  • Merges the fsimage and edit-log files.
  • Receives online updates of the file-system metadata (via the Journal Nodes), applies them to its in-memory state, and persists them on disk, just as the active NameNode does.
  • Performs checkpoints of the namespace state.


Thus, at any time, the Standby NameNode contains an up-to-date image of the namespace, both in memory and on local disk(s). If the active NameNode dies, the cluster switches over to this standby node.


Quorum Journal Nodes

The Quorum Journal Nodes are the HDFS mechanism that provides a shared edit log. They allow the edit log to be shared between the active and standby NameNodes.

The Standby NameNode synchronizes its state with the active NameNode for high availability. This happens through a group of separate daemons called Journal Nodes. The Journal Nodes run as a quorum, and there should be at least three of them.

For N Journal Nodes, the system can tolerate at most (N-1)/2 failures and continue to work. So for three Journal Nodes, the system can tolerate the failure of one of them, since (3-1)/2 = 1.

Whenever the active node performs a modification, it logs the modification to the Journal Nodes. The standby node reads the edits from the Journal Nodes and continuously applies them to its own namespace. In the case of a failover, the standby will ensure that it has read all of the edits from the Journal Nodes before promoting itself to the active state. This ensures that the namespace state is fully synchronized before a failover occurs.
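
In practice, both NameNodes point at the same Journal Node group through a single shared edits URI: the active writes to it and the standby tails it. In the sketch below the property name and the qjournal:// scheme are standard, 8485 is the usual Journal Node port, and the hostnames and the "mycluster" journal ID are placeholders; as with the HA sketch above, this would normally live in hdfs-site.xml.

    import org.apache.hadoop.conf.Configuration;

    public class SharedEditsConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Hostnames and the "mycluster" journal ID are placeholders.
            conf.set("dfs.namenode.shared.edits.dir",
                     "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
            System.out.println(conf.get("dfs.namenode.shared.edits.dir"));
        }
    }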

To provide a fast failover, the standby node must have up-to-date information about the location of data blocks in the cluster. For this to happen, the addresses of both NameNodes are made available to all the DataNodes, and the DataNodes send block location information and heartbeats to both.

HTH

Rising Star

@Geoffrey Shelton Okot : Thank you for the detailed explanation.

Master Mentor

@Shesh Kumar

I am happy this compilation has helped give you a better understanding.

If you found this answer addressed your question, please take a moment to log in and click the "accept" link on the answer.

That would be a great help to other Community users trying to find the solution to this kind of question quickly.

Happy Hadooping