Member since: 02-22-2024
Posts: 28
Kudos Received: 13
Solutions: 0
01-10-2026
12:24 AM
@rizalt FYI
➤ It appears your cluster is experiencing a quorum failure: the critical metadata services (NameNode, JournalNodes, and ZooKeeper) lose the ability to maintain a majority when one Data Center (DC) goes offline.
➤ Analyzing Your Failure
In a High Availability (HA) setup, the NameNodes rely on a quorum of JournalNodes (JN) and ZooKeeper (ZK) nodes to stay alive. If you have 5 JNs, at least 3 must be running for either NameNode to function. Based on your diagram:
The Problem: If you split your 5 JournalNodes across only two sites (e.g., 2 in DC1 and 3 in DC2) and the DC with 3 JNs goes down, the remaining 2 JNs cannot form a quorum. This causes both NameNodes to shut down immediately to prevent data corruption ("split brain").
DataNodes vs. NameNodes: In HDFS, the number of DataNodes (DN) that fail has no direct impact on whether the NameNode stays up or down. You could lose 15 out of 16 DataNodes and the NameNode should still stay active. The fact that your NameNodes crash when 8 DNs (one full server/DC) go down proves that your quorum nodes (JN/ZK) are failing, not the DataNodes.
➤ The Solution: Quorum Placement
To survive a full Data Center failure (50% of your physical infrastructure), you cannot rely on an even split of nodes. You need a third location (a "witness" site) or an asymmetric distribution.
1. The 3-Site Strategy (Recommended)
To handle a 1-DC failure with 5 JournalNodes and 5 ZooKeeper nodes, place them as follows:
- DC 1: 2 JN, 2 ZK
- DC 2: 2 JN, 2 ZK
- Site 3 (Witness): 1 JN, 1 ZK (this can be a very small virtual machine or cloud instance)
Why this works: If DC1 or DC2 fails, the surviving DC plus the witness site still has 3 nodes, which satisfies the quorum ($3 > 5/2$).
2. Maximum DataNode Failure
Theoretically: You can lose all but 1 DataNode ($N-1$) and the NameNode will stay active.
Practically: With a replication factor of 3, if you lose 50% of your nodes, many blocks will become under-replicated, and some may become missing if all three copies were in the DC that died.
Solution: Ensure Rack Awareness is configured so HDFS knows which nodes belong to which DC. This forces HDFS to keep at least one copy of each block in each DC.
➤ Why 11 JN and 11 ZK didn't work
Increasing the number of nodes to 11 actually makes the cluster more fragile if they are only placed in two locations. With 11 nodes, you need 6 alive to form a quorum. If you have 5 in DC1 and 6 in DC2 and DC2 fails, the 5 remaining nodes in DC1 cannot reach the 6-node requirement.
➤ Checklist for Survival
- Reduce to 5 JNs and 5 ZKs: too many nodes increase network latency and management overhead.
- Add a 3rd location: even a single low-power node in a different building or cloud region can act as the tie-breaker.
- Check dfs.namenode.shared.edits.dir: ensure both NameNodes point to all JournalNodes through the qjournal:// URI.
- ZKFC: ensure the HDFS ZKFailoverController is running on both NameNode hosts.
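A minimal way to sanity-check the quorum and rack configuration from the command line (the NameNode service IDs nn1/nn2 below are assumptions, substitute your own):
# The shared edits dir should be a single qjournal:// URI listing all JournalNodes
hdfs getconf -confKey dfs.namenode.shared.edits.dir
# Every DataNode should report a rack that maps to its DC (requires rack awareness)
hdfs dfsadmin -printTopology
# Check the current HA state of each NameNode
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2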
07-02-2025
01:34 PM
Hi @rizalt, from your report you probably have snapshots enabled for this directory, so any delete in this directory will not fully free space unless the snapshot is also deleted. Keep in mind that deleting the snapshot will make it impossible to recover the data later if necessary. So, on the NameNode Web UI, check your snapshots under the "Snapshot" tab.
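If it helps, a quick way to inspect and, once you are sure, remove snapshots from the command line (/path/to/dir and snapshotName below are placeholders):
# List the snapshottable directories you own
hdfs lsSnapshottableDir
# List the snapshots of a given directory
hdfs dfs -ls /path/to/dir/.snapshot
# Delete a snapshot only after confirming you no longer need it for recovery
hdfs dfs -deleteSnapshot /path/to/dir snapshotName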
05-23-2025
04:48 PM
Hello @rizalt, have you tested the same query directly in Beeline? Can you try it?
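For example, something like this (host, port, and realm are placeholders; drop the principal part if the cluster is not kerberized):
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default;principal=hive/_HOST@YOUR.REALM"
Running the same statement there helps isolate whether the problem is in the query itself or in the original client.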
09-17-2024
08:37 AM
1 Kudo
@rizalt There is very little detail in your post. NiFi will run as whatever user is used to start it unless the "run.as" property is set in the NiFi bootstrap.conf file. If the user executing the "./nifi.sh start" command is not the root user and you set the "run.as" property to "root", that user would need sudo permissions in Linux to start NiFi as the root user. The "run.as" property is ignored on Windows, where the service will always be owned by the user that starts it.
NOTE: Starting the service as a different user than it was previously started as will not trigger a change in file ownership in NiFi's directories. You would need to update file ownership manually before starting as a different user (this includes all of NiFi's repositories). While the "root" user has access to all files regardless of owner, issues will arise if a non-root user launches the app while the files are owned by another user, including root.
Please help our community thrive. If any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to log in and click "Accept as Solution" on one or more of them. Thank you, Matt
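For illustration, a sketch of the relevant settings, assuming a Linux install under /opt/nifi and a service account named "nifi" (both the path and the account name are assumptions):
# conf/bootstrap.conf -- NiFi will be started as this user
run.as=nifi
# If the repositories were previously created by another user, fix ownership before starting
# (repository paths below are the defaults and may differ in your install)
chown -R nifi:nifi /opt/nifi/content_repository /opt/nifi/flowfile_repository /opt/nifi/provenance_repository /opt/nifi/database_repository
Then start NiFi with ./nifi.sh start; the launching user needs permission (e.g., sudo) to switch to the run.as user.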
07-23-2024
10:56 PM
@rizalt Yes, the Key Version Numbers (KVNO) of different principals can indeed be different. Each principal in Kerberos can have its own KVNO, which is an identifier that increments each time the key for that principal is changed.
Reference: https://web.mit.edu/kerberos/www/krb5-latest/doc/user/user_commands/kvno.html#:~:text=specified%20Kerberos%20principals
Regards, Chethan YM
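For example, to compare the KVNO the KDC currently has for a principal with the KVNO stored in a keytab (the principal and keytab path below are placeholders):
# Ask the KDC for the current key version of a service principal (requires a valid TGT)
kvno hdfs/host.example.com@EXAMPLE.COM
# Show the key versions stored in a keytab, with timestamps and encryption types
klist -kte /etc/security/keytabs/example.service.keytab
If the two numbers differ, the keytab is stale and needs to be regenerated.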
07-22-2024
06:03 PM
Thanks for the reply @shubham_sharma, I'm not using an AD account, just Kerberos.
06-12-2024
06:59 AM
@rizalt It's hard to propose a solution here without a lot more details (versions of the stack, Ambari, Java, etc.). That said, some of the hits I find for this error are from 2017, many years ago. There is significant risk in operating Ambari, as it is no longer supported or available within the Cloudera ecosystem. I would recommend you evaluate modern solutions (CDP) for the same stack of services you are familiar with in Ambari.
06-11-2024
10:29 PM
1 Kudo
Hi @rizalt The error is because you have not provided the keytab path; the command should look like below:
> klist -k example.keytab
To create the keytab, you can refer to the steps below:
$ ktutil
ktutil: addent -password -p myusername@FEDORAPROJECT.ORG -k 42 -f
Password for myusername@FEDORAPROJECT.ORG:
ktutil: wkt /tmp/kt/fedora.keytab
ktutil: q
Then:
kinit -kt /tmp/kt/fedora.keytab myusername@FEDORAPROJECT.ORG
Note: Replace the username and REALM as per your cluster configuration.
Regards, Chethan YM
06-11-2024
12:56 AM
I tried the kinit command to make sure the password is correct, but the kinit message is "Password incorrect while getting initial credentials", like below:
root@master1:~# kinit nm/slave1.hadoop.com@HADOOP.COM
Password for nm/slave1.hadoop.com@HADOOP.COM:
kinit: Password incorrect while getting initial credentials
Should I recreate the principal or change the password? Please give me a suggestion; I'm sure the password is correct.
06-07-2024
04:07 PM
1 Kudo
@Shelton I'm using Ubuntu 22.04 and ODP (https://clemlabs.s3.eu-west-3.amazonaws.com/ubuntu22/odp-release/1.2.2.0-46/ODP)