Member since
01-19-2017
3681
Posts
633
Kudos Received
372
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1643 | 06-04-2025 11:36 PM |
| | 2089 | 03-23-2025 05:23 AM |
| | 999 | 03-17-2025 10:18 AM |
| | 3779 | 03-05-2025 01:34 PM |
| | 2603 | 03-03-2025 01:09 PM |
07-29-2021
01:04 PM
@Vinay1991 The ZKs look okay. Please go through the connectivity checklist I shared and validate each item one by one.
07-28-2021
10:30 AM
1 Kudo
@Vinay1991 From the logs, I see connectivity loss, and that is precisely what is causing the NN switch. Remember the importance of the ZK quorum from my earlier post! Your NN is losing its connection to ZK, so the NN that loses its active session triggers the election of a new active NN, and that keeps happening in a loop.

Caused by: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read

I would start by checking the firewall. I see you are on Ubuntu, so ensure the firewall is disabled across the cluster.

Identifying and Fixing Socket Timeouts

The root cause of a socket timeout is a connectivity failure between the machines, so try the usual process:
- Check the settings: is this the machine you really wanted to talk to?
- From the machine that is raising the exception, can you resolve the hostname?
- Is that resolved hostname the correct one?
- Can you ping the remote host?
- Is the target machine running the relevant Hadoop processes?
- Can you telnet to the target host and port?
- Can you telnet to the target host and port from any other machine?
- On the target machine, can you telnet to the port using localhost as the hostname? If this works but external network connections time out, it's usually a firewall issue.
- If it is a remote object store: is the address correct?
- Does it go away when you repeat the operation?
- Does it only happen on bulk operations? If so, it's probably due to throttling at the far end.

Check your hostname resolution: DNS or /etc/hosts should be in sync on every host. Just as important, the time on all your hosts should be in sync.

Can you share the value of the core-site.xml parameter ha.zookeeper.quorum?
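The resolve, ping, and port checks in the list above can be run against every ZooKeeper host in one go. The following is only a sketch: the function names are made up for illustration, it assumes bash (for /dev/tcp), coreutils timeout, getent, and ping are available, and it assumes the quorum string has the usual host:port,host:port form.

```shell
#!/usr/bin/env bash
# Sketch only: these helper names are hypothetical, not part of Hadoop or ZooKeeper.

# Split a quorum string such as "zk1:2181,zk2:2181" into "host port" lines.
split_quorum() {
  local entry
  local IFS=','
  for entry in $1; do
    printf '%s %s\n' "${entry%%:*}" "${entry##*:}"
  done
}

# Return 0 if a TCP connection to host:port opens within 5 seconds.
# Relies on bash's /dev/tcp redirection and coreutils `timeout`.
port_open() {
  timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Resolve, ping, and probe every quorum member; print one line per check.
check_quorum() {
  local host port
  split_quorum "$1" | while read -r host port; do
    if getent hosts "$host" >/dev/null; then
      echo "$host resolves"
    else
      echo "$host DOES NOT resolve"
    fi
    if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
      echo "$host answers ping"
    else
      echo "$host gives no ping reply"
    fi
    if port_open "$host" "$port"; then
      echo "$host:$port reachable"
    else
      echo "$host:$port NOT reachable"
    fi
  done
}
```

Run it from each NameNode host, for example with check_quorum "$(hdfs getconf -confKey ha.zookeeper.quorum)". A host that resolves and answers ping but has an unreachable port usually points at a firewall.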
07-27-2021
12:35 PM
2 Kudos
@Vinay1991 Unfortunately, you haven't described your cluster setup, but my assumption is that you have 3 ZKs in your HA implementation.

There are two components deployed to Hadoop HDFS for implementing automatic failover:
- ZKFailoverController process (ZKFC)
- ZooKeeper quorum (3 ZKs)

1. ZKFailoverController (ZKFC)

The ZKFC is a ZooKeeper client that is also responsible for monitoring and managing the NameNode state. A ZKFC runs on every node in the Hadoop cluster that runs a NameNode. It is responsible for:

Health monitoring
The ZKFC periodically pings its local NameNode with health-check commands. As long as the NameNode responds in time with a healthy status, the ZKFC considers it healthy. If the NameNode crashes, freezes, or otherwise enters an unhealthy state, the ZKFC marks it as unhealthy.

ZooKeeper session management
The ZKFC keeps a session open in ZooKeeper while the local NameNode is healthy. If the local NameNode is the active NameNode, the session also holds a special lock "znode". This lock relies on ZooKeeper's support for "ephemeral" nodes, so if the session expires, the lock node is deleted automatically.

ZooKeeper-based election
When the local NameNode is healthy and the ZKFC sees that no other NameNode holds the lock znode, it tries to acquire the lock itself. If it succeeds, the ZKFC has won the election and is responsible for running the failover that makes its local NameNode active. This failover is similar to the manual failover described in the NameNode High Availability article.

2. ZooKeeper quorum

A ZK quorum is a highly available service for maintaining small amounts of coordination data. It notifies clients about changes in that data and monitors clients for failures. The HDFS implementation of automatic failover depends on ZooKeeper for the following:

Failure detection
Each NameNode machine in the Hadoop cluster maintains a persistent session in ZooKeeper. If a machine crashes, its ZooKeeper session expires, and ZooKeeper notifies the other NameNodes that a failover should be triggered.

Active NameNode election
ZooKeeper provides a simple mechanism to exclusively select the active NameNode. If the active NameNode fails, a standby NameNode can take the special exclusive lock in ZooKeeper, indicating that it should become the next active NameNode.

Inside the ZKFC, the pieces work together as follows:
- After the HealthMonitor is initialized, it starts internal threads that periodically call the NameNode's HAServiceProtocol RPC interface to check the NameNode's health.
- If the HealthMonitor detects a change in the NameNode's health status, it calls back the corresponding method registered by the ZKFailoverController.
- If the ZKFailoverController decides that an active/standby switch is needed, it first uses the ActiveStandbyElector to run an automatic election.
- The ActiveStandbyElector interacts with ZooKeeper to complete the election.
- Once the election is complete, the ActiveStandbyElector calls back the corresponding method of the ZKFailoverController to tell the current NameNode whether it becomes the active or the standby NameNode.
- The ZKFailoverController calls the NameNode's HAServiceProtocol RPC interface to move the NameNode to the Active or Standby state.

Taking all of the above into account, the first logs to check are those of ZK and the NN:
/var/log/hadoop/hdfs
/var/log/zookeeper

My suspicion is that you have issues with the NameNode heartbeat: ZooKeeper fails to get the ping back in time, the NN is marked as dead, a new active NN is elected, and that keeps happening in a loop. So check those ZK logs, and ensure the time on all hosts is set correctly and in sync! Please report back.
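As a starting point for the log review above, a rough count of timeout messages per log file shows where to look first. This is a sketch with hypothetical helper names. It only searches for the literal string SocketTimeoutException and does not read compressed, rotated logs.

```shell
# Sketch only: helper names are hypothetical.

# Count the lines mentioning SocketTimeoutException in one log file.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_timeouts() {
  grep -c 'SocketTimeoutException' "$1" || true
}

# Print "<count> <file>" for every file under a log directory, busiest first.
timeouts_by_file() {
  local f
  find "$1" -type f | while read -r f; do
    printf '%s %s\n' "$(count_timeouts "$f")" "$f"
  done | sort -rn
}
```

For example, timeouts_by_file /var/log/hadoop/hdfs and timeouts_by_file /var/log/zookeeper on each NameNode and ZooKeeper host. Comparing the timestamps of the matching lines across hosts also helps reveal clock drift.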
07-26-2021
10:51 PM
@sipocootap2 Unfortunately, you cannot disallow snapshots on a snapshottable directory that already has snapshots! Yes, you will have to list and delete the snapshots first. Even if a snapshot contains subdirectories, you only pass the snapshot's root name to the hdfs dfs -deleteSnapshot command.

If you had
$ hdfs dfs -ls /app/tomtest/.snapshot
Found 2 items
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/sipo/work/john
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/tap2/work/peter

you would simply delete the snapshots like this:
$ hdfs dfs -deleteSnapshot /app/tomtest/ sipo
$ hdfs dfs -deleteSnapshot /app/tomtest/ tap2
07-26-2021
02:53 PM
1 Kudo
@sipocootap2 Here is a walkthrough on how to delete a snapshot.

Create a directory
$ hdfs dfs -mkdir -p /app/tomtest

Change the owner
$ hdfs dfs -chown -R tom:developer /app/tomtest

To be able to create a snapshot, the directory has to be snapshottable
$ hdfs dfsadmin -allowSnapshot /app/tomtest
Allowing snaphot on /app/tomtest succeeded

Now I created 3 snapshots
$ hdfs dfs -createSnapshot /app/tomtest sipo
Created snapshot /app/tomtest/.snapshot/sipo
$ hdfs dfs -createSnapshot /app/tomtest coo
Created snapshot /app/tomtest/.snapshot/coo
$ hdfs dfs -createSnapshot /app/tomtest tap2
Created snapshot /app/tomtest/.snapshot/tap2

Confirm the directory is snapshottable
$ hdfs lsSnapshottableDir
drwxr-xr-x 0 tom developer 0 2021-07-26 23:14 3 65536 /app/tomtest

List all the snapshots in the directory
$ hdfs dfs -ls /app/tomtest/.snapshot
Found 3 items
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/coo
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/sipo
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/tap2

Now I need to delete the snapshot coo
$ hdfs dfs -deleteSnapshot /app/tomtest/ coo

Confirm the snapshot is gone
$ hdfs dfs -ls /app/tomtest/.snapshot
Found 2 items
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/sipo
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/tap2

Voila!

To delete a snapshot, the format is hdfs dfs -deleteSnapshot <path> <snapshotName>, i.e. hdfs dfs -deleteSnapshot /app/tomtest/ coo. Notice the space and the omission of .snapshot. Like all dot files, the .snapshot directory is not visible with a normal hdfs listing.

The plain -ls command gives 0 results
$ hdfs dfs -ls /app/tomtest/

Listing the .snapshot path explicitly shows the 2 remaining snapshots
$ hdfs dfs -ls /app/tomtest/.snapshot
Found 2 items
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/sipo
drwxr-xr-x - tom developer 0 2021-07-26 23:14 /app/tomtest/.snapshot/tap2

Is there a command to disallow snapshots for all the subdirectories?
Yes, there is, but it only works after you have deleted all the snapshots in the directory, as the demo below shows. Better still, disallow snapshots at directory creation time.
$ hdfs dfsadmin -disallowSnapshot /app/tomtest/
disallowSnapshot: The directory /app/tomtest has snapshot(s). Please redo the operation after removing all the snapshots.

The only approach I have found that works for me, and leaves me time for a cup of coffee, is to first list all the snapshots and then copy-paste the delete commands. It works even with 60 snapshots: I only come back when the snapshots are gone, or I do something else while the deletion is running. It is not automated, though. The commands below then run one after another without further input:
hdfs dfs -deleteSnapshot /app/tomtest/ sipo
.....
....
hdfs dfs -deleteSnapshot /app/tomtest/ tap2

Note that -deleteSnapshot skips the trash by default! Happy hadooping
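The list-then-delete routine can also be scripted. What follows is a minimal sketch, not a tested production tool: the function names are hypothetical, it assumes the hdfs dfs -ls output format shown in this post, and it assumes snapshot names contain no spaces. Since -deleteSnapshot skips the trash, print the names first and check them before deleting anything.

```shell
# Sketch only: helper names are hypothetical.

# Read `hdfs dfs -ls <dir>/.snapshot` output on stdin and print one
# snapshot name per line (the last path component of each entry).
snapshot_names() {
  awk '/\/\.snapshot\// { n = split($NF, parts, "/"); print parts[n] }'
}

# Delete every snapshot of a snapshottable directory, one after another.
delete_all_snapshots() {
  local dir="${1%/}" name
  hdfs dfs -ls "$dir/.snapshot" | snapshot_names | while read -r name; do
    echo "Deleting snapshot $name from $dir"
    # </dev/null keeps the command from consuming the loop's input.
    hdfs dfs -deleteSnapshot "$dir" "$name" </dev/null
  done
}
```

A dry run is hdfs dfs -ls /app/tomtest/.snapshot | snapshot_names; once the list looks right, delete_all_snapshots /app/tomtest followed by hdfs dfsadmin -disallowSnapshot /app/tomtest should go through.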
07-26-2021
01:20 PM
@enirys As suggested, we need more details, and there is no silver bullet. A piece of advice from experience: it's better to open a new thread and give as many details as possible:
- OS
- HDP version
- Ambari version
- MIT or AD Kerberos
- Documented steps or official document reference
- Your Kerberos config: krb5.conf, kdc.conf, kadm5.acl
- Hosts files
- Number of nodes [single or multi node]

In short, any information that cuts down on the back-and-forth and gives members what they need to help. Cheers
07-26-2021
02:27 AM
@ambari275 Great! Please accept the answer so the thread can be closed and referenced by other users. Happy hadooping!
07-26-2021
01:24 AM
@ambari275 These are the steps to follow, see below.

Assumptions
- clustername = test
- REALM = DOMAIN.COM
- Hostname = host1
- logged in as root
[root@host1]#

Switch to the hdfs user, the HDFS superuser
[root@host1]# su - hdfs

Check the generated HDFS keytab
[hdfs@host1 ~]$ cd /etc/security/keytabs/
[hdfs@host1 keytabs]$ ls

Sample output
atlas.service.keytab hdfs.headless.keytab knox.service.keytab oozie.service.keytab

Now use the hdfs.headless.keytab to get the associated principal
[hdfs@host1 keytabs]$ klist -kt /etc/security/keytabs/hdfs.headless.keytab

Expected output
Keytab name: FILE:/etc/security/keytabs/hdfs.headless.keytab
KVNO Timestamp Principal
---- ------------------- ------------------------------------------------------
1 07/26/2021 00:34:03 hdfs-test@DOMAIN.COM
1 07/26/2021 00:34:03 hdfs-test@DOMAIN.COM
1 07/26/2021 00:34:03 hdfs-test@DOMAIN.COM
1 07/26/2021 00:34:03 hdfs-test@DOMAIN.COM
1 07/26/2021 00:34:03 hdfs-test@DOMAIN.COM

Grab a Kerberos ticket by using the keytab plus principal, like a username/password pair, to authenticate to the KDC
[hdfs@host1 keytabs]$ kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs-test@DOMAIN.COM

Check that you now have a valid Kerberos ticket
[hdfs@host1 keytabs]$ klist

Sample output
Ticket cache: FILE:/tmp/krb5cc_1013
Default principal: hdfs-test@DOMAIN.COM
Valid starting Expires Service principal
07/26/2021 10:03:17 07/27/2021 10:03:17 krbtgt/DOMAIN.COM@DOMAIN.COM

Now you can successfully list the HDFS directories. Remember the -ls; it seems you forgot it in your earlier command
[hdfs@host1 keytabs]$ hdfs dfs -ls /
Found 9 items
drwxrwxrwx - yarn hadoop 0 2018-09-24 00:31 /app-logs
drwxr-xr-x - hdfs hdfs 0 2018-09-24 00:22 /apps
drwxr-xr-x - yarn hadoop 0 2018-09-24 00:12 /ats
drwxr-xr-x - hdfs hdfs 0 2018-09-24 00:12 /hdp
drwxr-xr-x - mapred hdfs 0 2018-09-24 00:12 /mapred
drwxrwxrwx - mapred hadoop 0 2018-09-24 00:12 /mr-history
drwxrwxrwx - spark hadoop 0 2021-07-26 10:04 /spark2-history
drwxrwxrwx - hdfs hdfs 0 2021-07-26 00:57 /tmp
drwxr-xr-x - hdfs hdfs 0 2018-09-24 00:23 /user

Voila! Happy hadooping, and remember to accept the best response so other users can reference it.
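The klist and kinit steps above can be combined so the principal does not have to be copied by hand. This is a sketch under assumptions: the function names are hypothetical, and the parsing assumes the klist -kt layout shown in this post, where each entry line starts with a numeric KVNO and ends with the principal.

```shell
# Sketch only: helper names are hypothetical.

# Read `klist -kt <keytab>` output on stdin and print the first principal.
# Header lines are skipped because their first field is not a number.
first_principal() {
  awk '$1 ~ /^[0-9]+$/ { print $NF; exit }'
}

# Obtain a ticket for the first principal stored in a keytab.
kinit_from_keytab() {
  local keytab="$1" principal
  principal=$(klist -kt "$keytab" | first_principal)
  if [ -z "$principal" ]; then
    echo "no principal found in $keytab" >&2
    return 1
  fi
  kinit -kt "$keytab" "$principal"
}
```

For example, kinit_from_keytab /etc/security/keytabs/hdfs.headless.keytab run as the hdfs user, followed by klist to confirm the ticket. A keytab holding several different principals would need a more selective filter than "first entry".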
07-25-2021
02:15 PM
@ambari275 I have gone through the logs and here are my observations.

Error:
WARNING: A HTTP GET method, public javax.ws.rs.core.Response org.apache.ambari.server.api.services.ExtensionsService.getExtensionVersions(java.lang.String,javax.ws.rs.core.HttpHeaders,javax.ws.rs.core.UriInfo,java.lang.String), should not consume any entity.

Solution:
Check the current value of client.threadpool.size.max
# grep client.threadpool.size.max /etc/ambari-server/conf/ambari.properties
client.threadpool.size.max=25

The client.threadpool.size.max property sets the number of parallel threads servicing client requests. To find the number of cores on the server, issue the Linux command nproc
# nproc
25

1) Edit the /etc/ambari-server/conf/ambari.properties file and set client.threadpool.size.max to the number of cores on your machine.
client.threadpool.size.max=25
2) Restart ambari-server
# ambari-server restart

Error:
2021-07-23 12:43:42,673 WARN [Stack Version Loading Thread] RepoVdfCallable:142 - Could not load version definition for HDP-3.0 identified by https://archive.cloudera.com/p/HDP/centos7/3.x/3.0.1.0/HDP-3.0.1.0-187.xml. Server returned HTTP response code: 401 for URL: https://archive.cloudera.com/p/HDP/centos7/3.x/3.0.1.0/HDP-3.0.1.0-187.xml
java.io.IOException: Server returned HTTP response code: 401 for URL: https://archive.cloudera.com/p/HDP/centos7/3.x/3.0.1.0/HDP-3.0.1.0-187.xml

Reason:
401 means "Unauthorized", so something is wrong with your credentials; this is purely an authorization issue. It seems your access to the HDP repos is the problem.

Your krb5.conf should look something like this
# cat /etc/krb5.conf
# Configuration snippets may be placed in this directory as well
includedir /etc/krb5.conf.d/
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
dns_lookup_realm = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
rdns = false
default_realm = DOMAIN.COM
default_ccache_name = KEYRING:persistent:%{uid}
[realms]
DOMAIN.COM = {
kdc = [FQDN 10.1.1.150]
admin_server = [FQDN 10.1.1.150]
}
[domain_realm]
.domain.com = DOMAIN.COM
domain.com = DOMAIN.COM

Your /etc/hosts
I remember once having issues with hostnames containing a hyphen (-), so try using names like host1 instead of ESXI-host2. Also, please don't comment out the IPv6 entry, as that can cause network connectivity issues; remove the # on the second line.
x.x.x.x localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
x.x.x.x FQDN server
x.x.x.x host1
x.x.x.x host2
x.x.x.x host3

The Kerberos service uses DNS to resolve hostnames. Therefore, DNS must be enabled on all hosts. With DNS, the principal must contain the fully qualified domain name (FQDN) of each host. For example, if the hostname is host1, the DNS domain name is domain.com, and the realm name is DOMAIN.COM, then the principal name for the host would be host/host1.domain.com@DOMAIN.COM. The examples in this guide require that DNS is configured and that the FQDN is used for each host.

Also, ensure the Ambari agent is installed on all hosts, including the Ambari server host! On every host, the hostname entry in the [server] section of /etc/ambari-agent/conf/ambari-agent.ini must point to the Ambari server
[server]
hostname=<FQDN_oF_Ambari_server>
url_port=8440
secured_url_port=8441
connect_retry_delay=10
max_reconnect_retry_delay=30

Please report back.
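For the client.threadpool.size.max check earlier in this post, the comparison with nproc can be scripted. This is a minimal sketch with hypothetical function names; it assumes simple key=value lines with no spaces around the equals sign, which is how the property appears in the grep output in this post.

```shell
# Sketch only: helper names are hypothetical.

# Print the value of a key from a key=value properties file.
get_prop() {
  awk -F= -v k="$2" '$1 == k { sub(/^[^=]*=/, ""); print; exit }' "$1"
}

# Compare client.threadpool.size.max with the number of cores on this host.
check_threadpool() {
  local props="${1:-/etc/ambari-server/conf/ambari.properties}"
  local size cores
  size=$(get_prop "$props" client.threadpool.size.max)
  cores=$(nproc)
  if [ "$size" = "$cores" ]; then
    echo "client.threadpool.size.max=$size matches $cores cores"
  else
    echo "client.threadpool.size.max=${size:-unset} but this host has $cores cores"
  fi
}
```

Run check_threadpool on the Ambari server host. If it reports a mismatch, edit the property and restart ambari-server as described in the steps of this post.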
07-23-2021
10:57 AM
@ambari275 You can set up the Kerberos server anywhere on the network, provided it can be accessed by the hosts in your cluster. I suspect there is something wrong with your Ambari server. Can you share your /var/log/ambari-server/ambari-server.log? I asked for a couple of files, but you only shared the krb5.conf. I will need the rest of the files to understand and determine what the issue could be. Can you describe your setup? Number of nodes, network, OS, etc.