Member since: 08-08-2017
Posts: 1649
Kudos Received: 28
Solutions: 11
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1147 | 06-15-2020 05:23 AM
 | 6327 | 01-30-2020 08:04 PM
 | 1331 | 07-07-2019 09:06 PM
 | 5708 | 01-27-2018 10:17 PM
 | 3082 | 12-31-2017 10:12 PM
02-21-2024
03:30 AM
1 Kudo
We have a Hadoop cluster with active/standby ResourceManager services: the active ResourceManager is on the master1 machine and the standby ResourceManager is on the master2 machine. In our cluster, the YARN service that includes both ResourceManager services manages 276 NodeManager components on the worker machines.

From the Ambari web UI alerts (Alerts for Resource Manager) we noticed the following:

```
Resource Manager Web UI
Connection failed to http://master2.jupiter.com:8088 (timed out)
```

We started to debug the issue with wget against port 8088, and found that the process hangs on `HTTP request sent, awaiting response... No data received`.

Example from the ResourceManager machine:

```
wget --debug http://master2.jupiter.com:8088
DEBUG output created by Wget 1.14 on Linux-gnu.
URI encoding = ‘UTF-8’
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:13:42--  http://master2.jupiter.com:8088/
Resolving master2.jupiter.com (master2.jupiter.com)... 192.9.201.169
Caching master2.jupiter.com => 192.9.201.169
Connecting to master2.jupiter.com (master2.jupiter.com)|192.9.201.169|:8088... connected.
Created socket 3.
Releasing 0x0000000000a0da00 (new refcount 1).

---request begin---
GET / HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: master2.jupiter.com:8088
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Wed, 21 Feb 2024 10:13:42 GMT
Date: Wed, 21 Feb 2024 10:13:42 GMT
Pragma: no-cache
Expires: Wed, 21 Feb 2024 10:13:42 GMT
Date: Wed, 21 Feb 2024 10:13:42 GMT
Pragma: no-cache
Content-Type: text/plain; charset=UTF-8
X-Frame-Options: SAMEORIGIN
Location: http://master1.jupiter.com:8088/
Content-Length: 43
Server: Jetty(6.1.26.hwx)

---response end---
307 TEMPORARY_REDIRECT
Registered socket 3 for persistent reuse.
URI content encoding = ‘UTF-8’
Location: http://master1.jupiter.com:8088/ [following]
Skipping 43 bytes of body: [This is standby RM. The redirect url is: /
] done.
URI content encoding = None
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:13:42--  http://master1.jupiter.com:8088/
conaddr is: 192.9.201.169
Resolving master1.jupiter.com (master1.jupiter.com)... 192.9.66.14
Caching master1.jupiter.com => 192.9.66.14
Releasing 0x0000000000a0f320 (new refcount 1).
Found master1.jupiter.com in host_name_addresses_map (0xa0f320)
Connecting to master1.jupiter.com (master1.jupiter.com)|192.9.66.14|:8088... connected.
Created socket 4.
Releasing 0x0000000000a0f320 (new refcount 1).
.
.
.
---response end---
302 Found
Disabling further reuse of socket 3.
Closed fd 3
Registered socket 4 for persistent reuse.
URI content encoding = ‘UTF-8’
Location: http://master1.jupiter.com:8088/cluster [following]
] done.
URI content encoding = None
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:27:07--  http://master1.jupiter.com:8088/cluster
Reusing existing connection to master1.jupiter.com:8088.
Reusing fd 4.

---request begin---
GET /cluster HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: master1.jupiter.com:8088
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Wed, 21 Feb 2024 10:30:23 GMT
Date: Wed, 21 Feb 2024 10:30:23 GMT
Pragma: no-cache
Expires: Wed, 21 Feb 2024 10:30:23 GMT
Date: Wed, 21 Feb 2024 10:30:23 GMT
Pragma: no-cache
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked
Server: Jetty(6.1.26.hwx)

---response end---
200 OK
URI content encoding = ‘utf-8’
Length: unspecified [text/html]
Saving to: ‘index.html’

[ <=> ] 1,018,917  --.-K/s   in 0.04s

2024-02-21 10:31:31 (24.0 MB/s) - ‘index.html’ saved [1018917]
```

As we can see above, wget completed only after a very long time, around ~20 minutes, instead of finishing in one or two seconds.

We could take a tcpdump, e.g. `tcpdump -vv -s0 tcp port 8088 -w /tmp/why_8088_hang.pcap`, but I want to understand whether there are simpler ways to see why we get stuck at `HTTP request sent, awaiting response...`, and whether it may be related to the ResourceManager service.
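A lighter-weight option than a full tcpdump might be curl's per-phase timers with a hard timeout: a hang at `awaiting response...` then shows up as a large time-to-first-byte next to a small connect time. This is only a sketch; the RM URL in the comment is the one from the post:

```shell
# Probe a URL with a hard timeout and print where the time went.
# A stall at "awaiting response" appears as ttfb >> connect; a redirect
# to the other RM shows up in final_url.
probe_rm() {
  curl -sS -o /dev/null --max-time "${2:-10}" \
    -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s final_url=%{url_effective}\n' \
    "$1"
}
# Example (host from the post):
#   probe_rm http://master2.jupiter.com:8088
```

Checking which RM is actually active (`yarn rmadmin -getServiceState rm1`, where `rm1` is your configured RM id) alongside this can also tell you whether the slow side is the standby doing the 307 redirect.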
Labels:
- Apache YARN
02-15-2024
09:02 AM
1 Kudo
We have an HDP cluster with 152 worker machines - `worker1.duplex.com` .. `worker152.duplex.com` - all installed with RHEL 7.9. We are trying to delete the last host - `worker152.duplex.com` - from the Ambari server, or actually from the PostgreSQL DB, as follows.

First we need to find the `host_id`:

```
select host_id from hosts where host_name='worker152.duplex.com';
 host_id
---------
      51
(1 row)
```

Now we delete this `host_id` - 51:

```
delete from execution_command where task_id in (select task_id from host_role_command where host_id in (51));
delete from host_version where host_id in (51);
delete from host_role_command where host_id in (51);
delete from serviceconfighosts where host_id in (51);
delete from hoststate where host_id in (51);
delete from kerberos_principal_host WHERE host_id='worker152.duplex.com';
delete from hosts where host_name in ('worker152.duplex.com');
delete from alert_current where history_id in ( select alert_id from alert_history where host_name in ('worker152.duplex.com'));
```

Now we verify that `host_id` 51, which represented the host `worker152.duplex.com`, no longer exists:

```
ambari=> select host_name, public_host_name from hosts;
        host_name         |     public_host_name
--------------------------+--------------------------
 worker1.duplex.com
 .
 .
 .
 worker151.duplex.com
```

As we can see above, the host `worker152.duplex.com` no longer appears, and that's fine - it indeed seems that `worker152.duplex.com` was deleted from the PostgreSQL DB.

Now we restart the `Ambari-server` for the change to take effect (this also restarts the PostgreSQL service):

```
ambari-server restart
Using python  /usr/bin/python
Restarting ambari-server
Waiting for server stop...
Ambari Server stopped
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start.........................
Server started listening on 8080
DB configs consistency check: no errors and warnings were found.
```

After the Ambari server started, we were surprised to find that `host_id` 51, i.e. host `worker152.duplex.com`, still exists:

```
ambari=> select host_name, public_host_name from hosts;
        host_name         |     public_host_name
--------------------------+--------------------------
 worker1.duplex.com
 .
 .
 .
 worker152.duplex.com
```

We do not understand why this host came back even though we deleted its records.

We also tried to delete historical data as follows, but this did not help:

```
ambari-server db-purge-history --cluster-name hadoop7 --from-date 2024-01-01
Using python  /usr/bin/python
Purge database history...
Ambari Server configured for Embedded Postgres. Confirm you have made a backup of the Ambari Server database [y/n]yes
ERROR: The database purge historical data cannot proceed while Ambari Server is running. Please shut down Ambari first.
Ambari Server 'db-purge-history' completed successfully.
```

1. Why did the host return after the `Ambari-server` restart?
2. What is wrong with our deletion process?

PostgreSQL version:

```
postgres=# SHOW server_version;
 server_version
----------------
 9.2.24
(1 row)
```

links:
https://www.andruffsolutions.com/removing-old-host-data-from-ambari-server-and-tuning-the-database/
https://community.cloudera.com/t5/Support-Questions/how-to-remove-old-registered-hosts-from-DB/m-p/217524/highlight/true
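One thing worth checking: if the ambari-agent on worker152 is still running, it will simply re-register the host on its next heartbeat, which would explain the row reappearing regardless of the SQL deletes. A hedged alternative to raw SQL is Ambari's REST API, which deletes the host through Ambari itself. The sketch below is only illustrative: the cluster name `hadoop7` comes from the post, while the URL, user and password are placeholders you would adapt; `DRY_RUN=1` only prints the request.

```shell
# Hypothetical helper: delete a host via the Ambari REST API instead of
# editing PostgreSQL directly. Stop/delete the host's components first,
# and stop the ambari-agent on the host so it cannot re-register.
ambari_delete_host() {
  host="$1"
  url="http://localhost:8080/api/v1/clusters/hadoop7/hosts/${host}"
  echo "DELETE ${url}"                 # show the request being made
  [ -n "$DRY_RUN" ] && return 0        # dry run: skip the real call
  curl -sS -u admin:admin -H 'X-Requested-By: ambari' -X DELETE "$url"
}
# Usage: ambari_delete_host worker152.duplex.com
```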
Labels:
- Hortonworks Data Platform (HDP)
02-04-2024
10:59 AM
1 Kudo
You can balance the DataNode disk usage by decommissioning and recommissioning a DataNode, but if you have only 2 DataNodes that is a problem: it is better to do this with at least 3 DataNodes in the cluster, so replicas can be re-created elsewhere while one node is out.
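A sketch of that decommission/recommission cycle, assuming the usual exclude-file mechanism: the exclude file path below is an example and must match whatever `dfs.hosts.exclude` points to in your `hdfs-site.xml`, and the hostname is a placeholder. `DRY_RUN=1` only edits the file and prints the refresh command instead of running it.

```shell
# Mark a DataNode for decommission by adding it to the HDFS exclude file
# and refreshing the NameNode's node lists.
decommission_node() {
  node="$1"
  exclude="${2:-/etc/hadoop/conf/dfs.exclude}"   # example path
  grep -qx "$node" "$exclude" 2>/dev/null || echo "$node" >> "$exclude"
  if [ -n "$DRY_RUN" ]; then
    echo "would run: hdfs dfsadmin -refreshNodes"
  else
    sudo -u hdfs hdfs dfsadmin -refreshNodes
  fi
}
# After the node shows "Decommissioned" in `hdfs dfsadmin -report`,
# remove it from the exclude file and run -refreshNodes again to
# recommission it; HDFS then refills it with new block replicas.
```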
02-04-2024
10:43 AM
1 Kudo
Let's say I copy the fsimage from the active to the standby NameNode and we still have a problem starting the NameNode - can I then do the steps as already mentioned?
02-03-2024
02:20 PM
1 Kudo
We have an HDP Hadoop cluster with two NameNode services (one active NameNode and one standby NameNode). Due to an unexpected electricity failure, the standby NameNode fails to start with the following exception, while the active NameNode starts successfully:

```
2024-02-02 08:47:11,497 INFO common.Storage (Storage.java:tryLock(776)) - Lock on /hadoop/hdfs/namenode/in_use.lock acquired by nodename 36146@master1.delax.com
2024-02-02 08:47:11,891 INFO namenode.FSImage (FSImage.java:loadFSImageFile(745)) - Planning to load image: FSImageFile(file=/hadoop/hdfs/namenode/current/fsimage_0000000052670667141, cpktTxId=0000000052670667141)
2024-02-02 08:47:11,897 ERROR namenode.FSImage (FSImage.java:loadFSImage(693)) - Failed to load image from FSImageFile(file=/hadoop/hdfs/namenode/current/fsimage_0000000052670667141, cpktTxId=0000000052670667141)
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:204)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:221)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:898)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:882)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:755)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:686)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:303)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1077)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:697)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:761)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1001)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:985)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1710)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1778)
2024-02-02 08:47:12,238 WARN namenode.FSNamesystem (FSNamesystem.java:loadFromDisk(726)) - Encountered exception loading fsimage
java.io.IOException: Failed to load FSImage file, see error(s) above for more info.
```

We can see `Failed to load image from FSImageFile` in the exception above, and it seems to be the result of the unexpected machine shutdown.

As I understand, one of the options to recover the standby NameNode could be the following procedure:

1. Put the active NN in safe mode:
   `sudo -u hdfs hdfs dfsadmin -safemode enter`
2. Do a saveNamespace operation on the active NN:
   `sudo -u hdfs hdfs dfsadmin -saveNamespace`
3. Leave safe mode:
   `sudo -u hdfs hdfs dfsadmin -safemode leave`
4. Log in to the standby NN.
5. Run the command below on the standby NameNode to fetch the latest fsimage that we saved in the steps above:
   `sudo -u hdfs hdfs namenode -bootstrapStandby -force`

We would be glad to receive any suggestions, or to hear whether the procedure above is good enough for our problem.
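Before rebuilding, it may be worth confirming that the image really is truncated rather than just failing for another reason. The NameNode writes an MD5 sidecar file next to each fsimage; assuming the sidecar uses the usual `<digest> *<name>` format that `md5sum -c` accepts, a truncated image shows up as a checksum mismatch. The path in the example comment is the one from the log above:

```shell
# Verify an fsimage against the .md5 sidecar the NameNode stored next to it.
# A "Premature EOF" on load plus a checksum mismatch here points to a
# truncated image from the power failure.
check_fsimage() {
  img="$1"
  ( cd "$(dirname "$img")" && md5sum -c "$(basename "$img").md5" )
}
# e.g. check_fsimage /hadoop/hdfs/namenode/current/fsimage_0000000052670667141
```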
Labels:
- HDFS
- Hortonworks Data Platform (HDP)
02-03-2024
02:17 PM
1 Kudo
Can the following procedure also help?

1. Put the active NN in safe mode:
   `sudo -u hdfs hdfs dfsadmin -safemode enter`
2. Do a saveNamespace operation on the active NN:
   `sudo -u hdfs hdfs dfsadmin -saveNamespace`
3. Leave safe mode:
   `sudo -u hdfs hdfs dfsadmin -safemode leave`
4. Log in to the standby NN.
5. Run the command below on the standby NameNode to fetch the latest fsimage that we saved in the steps above:
   `sudo -u hdfs hdfs namenode -bootstrapStandby -force`
02-22-2023
08:39 AM
We have an HDP cluster, version 2.6.5. When we look at the NameNode logs we can see the following warnings:

```
2023-02-20 15:58:31,377 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction
2023-02-20 16:00:39,037 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction
2023-02-20 16:01:43,962 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193594954980-193594954980 took 1329ms
2023-02-20 16:02:47,129 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595018764-193595018764 took 1321ms
2023-02-20 16:03:52,763 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595106645-193595106646 took 1344ms
2023-02-20 16:04:56,276 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595175233-193595175233 took 1678ms
2023-02-20 16:06:01,067 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595252052-193595252052 took 1265ms
2023-02-20 16:07:06,447 WARN server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595320796-193595320796 took 1273ms
```

In our HDP cluster, the HDFS service includes 2 NameNode services and 3 JournalNodes. The cluster includes 736 DataNode machines, and the HDFS service manages all of the DataNodes.

We want to understand the reason for the following warning, and how to avoid these messages with a proactive solution:

```
server.Journal (Journal.java:journal(398)) - Sync of transaction range 193595018764-193595018764 took 1321ms
```
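This warning reports how long a journal sync (an fsync of the edit log on the JournalNode's disk) took, so a cheap first check is raw synchronous-write latency on that disk. The sketch below is only illustrative: the default directory is a guess, and you would point it at the JournalNode edits directory configured in `dfs.journalnode.edits.dir`.

```shell
# Measure synchronous write throughput on a directory; oflag=dsync forces
# a synchronous write per 512-byte block, roughly the pattern the
# JournalNode's edit-log sync produces.
jn_fsync_probe() {
  dir="${1:-/hadoop/hdfs/journal}"   # hypothetical edits dir
  dd if=/dev/zero of="$dir/.latency_probe" bs=512 count=200 oflag=dsync 2>&1 | tail -1
  rm -f "$dir/.latency_probe"
}
# e.g. jn_fsync_probe /hadoop/hdfs/journal
```

If this reports very low throughput, the disk behind the JournalNode is likely busy or slow, which would be consistent with syncs in the 1.2-1.7 second range; dedicating a quiet disk to the JN edits directory is the usual proactive fix.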
Labels:
- Ambari Blueprints
02-22-2023
08:30 AM
For now we have 15 Kafka machines in the cluster; all machines are installed with RHEL 7.9 and the hardware is Dell physical machines. The Kafka version is 2.7, and we have 3 ZooKeeper servers that serve the Kafka cluster. We decided to extend the Kafka cluster to ~100 machines, because the total throughput in megabytes increased dramatically - note that according to the Kafka Confluent calculator we need around 100 Kafka machines in that case. I wonder whether our 3 ZooKeeper servers are enough to serve this huge cluster. In addition, I want to mention that our 3 ZooKeeper servers already serve other applications such as HDFS, YARN, Hive, Spark, etc.
Labels:
- Apache Zookeeper
01-24-2023
09:23 AM
We have an HDP cluster, version 2.6.5, with the Ambari platform. Here is an example from our Ambari lab cluster with 5 NodeManager machines. Regarding the YARN service: is it possible to add in Ambari a widget that can show the CPU core consumption? If not, what are the other ways to find the cores consumed by YARN from the CLI? Another way that we found is the `resource_manager:8088/cluster` page, as shown above, so is it possible to find some API / CLI that can capture the "VCores Used" value?
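One real option is the ResourceManager REST API: `GET /ws/v1/cluster/metrics` returns cluster-wide counters including `allocatedVirtualCores` and `totalVirtualCores`, which is the same data behind "VCores Used" on the 8088 page. The sketch below separates the parsing so it can be tested on canned JSON; the extraction is deliberately crude text matching (a JSON tool like `jq` would be cleaner if available), and the RM hostname in the usage comment is a placeholder.

```shell
# Pull the two vcore counters out of the ClusterMetrics JSON on stdin.
# Field names come from the YARN ResourceManager REST API.
parse_vcores() {
  tr ',{' '\n\n' | grep -E '(allocatedVirtualCores|totalVirtualCores)' | tr -d '"}'
}
# Live usage against the RM web port from the post:
#   curl -sS http://resource_manager:8088/ws/v1/cluster/metrics | parse_vcores
```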
Labels:
- Ambari Blueprints
01-24-2023
08:41 AM
We have a Spark production cluster with the YARN service (based on HDP version 2.6.5). The total number of NodeManager services is 745 (actually 745 Linux machines), and the YARN active ResourceManager and standby ResourceManager are installed on different master machines.

We found that the following parameters are not defined in our YARN configuration (`yarn-site.xml`):

```
yarn.scheduler.increment-allocation-vcores
yarn.scheduler.increment-allocation-mb
```

These parameters are defined neither in Ambari nor in the YARN XML configuration files. I want to know the meaning of the parameter `yarn.scheduler.increment-allocation-vcores`, and what the effect is if these parameters are not defined in our configuration. From YARN best-practice configuration we understand that both parameters are part of the YARN configuration, but we are not sure whether we must add them to the YARN custom configuration.

From the documentation we found:

> Minimum and maximum allocation unit in YARN
> Two resources - memory and CPU, as of Hadoop 2.5.1 - have minimum and maximum allocation units in YARN, as set by the configurations in yarn-site.xml. Basically, it means the RM can only allocate memory to containers in increments of "yarn.scheduler.minimum-allocation-mb" and not exceed "yarn.scheduler.maximum-allocation-mb", and it can only allocate CPU vcores to containers in increments of "yarn.scheduler.minimum-allocation-vcores" and not exceed "yarn.scheduler.maximum-allocation-vcores". If changes are required, set the above configurations in yarn-site.xml on the RM nodes, and restart the RM.

references:
https://docs.trifacta.com/display/r076/Tune+Cluster+Performance
https://stackoverflow.com/questions/58522138/how-to-control-yarn-container-allocation-increment-properly
https://pratikbarjatya.github.io/learning/best-practices-for-yarn-resource-management/
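For reference, in stock Hadoop the two `increment-allocation` properties are Fair Scheduler settings: they control the step size the scheduler uses when rounding up container resource requests, and they have built-in defaults, so leaving them out of `yarn-site.xml` is not an error in itself. If you do decide to set them explicitly, they go into `yarn-site.xml` like any other property; the values below are illustrative examples only, not recommendations for this cluster:

```xml
<!-- Illustrative values only: the increments used when rounding up
     container memory and vcore requests -->
<property>
  <name>yarn.scheduler.increment-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.scheduler.increment-allocation-vcores</name>
  <value>1</value>
</property>
```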
Labels:
- Ambari Blueprints