Member since 02-08-2016 · 793 Posts · 669 Kudos Received · 85 Solutions
11-18-2016 09:57 AM · 6 Kudos
PROBLEM STATEMENT: After running python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py on the Ambari Metrics server, every service comes back up except ambari-metrics-collector.
The process is running, but two alerts remain:
Metrics Collector Process - Connection failed: [Errno 111] Connection refused to XXXXX:6188
Metrics Collector - HBase Master Process - Connection failed: [Errno 111] Connection refused to XXXXXX:61310
ERROR: 0x15741734f740003, negotiated timeout = 120000
07:57:42,262 INFO [main] ZooKeeperRegistry:107 - ClusterId read in ZooKeeper is null
07:57:42,341 WARN [main] HeapMemorySizeUtil:55 - hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
07:58:13,170 INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740001, likely server has closed socket, closing socket connection and attempting reconnect
07:58:13,170 INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740003, likely server has closed socket, closing socket connection and attempting reconnect
07:58:14,381 INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,382 WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
07:58:14,961 INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,961 WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
SYMPTOM:
Java garbage collection (GC) pauses occur frequently for the AMS HBase master process in the same time frame in which these errors are observed.
To verify:
1. Review /var/log/ambari-metrics-collector/hbase-ams-master-<hostname>.log
2. Check whether messages like the following are printed often and XXXms is larger than a few hundred ms:
"[JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately XXXms"
3. Also review gc.log-YYYYMMDDHHmm (default location /var/log/ambari-metrics-collector/) to find out which Java memory area is causing the slowness.
RESOLUTION: Increase the AMS HBase heap sizes as follows:
1. Identify the current heap size by checking the Java process settings (-Xmx, -Xmn, -XX:MaxPermSize) by running:
ps auxwww | grep 'org.apache.hadoop.hbase.master.HMaster start'
2. Check the free memory on the system by running:
free -t
If the server has enough free memory, increase hbase_master_heapsize and/or the following, based on the GC type identified from gc.log:
1. hbase_master_maxperm_size
2. hbase_master_xmn_size
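To spot the long pauses quickly, the log can be scanned with a short script. This is a sketch assuming the JvmPauseMonitor message format quoted above (the sample lines are synthetic; file handling is omitted):

```python
import re

# Matches the JvmPauseMonitor warning and captures the pause duration in ms.
PAUSE_RE = re.compile(r"Detected pause in JVM or host machine \(eg GC\): "
                      r"pause of approximately (\d+)ms")

def long_pauses(lines, threshold_ms=300):
    """Return pause durations (ms) that exceed threshold_ms."""
    pauses = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            ms = int(m.group(1))
            if ms > threshold_ms:
                pauses.append(ms)
    return pauses

# Synthetic log lines for illustration:
sample = [
    "INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM "
    "or host machine (eg GC): pause of approximately 1423ms",
    "INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM "
    "or host machine (eg GC): pause of approximately 120ms",
]
print(long_pauses(sample))  # [1423] -- only the long pause crosses the threshold
```

Frequent results in the hundreds of milliseconds or more are the signal to look at the heap settings described above.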
11-17-2016 02:54 PM · 6 Kudos
Problem Statement: Created a user in Ranger. After some time the user is no longer visible in the Ranger UI, but the user is present in the Ranger DB table x_user, and in the usersync logs we see the user being synchronized repeatedly. The user was an LDAP user and there was no issue with the other users. ERROR: "testuser" is not displayed in the Ranger UI but is reflected in the Ranger DB.
ROOT CAUSE: The role-mapping row for this particular user in the database appears to have been corrupted or lost. RESOLUTION: Inserting the following row into the table "x_portal_user_role" resolved the issue:
INSERT INTO x_portal_user_role VALUES(NULL,'2016-09-09 00:00:00','2016-09-09 00:00:00',1,1,(SELECT id FROM x_portal_user WHERE login_id='XXXX'),'ROLE_USER',1);
### NOTE: Replace XXXX with the login_id (the username used to log in to the Ranger portal) of the affected user.
You can replace 'ROLE_USER' with 'ROLE_SYS_ADMIN' if you want it to be an admin
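The broken state can be illustrated with a toy sqlite reproduction (real Ranger runs on MySQL/Postgres with a much wider schema; the two-column tables here are made up for the demonstration): a user present in x_portal_user with no x_portal_user_role row is exactly what hides the user from the UI.

```python
import sqlite3

# Minimal stand-ins for the Ranger tables (illustrative schema only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE x_portal_user (id INTEGER PRIMARY KEY, login_id TEXT)")
con.execute("CREATE TABLE x_portal_user_role (user_id INTEGER, user_role TEXT)")

con.execute("INSERT INTO x_portal_user VALUES (1, 'admin'), (2, 'testuser')")
con.execute("INSERT INTO x_portal_user_role VALUES (1, 'ROLE_SYS_ADMIN')")
# Note: no role row for 'testuser' -- the broken state from the post.

def users_missing_role(con):
    """Users that exist in x_portal_user but have no x_portal_user_role row."""
    rows = con.execute("""
        SELECT u.login_id FROM x_portal_user u
        LEFT JOIN x_portal_user_role r ON r.user_id = u.id
        WHERE r.user_id IS NULL""").fetchall()
    return [login for (login,) in rows]

print(users_missing_role(con))  # ['testuser']

# The fix from the post, in miniature: insert the missing role row.
con.execute("""INSERT INTO x_portal_user_role
               SELECT id, 'ROLE_USER' FROM x_portal_user
               WHERE login_id='testuser'""")
print(users_missing_role(con))  # []
```

The LEFT JOIN / IS NULL query is also a quick way to find any other users in the same state before restarting usersync.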
11-17-2016 02:39 PM · 7 Kudos
Problem Statement: Downgrading HDP failed on restarting the NameNode service, stuck on the error below - resource_management.core.exceptions.Fail: The NameNode None is not listed as Active or Standby, waiting... ERROR: ===
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 420, in <module>
NameNode().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
method(env)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 720, in restart
self.start(env, upgrade_type=upgrade_type)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
upgrade_suspended=params.upgrade_suspended, env=env)
File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
return fn(*args, **kwargs)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 185, in namenode
if is_this_namenode_active() is False:
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py", line 55, in wrapper
return function(*args, **kwargs)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 555, in is_this_namenode_active
raise Fail(format("The NameNode {namenode_id} is not listed as Active or Standby, waiting..."))
resource_management.core.exceptions.Fail: The NameNode None is not listed as Active or Standby, waiting...
===
Root Cause: On restart, the NameNode scripts were not able to populate namenode_id, which is required to detect whether the NameNode is active or standby. Resolution:
1. dfs.namenode.rpc-address.<cluster-name>.<nn-id> was set to an IP address instead of a host name, so namenode_id was left as None. (Checking for the IP address in addition to the hostname when retrieving namenode_id would avoid this.) Below is the relevant code from Ambari:

# Values for the current Host
namenode_id = None
namenode_rpc = None
dfs_ha_namemodes_ids_list = []
other_namenode_id = None

if dfs_ha_namenode_ids:
    dfs_ha_namemodes_ids_list = dfs_ha_namenode_ids.split(",")
    dfs_ha_namenode_ids_array_len = len(dfs_ha_namemodes_ids_list)
    if dfs_ha_namenode_ids_array_len > 1:
        dfs_ha_enabled = True

if dfs_ha_enabled:
    for nn_id in dfs_ha_namemodes_ids_list:
        nn_host = config['configurations']['hdfs-site'][format('dfs.namenode.rpc-address.{dfs_ha_nameservices}.{nn_id}')]
        if hostname in nn_host:
            namenode_id = nn_id
            namenode_rpc = nn_host

2. Continuously shuffling the NameNode failover with "hdfs haadmin -failover" resolved the HDFS issue and the upgrade proceeded further. [You need to fail over with "hdfs haadmin -failover" from nn1 to nn2 and vice versa while the Ambari restart process is ongoing, until it can detect the NameNode status.]
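The failure mode can be reproduced in isolation. This sketch mimics Ambari's lookup with hypothetical hostnames and IPs (nn1.example.com, 10.0.0.11, etc. are made up), showing that the `hostname in nn_host` check only matches when the rpc-address uses the host name:

```python
def find_namenode_id(hostname, hdfs_site, nameservice, nn_ids):
    """Mimic Ambari's lookup: match this host against each nn's rpc-address."""
    for nn_id in nn_ids:
        nn_host = hdfs_site['dfs.namenode.rpc-address.%s.%s' % (nameservice, nn_id)]
        if hostname in nn_host:
            return nn_id
    return None  # -> "The NameNode None is not listed as Active or Standby"

hostname = "nn1.example.com"

# rpc-address configured with host names: the lookup works.
by_name = {
    'dfs.namenode.rpc-address.mycluster.nn1': 'nn1.example.com:8020',
    'dfs.namenode.rpc-address.mycluster.nn2': 'nn2.example.com:8020',
}
print(find_namenode_id(hostname, by_name, 'mycluster', ['nn1', 'nn2']))  # nn1

# rpc-address configured with IP addresses: namenode_id stays None.
by_ip = {
    'dfs.namenode.rpc-address.mycluster.nn1': '10.0.0.11:8020',
    'dfs.namenode.rpc-address.mycluster.nn2': '10.0.0.12:8020',
}
print(find_namenode_id(hostname, by_ip, 'mycluster', ['nn1', 'nn2']))  # None
```

This is why switching the rpc-address values back to host names (or repeatedly failing over as described) unblocks the restart.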
11-17-2016 12:45 AM · 6 Kudos
ISSUE:
1. An HDP upgrade from 2.4.2 to 2.5.0 failed on the last step, Finalize Upgrade, saying a few hosts were not able to upgrade to the latest version.
2. Tried to revert the upgrade and proceeded with the Downgrade option.
3. During the downgrade it prompted for Atlas and Kafka to be deleted from the cluster.
4. Deleted Atlas and Kafka from the cluster.
5. Proceeding further, stopping the services failed on stopping KAFKA.
6. The downgrade screen was paused, and the Ambari UI details tab showed "Failed to start KAFKA_BROKER".
ROOT CAUSE:
Two tasks from the downgrade appeared to be stuck in the PENDING state.
RESOLUTION:
1. Check at which step the upgrade is stuck using the API endpoint below -
http://<ambari_host>:8080/api/v1/clusters/<clustername>/upgrades/
2. Pick the latest "request_id" from the above output and query -
http://<ambari_host>:8080/api/v1/clusters/<clustername>/upgrades/<request_id>
In my case the request_id was 858
3. Log in to the Ambari database and run the query below with that "request_id" to find tasks that are not in COMPLETED status in the host_role_command table, as shown below -
ambari=> SELECT task_id, status, event, host_id, role, role_command, command_detail, custom_command_name FROM host_role_command WHERE request_id = 858 AND status != 'COMPLETED' ORDER BY task_id DESC;
8964, PENDING, 4, KAFKA_BROKER, CUSTOM_COMMAND, RESTART KAFKA/KAFKA_BROKER, RESTART
8897, PENDING, 4, KAFKA_BROKER, CUSTOM_COMMAND, STOP KAFKA/KAFKA_BROKER, STOP
4. Update the status of the above tasks to COMPLETED using the command below -
UPDATE host_role_command SET status = 'COMPLETED' WHERE request_id = 858 AND status = 'PENDING';
After which it was able to proceed with Downgrade.
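The stuck-task pattern can be illustrated with a toy sqlite stand-in for host_role_command (the real Ambari DB has a much wider schema; columns and data here are trimmed to the essentials from the post):

```python
import sqlite3

# Toy stand-in for Ambari's host_role_command table (minimal columns only).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE host_role_command
               (task_id INTEGER, request_id INTEGER, status TEXT, role TEXT)""")
con.executemany("INSERT INTO host_role_command VALUES (?,?,?,?)", [
    (8897, 858, 'PENDING',   'KAFKA_BROKER'),
    (8964, 858, 'PENDING',   'KAFKA_BROKER'),
    (8800, 858, 'COMPLETED', 'NAMENODE'),
])

def stuck_tasks(con, request_id):
    """Tasks for a request that are not COMPLETED (what blocks the downgrade)."""
    return con.execute("""SELECT task_id FROM host_role_command
                          WHERE request_id=? AND status != 'COMPLETED'
                          ORDER BY task_id DESC""", (request_id,)).fetchall()

print(stuck_tasks(con, 858))  # the two PENDING Kafka tasks

# The fix from the post, against the toy table:
con.execute("""UPDATE host_role_command SET status='COMPLETED'
               WHERE request_id=858 AND status='PENDING'""")
print(stuck_tasks(con, 858))  # nothing left blocking the downgrade
```

As always with direct DB edits, back up the Ambari database before running the UPDATE on a real cluster.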
11-16-2016 11:33 AM · 7 Kudos
ISSUE: The Hive view is not working. ERROR: H100 Unable to submit statement show databases like '*': org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
ROOT CAUSE: The MySQL connection pool limit (max_connections) was being exceeded. Check it using - mysql> SHOW VARIABLES LIKE "max_connections";
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 100 |
+-----------------+-------+
1 row in set (0.00 sec)
RESOLUTION: Raised the MySQL max_connections limit from 100 to 500, which resolved the issue. (Note: SET GLOBAL takes effect immediately but does not survive a MySQL restart; to persist it, also set max_connections = 500 under [mysqld] in my.cnf.) mysql> SET GLOBAL max_connections = 500;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW VARIABLES LIKE "max_connections";
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 500 |
+-----------------+-------+
1 row in set (0.00 sec)
11-16-2016 11:33 AM · 7 Kudos
SYMPTOM: The Standby NameNode crashes due to edit log corruption, complaining that OP_CLOSE cannot be applied because the file is not under construction.
ERROR: 2016-09-30T06:23:25.126-0400 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=/appdata/148973_perfengp/TARGET/092016/tempdb.TARGET.092016.hdfs, replication=3, mtime=1475223680193, atime=1472804384143, blockSize=134217728, blocks=[blk_1243879398_198862467], permissions=gsspe:148973_psdbpe:rwxrwxr-x, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=1585682886]
java.io.IOException: File is not under construction: /appdata/148973_perfengp/TARGET/092016/tempdb.TARGET.092016.hdfs
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:436)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:679)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1022)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:741)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:536)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:595)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:762)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:746)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1438)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1504)
ROOT CAUSE: Edit log corruption can happen if an append fails with a quota violation. This is a known bug:
https://issues.apache.org/jira/browse/HDFS-7587
https://hortonworks.jira.com/browse/BUG-56811
https://hortonworks.jira.com/browse/EAR-1248
RESOLUTION:
1. Stop everything.
2. Back up the "current" folder of every JournalNode in the cluster.
3. Back up the "current" folder of every NameNode in the cluster.
4. Use the oev command to convert the binary edit log file into XML.
5. Remove the record corresponding to the TXID mentioned in the error.
6. Use the oev command to convert the XML edit log file back into binary.
7. Restart the active NameNode.
8. I got an error saying there was a gap in the edit logs.
9. Obtain the keytab for the service principal nn/<host>@<REALM>.
10. Execute the command hadoop namenode -recover
11. Answer "c" when the gap problem occurred.
12. Then I saw other errors similar to the one I encountered at the beginning (the file-not-under-construction issue).
13. I had to run hadoop namenode -recover twice to get rid of these errors.
14. The ZooKeeper servers were already started, so I started the JournalNodes, the DataNodes, the ZKFC controllers, and finally the active NameNode.
15. Some DataNodes were identified as dead. After some investigation, I found the information in ZooKeeper was empty, so I restarted the ZooKeeper servers, after which the active NameNode was fine.
16. I started the standby NameNode, but it raised the same errors concerning the gap in the edit logs.
17. As the hdfs user, I executed on the standby NameNode the command hadoop namenode -bootstrapStandby -force
18. The new FSImage was good and identical to the one on the active NameNode.
19. I started the standby NameNode successfully.
20. I launched the rest of the cluster.
Also check the recovery option given in the link - Namenode-Recovery
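The record removal in step 5 can be done with a short script instead of hand-editing. This is a sketch assuming the usual oev XML layout (RECORD elements with a DATA/TXID child; the two-record dump below is synthetic), and it should only ever be run against backed-up copies:

```python
import xml.etree.ElementTree as ET

def drop_txid(xml_text, bad_txid):
    """Remove the RECORD whose DATA/TXID equals bad_txid from an oev XML dump."""
    root = ET.fromstring(xml_text)
    for record in list(root.findall('RECORD')):
        if record.findtext('DATA/TXID') == str(bad_txid):
            root.remove(record)
    return ET.tostring(root, encoding='unicode')

# Synthetic two-record dump, modeled on oev's XML output format:
dump = """<EDITS>
  <EDITS_VERSION>-63</EDITS_VERSION>
  <RECORD><OPCODE>OP_ADD</OPCODE><DATA><TXID>1585682885</TXID></DATA></RECORD>
  <RECORD><OPCODE>OP_CLOSE</OPCODE><DATA><TXID>1585682886</TXID></DATA></RECORD>
</EDITS>"""

cleaned = drop_txid(dump, 1585682886)
print('1585682886' in cleaned)  # False: the corrupt OP_CLOSE record is gone
```

The cleaned XML is what you then convert back to binary with oev in step 6.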
11-15-2016 02:09 PM · 6 Kudos
ISSUE: While unkerberizing the cluster, all services went down and nothing was coming up; the disable-Kerberos step itself failed, as did starting the services. Manually starting the NameNodes brought them up, but their status was not displayed correctly in the Ambari UI. The JournalNodes were not able to start and failed with the error shown below. ERROR: Screenshot is attached below. JournalNode error: ROOT CAUSE: There were multiple issues:
1. The JournalNode error says "missing spnego keytab", which suggests Kerberos was not properly disabled on the cluster.
2. In hdfs-site.xml the property "hadoop.http.authentication.type" was still set to kerberos.
3. Oozie was not able to detect the active NameNode, since the property "hadoop.http.authentication.simple.anonymous.allowed" was set to false.
RESOLUTION:
1. Setting hadoop.http.authentication.type to simple in hdfs-site.xml allowed HDFS to restart.
2. Setting hadoop.http.authentication.simple.anonymous.allowed=true in hdfs-site.xml allowed Oozie to detect the active NameNode, and the NameNode status was correctly displayed in the NameNode UI.
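For reference, the two properties as they should look in hdfs-site.xml after disabling Kerberos (a sketch of just the relevant fragment):

```xml
<property>
  <name>hadoop.http.authentication.type</name>
  <value>simple</value>
</property>
<property>
  <name>hadoop.http.authentication.simple.anonymous.allowed</name>
  <value>true</value>
</property>
```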
11-15-2016 02:09 PM · 7 Kudos
Attachment: updateddeleteuser.zip
ISSUE: Ranger LDAP integration was working fine. The customer deleted a user from the Ranger UI and then faced an issue re-importing the user into Ranger.
ROOT CAUSE: The customer removed the user from the Ranger UI and expected the user to be automatically re-imported by the Ranger usersync process. Below are sample screenshots - the user 'testuser' is deleted from the Ranger UI, but you can see the user is still present in the database.
RESOLUTION: There are multiple tables that carry entries for the user. You need to run the delete script to remove the user's entries from the database and restart the Ranger usersync process to re-import the user. Please find the delete script attached. Syntax to run the script - $ deleteUser.sh -f input.txt -u ranger_user -p password -db ranger [-r <replaceUser>]
11-15-2016 05:29 AM · 6 Kudos
ISSUE: After enabling Ambari SSL Hive views stopped working. ERROR: 08 Nov 2016 11:32:23,330 WARN [qtp-ambari-client-263] nio:720 - javax.net.ssl.SSLException: Received fatal alert: certificate_unknown
08 Nov 2016 11:32:23,331 ERROR [qtp-ambari-client-256] ServiceFormattedException:100 - org.apache.ambari.view.utils.ambari.AmbariApiException: RA040 I/O error while requesting Ambari
org.apache.ambari.view.utils.ambari.AmbariApiException: RA040 I/O error while requesting Ambari
at org.apache.ambari.view.utils.ambari.AmbariApi.requestClusterAPI(AmbariApi.java:176)
at org.apache.ambari.view.utils.ambari.AmbariApi.requestClusterAPI(AmbariApi.java:142)
at org.apache.ambari.view.utils.ambari.AmbariApi.getHostsWithComponent(AmbariApi.java:99)
at org.apache.ambari.view.hive.client.ConnectionFactory.getHiveHost(ConnectionFactory.java:79)
at org.apache.ambari.view.hive.client.ConnectionFactory.create(ConnectionFactory.java:68)
at org.apache.ambari.view.hive.client.UserLocalConnection.initialValue(UserLocalConnection.java:42)
at org.apache.ambari.view.hive.client.UserLocalConnection.initialValue(UserLocalConnection.java:26)
at org.apache.ambari.view.utils.UserLocal.get(UserLocal.java:66)
at org.apache.ambari.view.hive.resources.browser.HiveBrowserService.databases(HiveBrowserService.java:87)
at sun.reflect.GeneratedMethodAccessor186.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
Root Cause: The truststore configuration for Ambari Server was missing. Resolution: Set up the truststore for Ambari Server as per the link below, after which the issue was resolved. https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.2.1/bk_Ambari_Security_Guide/content/_set_up_truststore_for_ambari_server.html
11-14-2016 05:30 PM · 6 Kudos
SYMPTOM: During HDP upgrade from 2.3 to 2.5, the YARN service check fails with NoSuchMethodError: org.apache.hadoop.yarn.api.records.Resource.getMemorySize()J ERROR: Below was the error in the application logs - 16/11/14 10:30:12 FATAL distributedshell.ApplicationMaster: Error running ApplicationMaster
java.lang.NoSuchMethodError: org.apache.hadoop.yarn.api.records.Resource.getMemorySize()J
at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:585)
at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:298)
ROOT CAUSE: There was a classpath issue: the NodeManager the job ran on was pointing at the older version's (2.3) classpath. RESOLUTION: There are two solutions - 1. Skip this step in the Ambari upgrade UI and proceed; Ambari will take care of setting up the classpath. 2. Modify the classpath manually, confirm it with the "hadoop classpath" command, and re-run the service check.