Member since 02-08-2016 · 793 Posts · 669 Kudos Received · 85 Solutions
11-18-2016 09:57 AM · 6 Kudos
PROBLEM STATEMENT: After running python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py on the Ambari Metrics server, every service comes back up except ambari-metrics-collector.
The process is running, but two alerts remain:
Metrics Collector Process - Connection failed: [Errno 111] Connection refused to XXXXX:6188
Metrics Collector - HBase Master Process - Connection failed: [Errno 111] Connection refused to XXXXXX:61310
ERROR: 0x15741734f740003, negotiated timeout = 120000
07:57:42,262 INFO [main] ZooKeeperRegistry:107 - ClusterId read in ZooKeeper is null
07:57:42,341 WARN [main] HeapMemorySizeUtil:55 - hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
07:58:13,170 INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740001, likely server has closed socket, closing socket connection and attempting reconnect
07:58:13,170 INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740003, likely server has closed socket, closing socket connection and attempting reconnect
07:58:14,381 INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,382 WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
07:58:14,961 INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,961 WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
SYMPTOM:
Java garbage collection (GC) pauses occur frequently for the AMS HBase master process in the same time frame in which these errors are observed.
To verify:
1. Review /var/log/ambari-metrics-collector/hbase-ams-master-<hostname>.log
2. Check whether messages like the following are printed often and XXXms is larger than a few hundred ms:
"[JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately XXXms"
3. Also review gc.log-YYYYMMDDHHmm (default location /var/log/ambari-metrics-collector/) to find out which Java memory area is causing the slowness.
RESOLUTION: Increase the AMS HBase heap sizes as follows:
1. Identify the current heap size by checking the Java process settings (-Xmx, -Xmn, -XX:MaxPermSize) by running:
ps auxwww | grep 'org.apache.hadoop.hbase.master.HMaster start'
2. Check the free memory on the system by running:
free -t
If the server has enough free memory, increase hbase_master_heapsize and/or the following, based on the GC type identified from gc.log:
1. hbase_master_maxperm_size
2. hbase_master_xmn_size
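To spot the long pauses quickly, the log can be scanned with a short script. This is a sketch assuming the JvmPauseMonitor message format quoted above (the sample lines are synthetic; file handling is omitted):

```python
import re

# Matches the JvmPauseMonitor warning and captures the pause duration in ms.
PAUSE_RE = re.compile(r"Detected pause in JVM or host machine \(eg GC\): "
                      r"pause of approximately (\d+)ms")

def long_pauses(lines, threshold_ms=300):
    """Return pause durations (ms) that exceed threshold_ms."""
    pauses = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            ms = int(m.group(1))
            if ms > threshold_ms:
                pauses.append(ms)
    return pauses

# Synthetic log lines for illustration:
sample = [
    "INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM "
    "or host machine (eg GC): pause of approximately 1423ms",
    "INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM "
    "or host machine (eg GC): pause of approximately 120ms",
]
print(long_pauses(sample))  # [1423] -- only the long pause crosses the threshold
```

Frequent results in the hundreds of milliseconds or more are the signal to look at the heap settings described above.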
11-17-2016 02:54 PM · 6 Kudos
Problem Statement: Created a user in Ranger. After some time the user is no longer visible in the Ranger UI, but the user is present in the Ranger DB table x_user, and in the usersync logs we see the user being synchronized repeatedly. The user was an LDAP user and there was no issue with the other users. ERROR: "testuser" is not displayed in the Ranger UI but is reflected in the Ranger DB.
ROOT CAUSE: The role-mapping row for this particular user in the database appears to have been corrupted or lost. RESOLUTION: Inserting the following row into the table "x_portal_user_role" resolved the issue:
INSERT INTO x_portal_user_role VALUES(NULL,'2016-09-09 00:00:00','2016-09-09 00:00:00',1,1,(SELECT id FROM x_portal_user WHERE login_id='XXXX'),'ROLE_USER',1);
### NOTE: Replace XXXX with the login_id (the username used to log in to the Ranger portal) of the affected user.
You can replace 'ROLE_USER' with 'ROLE_SYS_ADMIN' if you want it to be an admin
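The broken state can be illustrated with a toy sqlite reproduction (real Ranger runs on MySQL/Postgres with a much wider schema; the two-column tables here are made up for the demonstration): a user present in x_portal_user with no x_portal_user_role row is exactly what hides the user from the UI.

```python
import sqlite3

# Minimal stand-ins for the Ranger tables (illustrative schema only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE x_portal_user (id INTEGER PRIMARY KEY, login_id TEXT)")
con.execute("CREATE TABLE x_portal_user_role (user_id INTEGER, user_role TEXT)")

con.execute("INSERT INTO x_portal_user VALUES (1, 'admin'), (2, 'testuser')")
con.execute("INSERT INTO x_portal_user_role VALUES (1, 'ROLE_SYS_ADMIN')")
# Note: no role row for 'testuser' -- the broken state from the post.

def users_missing_role(con):
    """Users that exist in x_portal_user but have no x_portal_user_role row."""
    rows = con.execute("""
        SELECT u.login_id FROM x_portal_user u
        LEFT JOIN x_portal_user_role r ON r.user_id = u.id
        WHERE r.user_id IS NULL""").fetchall()
    return [login for (login,) in rows]

print(users_missing_role(con))  # ['testuser']

# The fix from the post, in miniature: insert the missing role row.
con.execute("""INSERT INTO x_portal_user_role
               SELECT id, 'ROLE_USER' FROM x_portal_user
               WHERE login_id='testuser'""")
print(users_missing_role(con))  # []
```

The LEFT JOIN / IS NULL query is also a quick way to find any other users in the same state before restarting usersync.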
11-17-2016 02:39 PM · 7 Kudos
Problem Statement: Downgrading HDP failed on restarting the NameNode service, stuck on the error below - resource_management.core.exceptions.Fail: The NameNode None is not listed as Active or Standby, waiting... ERROR: ===
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 420, in <module>
NameNode().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
method(env)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 720, in restart
self.start(env, upgrade_type=upgrade_type)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
upgrade_suspended=params.upgrade_suspended, env=env)
File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
return fn(*args, **kwargs)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 185, in namenode
if is_this_namenode_active() is False:
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py", line 55, in wrapper
return function(*args, **kwargs)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 555, in is_this_namenode_active
raise Fail(format("The NameNode {namenode_id} is not listed as Active or Standby, waiting..."))
resource_management.core.exceptions.Fail: The NameNode None is not listed as Active or Standby, waiting...
===
Root Cause: On restart, the NameNode scripts were not able to populate namenode_id, which is required to detect whether the NameNode is active or standby. Resolution:
1. dfs.namenode.rpc-address.<cluster-name>.<nn-id> was set to an IP address instead of a host name, so namenode_id was left as None. (Checking for the IP address in addition to the hostname when retrieving namenode_id would avoid this.) Below is the relevant code from Ambari:

# Values for the current Host
namenode_id = None
namenode_rpc = None
dfs_ha_namemodes_ids_list = []
other_namenode_id = None

if dfs_ha_namenode_ids:
    dfs_ha_namemodes_ids_list = dfs_ha_namenode_ids.split(",")
    dfs_ha_namenode_ids_array_len = len(dfs_ha_namemodes_ids_list)
    if dfs_ha_namenode_ids_array_len > 1:
        dfs_ha_enabled = True

if dfs_ha_enabled:
    for nn_id in dfs_ha_namemodes_ids_list:
        nn_host = config['configurations']['hdfs-site'][format('dfs.namenode.rpc-address.{dfs_ha_nameservices}.{nn_id}')]
        if hostname in nn_host:
            namenode_id = nn_id
            namenode_rpc = nn_host

2. Continuously shuffling the NameNode failover with "hdfs haadmin -failover" resolved the HDFS issue and the upgrade proceeded further. [You need to fail over with "hdfs haadmin -failover" from nn1 to nn2 and vice versa while the Ambari restart process is ongoing, until it can detect the NameNode status.]
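The failure mode can be reproduced in isolation. This sketch mimics Ambari's lookup with hypothetical hostnames and IPs (nn1.example.com, 10.0.0.11, etc. are made up), showing that the `hostname in nn_host` check only matches when the rpc-address uses the host name:

```python
def find_namenode_id(hostname, hdfs_site, nameservice, nn_ids):
    """Mimic Ambari's lookup: match this host against each nn's rpc-address."""
    for nn_id in nn_ids:
        nn_host = hdfs_site['dfs.namenode.rpc-address.%s.%s' % (nameservice, nn_id)]
        if hostname in nn_host:
            return nn_id
    return None  # -> "The NameNode None is not listed as Active or Standby"

hostname = "nn1.example.com"

# rpc-address configured with host names: the lookup works.
by_name = {
    'dfs.namenode.rpc-address.mycluster.nn1': 'nn1.example.com:8020',
    'dfs.namenode.rpc-address.mycluster.nn2': 'nn2.example.com:8020',
}
print(find_namenode_id(hostname, by_name, 'mycluster', ['nn1', 'nn2']))  # nn1

# rpc-address configured with IP addresses: namenode_id stays None.
by_ip = {
    'dfs.namenode.rpc-address.mycluster.nn1': '10.0.0.11:8020',
    'dfs.namenode.rpc-address.mycluster.nn2': '10.0.0.12:8020',
}
print(find_namenode_id(hostname, by_ip, 'mycluster', ['nn1', 'nn2']))  # None
```

This is why switching the rpc-address values back to host names (or repeatedly failing over as described) unblocks the restart.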
11-17-2016 12:45 AM · 6 Kudos
ISSUE:
1. An HDP upgrade from 2.4.2 to 2.5.0 failed on the last step, Finalize Upgrade, saying a few hosts were not able to upgrade to the latest version.
2. Tried to revert the upgrade and proceeded with the Downgrade option.
3. During the downgrade it prompted for Atlas and Kafka to be deleted from the cluster.
4. Deleted Atlas and Kafka from the cluster.
5. Proceeding further, stopping the services failed on stopping KAFKA.
6. The downgrade screen was paused, and the Ambari UI details tab showed "Failed to start KAFKA_BROKER".
ROOT CAUSE:
Two tasks from the downgrade appeared to be stuck in the PENDING state.
RESOLUTION:
1. Check at which step the upgrade is stuck using the API endpoint below -
http://<ambari_host>:8080/api/v1/clusters/<clustername>/upgrades/
2. Pick the latest "request_id" from the above output and query -
http://<ambari_host>:8080/api/v1/clusters/<clustername>/upgrades/<request_id>
In my case the request_id was 858
3. Log in to the Ambari database and run the query below with that "request_id" to find tasks that are not in COMPLETED status in the host_role_command table, as shown below -
ambari=> SELECT task_id, status, event, host_id, role, role_command, command_detail, custom_command_name FROM host_role_command WHERE request_id = 858 AND status != 'COMPLETED' ORDER BY task_id DESC;
8964, PENDING, 4, KAFKA_BROKER, CUSTOM_COMMAND, RESTART KAFKA/KAFKA_BROKER, RESTART
8897, PENDING, 4, KAFKA_BROKER, CUSTOM_COMMAND, STOP KAFKA/KAFKA_BROKER, STOP
4. Update the status of the above tasks to COMPLETED using the command below -
UPDATE host_role_command SET status = 'COMPLETED' WHERE request_id = 858 AND status = 'PENDING';
After which it was able to proceed with Downgrade.
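The stuck-task pattern can be illustrated with a toy sqlite stand-in for host_role_command (the real Ambari DB has a much wider schema; columns and data here are trimmed to the essentials from the post):

```python
import sqlite3

# Toy stand-in for Ambari's host_role_command table (minimal columns only).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE host_role_command
               (task_id INTEGER, request_id INTEGER, status TEXT, role TEXT)""")
con.executemany("INSERT INTO host_role_command VALUES (?,?,?,?)", [
    (8897, 858, 'PENDING',   'KAFKA_BROKER'),
    (8964, 858, 'PENDING',   'KAFKA_BROKER'),
    (8800, 858, 'COMPLETED', 'NAMENODE'),
])

def stuck_tasks(con, request_id):
    """Tasks for a request that are not COMPLETED (what blocks the downgrade)."""
    return con.execute("""SELECT task_id FROM host_role_command
                          WHERE request_id=? AND status != 'COMPLETED'
                          ORDER BY task_id DESC""", (request_id,)).fetchall()

print(stuck_tasks(con, 858))  # the two PENDING Kafka tasks

# The fix from the post, against the toy table:
con.execute("""UPDATE host_role_command SET status='COMPLETED'
               WHERE request_id=858 AND status='PENDING'""")
print(stuck_tasks(con, 858))  # nothing left blocking the downgrade
```

As always with direct DB edits, back up the Ambari database before running the UPDATE on a real cluster.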
11-16-2016 11:33 AM · 7 Kudos
ISSUE: The Hive view is not working. ERROR: H100 Unable to submit statement show databases like '*': org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
ROOT CAUSE: The MySQL connection pool limit (max_connections) was being exceeded. Check it using - mysql> SHOW VARIABLES LIKE "max_connections";
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 100 |
+-----------------+-------+
1 row in set (0.00 sec)
RESOLUTION: Raised the MySQL max_connections limit from 100 to 500, which resolved the issue. (Note: SET GLOBAL takes effect immediately but does not survive a MySQL restart; to persist it, also set max_connections = 500 under [mysqld] in my.cnf.) mysql> SET GLOBAL max_connections = 500;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW VARIABLES LIKE "max_connections";
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 500 |
+-----------------+-------+
1 row in set (0.00 sec)
11-16-2016 11:33 AM · 7 Kudos
SYMPTOM: The Standby NameNode crashes due to edit log corruption, complaining that OP_CLOSE cannot be applied because the file is not under construction.
ERROR: 2016-09-30T06:23:25.126-0400 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=/appdata/148973_perfengp/TARGET/092016/tempdb.TARGET.092016.hdfs, replication=3, mtime=1475223680193, atime=1472804384143, blockSize=134217728, blocks=[blk_1243879398_198862467], permissions=gsspe:148973_psdbpe:rwxrwxr-x, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=1585682886]
java.io.IOException: File is not under construction: /appdata/148973_perfengp/TARGET/092016/tempdb.TARGET.092016.hdfs
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:436)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:679)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1022)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:741)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:536)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:595)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:762)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:746)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1438)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1504)
ROOT CAUSE: Edit log corruption can happen if an append fails with a quota violation. This is a known bug:
https://issues.apache.org/jira/browse/HDFS-7587
https://hortonworks.jira.com/browse/BUG-56811
https://hortonworks.jira.com/browse/EAR-1248
RESOLUTION:
1. Stop everything.
2. Back up the "current" folder of every JournalNode in the cluster.
3. Back up the "current" folder of every NameNode in the cluster.
4. Use the oev command to convert the binary edit log file into XML.
5. Remove the record corresponding to the TXID mentioned in the error.
6. Use the oev command to convert the XML edit log file back into binary.
7. Restart the active NameNode.
8. I got an error saying there was a gap in the edit logs.
9. Obtain the keytab for the service principal nn/<host>@<REALM>.
10. Execute the command hadoop namenode -recover
11. Answer "c" when the gap problem occurred.
12. Then I saw other errors similar to the one I encountered at the beginning (the file-not-under-construction issue).
13. I had to run hadoop namenode -recover twice to get rid of these errors.
14. The ZooKeeper servers were already started, so I started the JournalNodes, the DataNodes, the ZKFC controllers, and finally the active NameNode.
15. Some DataNodes were identified as dead. After some investigation, I found the information in ZooKeeper was empty, so I restarted the ZooKeeper servers, after which the active NameNode was fine.
16. I started the standby NameNode, but it raised the same errors concerning the gap in the edit logs.
17. As the hdfs user, I executed on the standby NameNode the command hadoop namenode -bootstrapStandby -force
18. The new FSImage was good and identical to the one on the active NameNode.
19. I started the standby NameNode successfully.
20. I launched the rest of the cluster.
Also check the recovery option given in the link - Namenode-Recovery
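The record removal in step 5 can be done with a short script instead of hand-editing. This is a sketch assuming the usual oev XML layout (RECORD elements with a DATA/TXID child; the two-record dump below is synthetic), and it should only ever be run against backed-up copies:

```python
import xml.etree.ElementTree as ET

def drop_txid(xml_text, bad_txid):
    """Remove the RECORD whose DATA/TXID equals bad_txid from an oev XML dump."""
    root = ET.fromstring(xml_text)
    for record in list(root.findall('RECORD')):
        if record.findtext('DATA/TXID') == str(bad_txid):
            root.remove(record)
    return ET.tostring(root, encoding='unicode')

# Synthetic two-record dump, modeled on oev's XML output format:
dump = """<EDITS>
  <EDITS_VERSION>-63</EDITS_VERSION>
  <RECORD><OPCODE>OP_ADD</OPCODE><DATA><TXID>1585682885</TXID></DATA></RECORD>
  <RECORD><OPCODE>OP_CLOSE</OPCODE><DATA><TXID>1585682886</TXID></DATA></RECORD>
</EDITS>"""

cleaned = drop_txid(dump, 1585682886)
print('1585682886' in cleaned)  # False: the corrupt OP_CLOSE record is gone
```

The cleaned XML is what you then convert back to binary with oev in step 6.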
11-15-2016 02:09 PM · 6 Kudos
ISSUE: While unkerberizing the cluster, all services went down and nothing was coming up; the disable-Kerberos step itself failed, as did starting the services. Manually starting the NameNodes brought them up, but their status was not displayed correctly in the Ambari UI. The JournalNodes were not able to start and failed with the error shown below. ERROR: Screenshot is attached below. JournalNode error: ROOT CAUSE: There were multiple issues:
1. The JournalNode error says "missing spnego keytab", which suggests Kerberos was not properly disabled on the cluster.
2. In hdfs-site.xml the property "hadoop.http.authentication.type" was still set to kerberos.
3. Oozie was not able to detect the active NameNode, since the property "hadoop.http.authentication.simple.anonymous.allowed" was set to false.
RESOLUTION:
1. Setting hadoop.http.authentication.type to simple in hdfs-site.xml allowed HDFS to restart.
2. Setting hadoop.http.authentication.simple.anonymous.allowed=true in hdfs-site.xml allowed Oozie to detect the active NameNode, and the NameNode status was correctly displayed in the NameNode UI.
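For reference, the two properties as they should look in hdfs-site.xml after disabling Kerberos (a sketch of just the relevant fragment):

```xml
<property>
  <name>hadoop.http.authentication.type</name>
  <value>simple</value>
</property>
<property>
  <name>hadoop.http.authentication.simple.anonymous.allowed</name>
  <value>true</value>
</property>
```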
11-15-2016 02:09 PM · 7 Kudos
Attachment: updateddeleteuser.zip
ISSUE: Ranger LDAP integration was working fine. The customer deleted a user from the Ranger UI and then faced an issue re-importing the user into Ranger.
ROOT CAUSE: The customer removed the user from the Ranger UI and expected the user to be automatically re-imported by the Ranger usersync process. Below are sample screenshots - the user 'testuser' is deleted from the Ranger UI, but you can see the user is still present in the database.
RESOLUTION: There are multiple tables that carry entries for the user. You need to run the delete script to remove the user's entries from the database and restart the Ranger usersync process to re-import the user. Please find the delete script attached. Syntax to run the script - $ deleteUser.sh -f input.txt -u ranger_user -p password -db ranger [-r <replaceUser>]
11-15-2016 05:29 AM · 6 Kudos
ISSUE: After enabling Ambari SSL Hive views stopped working. ERROR: 08 Nov 2016 11:32:23,330 WARN [qtp-ambari-client-263] nio:720 - javax.net.ssl.SSLException: Received fatal alert: certificate_unknown
08 Nov 2016 11:32:23,331 ERROR [qtp-ambari-client-256] ServiceFormattedException:100 - org.apache.ambari.view.utils.ambari.AmbariApiException: RA040 I/O error while requesting Ambari
org.apache.ambari.view.utils.ambari.AmbariApiException: RA040 I/O error while requesting Ambari
at org.apache.ambari.view.utils.ambari.AmbariApi.requestClusterAPI(AmbariApi.java:176)
at org.apache.ambari.view.utils.ambari.AmbariApi.requestClusterAPI(AmbariApi.java:142)
at org.apache.ambari.view.utils.ambari.AmbariApi.getHostsWithComponent(AmbariApi.java:99)
at org.apache.ambari.view.hive.client.ConnectionFactory.getHiveHost(ConnectionFactory.java:79)
at org.apache.ambari.view.hive.client.ConnectionFactory.create(ConnectionFactory.java:68)
at org.apache.ambari.view.hive.client.UserLocalConnection.initialValue(UserLocalConnection.java:42)
at org.apache.ambari.view.hive.client.UserLocalConnection.initialValue(UserLocalConnection.java:26)
at org.apache.ambari.view.utils.UserLocal.get(UserLocal.java:66)
at org.apache.ambari.view.hive.resources.browser.HiveBrowserService.databases(HiveBrowserService.java:87)
at sun.reflect.GeneratedMethodAccessor186.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
Root Cause: The truststore configuration for Ambari Server was missing. Resolution: Set up the truststore for Ambari Server as per the link below, after which the issue was resolved. https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.2.1/bk_Ambari_Security_Guide/content/_set_up_truststore_for_ambari_server.html
11-14-2016 05:30 PM · 6 Kudos
SYMPTOM: During HDP upgrade from 2.3 to 2.5, the YARN service check fails with NoSuchMethodError: org.apache.hadoop.yarn.api.records.Resource.getMemorySize()J ERROR: Below was the error in the application logs - 16/11/14 10:30:12 FATAL distributedshell.ApplicationMaster: Error running ApplicationMaster
java.lang.NoSuchMethodError: org.apache.hadoop.yarn.api.records.Resource.getMemorySize()J
at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:585)
at org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:298)
ROOT CAUSE: There was a classpath issue: the NodeManager the job ran on was pointing at the older version's (2.3) classpath. RESOLUTION: There are two solutions - 1. Skip this step in the Ambari upgrade UI and proceed; Ambari will take care of setting up the classpath. 2. Modify the classpath manually, confirm it with the "hadoop classpath" command, and re-run the service check.