Member since: 02-08-2016
Posts: 793
Kudos Received: 669
Solutions: 85
11-18-2016
09:57 AM
6 Kudos
PROBLEM STATEMENT: There are frequent email alerts from the HiveServer2 Metastore process which are misleading. When checked, the HiveServer2 Metastore process is up and running fine without any issue.
ERROR: Below is a sample email alert -
The Ambari alerts log output (/var/log/ambari-server/ambari-alerts.log) shows -
ROOT CAUSE: The issue was with the Kerberos credentials for the ambari-qa user.
RESOLUTION: The HiveServer2 Metastore alert is triggered from the node on which the service is installed. The "ambari-qa" user is used by Ambari to trigger the alert. If you are using a Kerberized cluster, make sure the "ambari-qa" user has a valid ticket in place, otherwise false alerts will be triggered. You can use the commands below to check -
$ su - ambari-qa
$ klist <-- [make sure you have a valid ticket]
$ hive --hiveconf hive.metastore.uris=thrift://<hive_metastore_host>:9083 --hiveconf hive.metastore.client.connect.retry.delay=1 --hiveconf hive.metastore.failure.retries=1 --hiveconf hive.metastore.connect.retries=1 --hiveconf hive.metastore.client.socket.timeout=14 --hiveconf hive.execution.engine=mr -e 'show databases;'
Also check - sometimes you might be hitting https://issues.apache.org/jira/browse/AMBARI-14424
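If klist shows no valid ticket for ambari-qa, one can usually be obtained from the smoke-user keytab. A minimal sketch, assuming the default HDP keytab location and a typical principal name (both are assumptions - adjust to your cluster):
$ su - ambari-qa
$ klist -kt /etc/security/keytabs/smokeuser.headless.keytab     # list the principals in the keytab (path is an assumption)
$ kinit -kt /etc/security/keytabs/smokeuser.headless.keytab ambari-qa@EXAMPLE.COM   # use the principal/realm shown by klist -kt
$ klist                                                         # confirm a valid ticket is now present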
11-18-2016
09:57 AM
6 Kudos
PROBLEM STATEMENT: I would like to change the default password for rangertagsync. According to the manual https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_command-line-installation/content/ch13s05s04.html, I should just run python updatetagadminpassword.py after changing the password in Ranger. However, I get the error shown below -
ERROR:
[root@test ranger-tagsync]# pwd
/usr/hdp/current/ranger-tagsync
[root@test ranger-tagsync]# python updatetagadminpassword.py
2016-11-18 09:30:08,311 [E] Required file not found: [/usr/hdp/2.5.0.0-1245/ranger-tagsync/conf/ranger-tagsync-site.xml]
[root@test ranger-tagsync]#
Debugging Steps:
1. From the above log it is clear that "ranger-tagsync-site.xml" is missing. But when checking with "rpm -qa | grep tagsync", I see the rpm was installed -
[root@thakur2 ranger-tagsync]# rpm -qa |grep tagsync
ranger_2_5_0_0_1245-tagsync-0.6.0.2.5.0.0-1245.el6.x86_64
2. Checked "rpm -ql" on the package to make sure "ranger-tagsync-site.xml" comes from the same package, but "ranger-tagsync-site.xml" was not listed -
[root@thakur2 ranger-tagsync]# rpm -ql ranger_2_5_0_0_1245-tagsync-0.6.0.2.5.0.0-1245.el6.x86_64 |grep xml
/usr/hdp/2.5.0.0-1245/etc/ranger/tagsync/conf.dist/log4j.xml
/usr/hdp/2.5.0.0-1245/ranger-tagsync/conf.dist/log4j.xml
/usr/hdp/2.5.0.0-1245/ranger-tagsync/templates/installprop2xml.properties
/usr/hdp/2.5.0.0-1245/ranger-tagsync/templates/ranger-tagsync-template.xml
3. Tried removing ranger-tagsync on a test cluster and re-installing it. From the Ambari operations logs I see that during installation of Ranger Tagsync the file is auto-generated and is not owned by any package. Check the output below -
RESOLUTION: Reinstalled "ranger_2_5_0_0_1245-tagsync", which resolved the issue. I was able to execute the password change script after reinstalling ranger-tagsync -
[root@test ranger-tagsync]# python updatetagadminpassword.py
2016-11-18 09:48:22,774 [I] Using Java:/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java
getting values from file : /usr/hdp/2.5.0.0-1245/ranger-tagsync/conf/ranger-tagsync-site.xml
Enter Destination NAME (Ranger/Atlas):
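A quick way to confirm the state after the reinstall, before re-running the script (a sketch; the paths follow the output above, and "is not owned by any package" from rpm -qf is expected because Ambari generates the file at install time rather than the rpm):
[root@test ranger-tagsync]# ls -l /usr/hdp/current/ranger-tagsync/conf/ranger-tagsync-site.xml   # the file should now exist
[root@test ranger-tagsync]# rpm -qf /usr/hdp/2.5.0.0-1245/ranger-tagsync/conf/ranger-tagsync-site.xml
file /usr/hdp/2.5.0.0-1245/ranger-tagsync/conf/ranger-tagsync-site.xml is not owned by any package
[root@test ranger-tagsync]# python updatetagadminpassword.py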
11-18-2016
09:57 AM
6 Kudos
PROBLEM STATEMENT: "hdfs dfsadmin -report" shows "Blocks with corrupt replicas : 20" while
"hdfs fsck /" shows "Corrupt blocks : 0". However, "Corrupted blocks" shows as 20 in Ambari, and the summary of the HDFS service shows "Block 20 corrupt / 0 missing / ..."
ERROR: Please find the screenshots below - Ambari output, fsck output, dfsadmin output.
ROOT CAUSE: Ambari labels the value "Corrupted Blocks", but what it actually displays is the count of corrupted replicas.
RESOLUTION: This is a bug - https://hortonworks.jira.com/browse/BUG-41958 - which is fixed in Ambari version 2.2.2.
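To confirm the mismatch from the command line, the two counters can be compared directly (a sketch; both are standard HDFS commands):
# hdfs dfsadmin -report | grep -i corrupt        # shows "Blocks with corrupt replicas" (some replicas bad, good copies still exist)
# hdfs fsck / | grep -i "Corrupt blocks"         # shows truly corrupt blocks, i.e. blocks with no healthy replica left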
11-18-2016
09:57 AM
6 Kudos
PROBLEM STATEMENT: After running python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py on the Ambari Metrics server,
every service comes back up except the ambari-metrics-collector.
The process is running, but two alerts are left:
Metrics Collector Process - Connection failed: [Errno 111] Connection refused to XXXXX:6188
Metrics Collector - HBase Master Process - Connection failed: [Errno 111] Connection refused to XXXXXX:61310
ERROR: 0x15741734f740003, negotiated timeout = 120000
07:57:42,262 INFO [main] ZooKeeperRegistry:107 - ClusterId read in ZooKeeper is null
07:57:42,341 WARN [main] HeapMemorySizeUtil:55 - hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
07:58:13,170 INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740001, likely server has closed socket, closing socket connection and attempting reconnect
07:58:13,170 INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740003, likely server has closed socket, closing socket connection and attempting reconnect
07:58:14,381 INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,382 WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
07:58:14,961 INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,961 WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
SYMPTOM:
Java garbage collection (GC) pauses occur frequently for the AMS HBase master process in the same time frame in which these errors are observed.
To verify this:
1. Review /var/log/ambari-metrics-collector/hbase-ams-master-<hostname>.log
2. Check whether messages like the following are printed often and XXXms is larger than a few hundred ms:
"[JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately XXXms"
3. Also review gc.log-YYYYMMDDHHmm to find out which Java memory area is causing the slowness. The default location is /var/log/ambari-metrics-collector/
RESOLUTION: Increase the AMS HBase heap sizes as follows (see the sketch after these steps):
1. Identify the current heap size by checking the Java process settings (-Xmx, -Xmn, -XX:MaxPermSize) by running:
ps auxwww | grep 'org.apache.hadoop.hbase.master.HMaster start'
2. Check the free memory on the system by running:
free -t
3. If the server has enough free memory, increase hbase_master_heapsize and/or the following, based on the GC type identified from gc.log: hbase_master_maxperm_size and/or hbase_master_xmn_size
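A minimal sketch of the verification commands, assuming the default AMS log locations mentioned above (the log file name and paths are assumptions - adjust the hostname and log names to your installation):
# grep -c "Detected pause in JVM" /var/log/ambari-metrics-collector/hbase-ams-master-`hostname -f`.log   # how often JVM pauses are being reported
# ps auxwww | grep 'org.apache.hadoop.hbase.master.HMaster start' | grep -o -- '-Xm[xn][^ ]*'            # current -Xmx / -Xmn settings of the AMS HBase master
# free -t                                                                                                # free memory available before raising the heap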
11-17-2016
02:54 PM
6 Kudos
Problem Statement: Created a user in Ranger. After some time the user is not reflected in the Ranger UI, but the user is present in the Ranger DB table x_user, and in the usersync logs we see the user being synchronized all the time. The user was an LDAP user and there was no issue with the other users. ERROR: Below is a snapshot of the issue - "testuser" is not displayed in the Ranger UI but is reflected in the Ranger DB, as shown below -
ROOT CAUSE: It appears the database entries for this particular user were corrupted. RESOLUTION: Inserted the row below into the table "x_portal_user_role", after which the issue was resolved.
INSERT INTO x_portal_user_role VALUES(NULL,'2016-09-09 00:00:00','2016-09-09 00:00:00',1,1,(SELECT id FROM x_portal_user WHERE login_id='XXXX'),'ROLE_USER',1);
### NOTE: Replace XXXX with the login_id (the username used to log in to the Ranger portal) of the user ('XXXX')
You can replace 'ROLE_USER' with 'ROLE_SYS_ADMIN' if you want the user to be an admin.
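Before inserting, it is worth confirming that the role row really is missing for the user. A sketch, assuming a MySQL-backed Ranger database and the "testuser" login from the example above; the column names are inferred from the INSERT statement and may differ between Ranger versions:
mysql> SELECT u.id, u.login_id, r.user_role
    ->   FROM x_portal_user u
    ->   LEFT JOIN x_portal_user_role r ON r.user_id = u.id
    ->  WHERE u.login_id = 'testuser';
A NULL user_role in the result indicates the missing x_portal_user_role row that the INSERT above adds.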
11-17-2016
02:39 PM
7 Kudos
Problem Statement: An HDP downgrade failed on restarting the NameNode service and was stuck on the error below - resource_management.core.exceptions.Fail: The NameNode None is not listed as Active or Standby, waiting... ERROR: ===
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 420, in <module>
NameNode().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
method(env)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 720, in restart
self.start(env, upgrade_type=upgrade_type)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
upgrade_suspended=params.upgrade_suspended, env=env)
File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
return fn(*args, **kwargs)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 185, in namenode
if is_this_namenode_active() is False:
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py", line 55, in wrapper
return function(*args, **kwargs)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 555, in is_this_namenode_active
raise Fail(format("The NameNode
{namenode_id}is not listed as Active or Standby, waiting..."))
resource_management.core.exceptions.Fail: The NameNode None is not listed as Active or Standby, waiting...
===
Root Cause: The NameNode restart was not able to populate namenode_id, which is required to detect whether the NameNode is active or standby. Resolution: 1. dfs.namenode.rpc-address.<nameservice>.<nn-id> was set to an IP address instead of a hostname, and hence namenode_id was set to None. (It would help if the script could also match on IP addresses, in addition to hostnames, when retrieving namenode_id.) Below is the relevant code -
# Values for the current Host
namenode_id = None
namenode_rpc = None
dfs_ha_namemodes_ids_list = []
other_namenode_id = None
if dfs_ha_namenode_ids:
dfs_ha_namemodes_ids_list = dfs_ha_namenode_ids.split(",")
dfs_ha_namenode_ids_array_len = len(dfs_ha_namemodes_ids_list)
if dfs_ha_namenode_ids_array_len > 1:
dfs_ha_enabled = True
if dfs_ha_enabled:
for nn_id in dfs_ha_namemodes_ids_list:
nn_host = config['configurations']['hdfs-site'][format('dfs.namenode.rpc-address.{dfs_ha_nameservices}.{nn_id}')]
if hostname in nn_host:
namenode_id = nn_id
namenode_rpc = nn_host
2. Also tried continuously shuffling the NameNode failover using "hdfs haadmin -failover", which worked around the HDFS issue and allowed the downgrade to proceed further. [You need to keep failing over the NameNodes with "hdfs haadmin -failover" from nn1 to nn2 and vice versa while the Ambari restart process is ongoing, so that the status of the NameNode can be detected. See the sketch below.]
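A sketch of the two checks/workarounds described above, assuming HA NameNode IDs nn1 and nn2 and a nameservice name as configured in hdfs-site.xml (all placeholders - substitute your own values):
$ hdfs getconf -confKey dfs.namenode.rpc-address.<nameservice>.nn1   # should return a hostname, not an IP address
$ hdfs getconf -confKey dfs.namenode.rpc-address.<nameservice>.nn2
$ hdfs haadmin -getServiceState nn1                                  # confirm one NameNode reports active and the other standby
$ hdfs haadmin -failover nn1 nn2                                     # run as the hdfs user; repeat in both directions while Ambari retries the restart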
11-17-2016
03:31 PM
Thank you
11-17-2016
12:45 AM
6 Kudos
ISSUE:
1. While performing an HDP upgrade from 2.4.2 to 2.5.0, the upgrade failed on the last step, Finalize Upgrade, saying a few hosts were not able to upgrade to the latest version.
2. Tried to revert the upgrade and proceeded with the Downgrade option.
3. During the downgrade it prompted for Atlas and Kafka to be deleted from the cluster.
4. Deleted Atlas and Kafka from the cluster.
5. Proceeding further, stopping the services failed on stopping KAFKA.
6. The downgrade screen was paused, and the Ambari UI details tab showed "Failed to start KAFKA_BROKER".
ROOT CAUSE:
There were two tasks from the downgrade stuck in the PENDING state.
RESOLUTION:
1. Check at which step the upgrade/downgrade is stuck by calling the Ambari API (see the curl sketch after these steps) -
http://<ambari_host>:8080/api/v1/clusters/<clustername>/upgrades/
2. Pick the latest "request_id" from the above output and call -
http://<ambari_host>:8080/api/v1/clusters/<clustername>/upgrades/<request_id>
In my case the request_id was 858.
3. Log in to the Ambari database and use the query below, with the "request_id", to find tasks in the host_role_command table that are not in COMPLETED status, as shown below -
ambari=> SELECT task_id, status, event, host_id, role, role_command, command_detail, custom_command_name FROM host_role_command WHERE request_id = 858 AND status != 'COMPLETED' ORDER BY task_id DESC;
8964, PENDING, 4, KAFKA_BROKER, CUSTOM_COMMAND, RESTART KAFKA/KAFKA_BROKER, RESTART
8897, PENDING, 4, KAFKA_BROKER, CUSTOM_COMMAND, STOP KAFKA/KAFKA_BROKER, STOP
4. Update the status of the above tasks to COMPLETED using the command below -
UPDATE host_role_command SET status = 'COMPLETED' WHERE request_id = 858 AND status = 'PENDING';
After this, the downgrade was able to proceed.
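A sketch of the API calls from steps 1 and 2, assuming Ambari admin credentials and the default port 8080 (replace admin:admin, the host, the cluster name, and the request id with your own):
$ curl -u admin:admin http://<ambari_host>:8080/api/v1/clusters/<clustername>/upgrades/
$ curl -u admin:admin http://<ambari_host>:8080/api/v1/clusters/<clustername>/upgrades/858
The second call returns the details of that upgrade request, which is where the stuck PENDING tasks show up before you touch the database.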
11-16-2016
11:33 AM
7 Kudos
ISSUE: The Hive View is not working. ERROR: H100 Unable to submit statement show databases like '*': org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
ROOT CAUSE: The MySQL connection limit (max_connections) was being exceeded. Check it using - mysql> SHOW VARIABLES LIKE "max_connections";
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 100 |
+-----------------+-------+
1 row in set (0.00 sec)
RESOLUTION: Increased the MySQL max_connections limit from 100 to 500 and restarted MySQL, which resolved the issue.
mysql> SET GLOBAL max_connections = 500;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW VARIABLES LIKE "max_connections";
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 500 |
+-----------------+-------+
1 row in set (0.00 sec)
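Note that SET GLOBAL does not survive a MySQL restart. To make the change permanent it can also be set in the server configuration file - a sketch, assuming /etc/my.cnf is the configuration file in use:
[mysqld]
max_connections = 500
Then restart MySQL (for example, service mysqld restart) and re-check with SHOW VARIABLES LIKE "max_connections";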
11-16-2016
11:33 AM
7 Kudos
SYMPTOM: The standby NameNode crashes due to edit log corruption, complaining that OP_CLOSE cannot be applied because the file is not under construction.
ERROR: 2016-09-30T06:23:25.126-0400 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=/appdata/148973_perfengp/TARGET/092016/tempdb.TARGET.092016.hdfs, replication=3, mtime=1475223680193, atime=1472804384143, blockSize=134217728, blocks=[blk_1243879398_198862467], permissions=gsspe:148973_psdbpe:rwxrwxr-x, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=1585682886]
java.io.IOException: File is not under construction: /appdata/148973_perfengp/TARGET/092016/tempdb.TARGET.092016.hdfs
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:436)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:679)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1022)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:741)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:536)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:595)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:762)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:746)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1438)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1504)
ROOT CAUSE: Edit log corruption can happen if an append fails with a quota violation. This is a bug -
https://issues.apache.org/jira/browse/HDFS-7587
https://hortonworks.jira.com/browse/BUG-56811
https://hortonworks.jira.com/browse/EAR-1248
RESOLUTION:
1. Stop everything.
2. Back up the "current" folder of every JournalNode in the cluster.
3. Back up the "current" folder of every NameNode in the cluster.
4. Use the oev command to convert the binary edit log file into XML (see the sketch after these steps).
5. Remove the record corresponding to the TXID mentioned in the error.
6. Use the oev command to convert the XML edit log file back into binary.
7. Restart the active NameNode.
8. I got an error saying there was a gap in the edit logs.
9. Obtain the keytab for the service principal nn/<host>@<REALM>.
10. Execute the command hadoop namenode -recover.
11. Answer "c" when the gap problem occurs.
12. Then I saw other errors similar to the one I encountered at the beginning (the "file not under construction" issue).
13. I had to run the hadoop namenode -recover command twice in order to get rid of these errors.
14. The ZooKeeper servers were already started, so I started the JournalNodes, the DataNodes, the ZKFC controllers and finally the active NameNode.
15. Some DataNodes were identified as dead. After some investigation, I found that the corresponding information in ZooKeeper was empty, so I restarted the ZooKeeper servers, and after that the active NameNode was there.
16. I started the standby NameNode, but it raised the same errors concerning the gap in the edit logs.
17. As the hdfs user, I executed on the standby NameNode the command hadoop namenode -bootstrapStandby -force.
18. The new FSImage was good and identical to the one on the active NameNode.
19. I started the standby NameNode successfully.
20. I launched the rest of the cluster.
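A minimal sketch of the edit log conversion in steps 4-6, using an example edits segment name (the directory path and the transaction-id range in the file name are assumptions - use the segment from your own JournalNode/NameNode "current" directory that contains the failing TXID):
# cd /hadoop/hdfs/journal/<nameservice>/current            # or your dfs.journalnode.edits.dir / NameNode current directory
# hdfs oev -p xml -i edits_0000000001585682880-0000000001585682890 -o edits.xml
# (edit edits.xml and delete the <RECORD> whose <TXID> matches the one reported in the error)
# hdfs oev -p binary -i edits.xml -o edits_0000000001585682880-0000000001585682890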
Also check the recovery option described in this link - Namenode-Recovery