Member since: 02-08-2016
Posts: 793
Kudos Received: 669
Solutions: 85
11-18-2016
09:57 AM
6 Kudos
PROBLEM STATEMENT: There are frequent email alerts from the HiveServer2 Metastore process which are misleading. When checked, the HiveServer2 Metastore process is up and running fine without any issue.
ERROR: Below is a sample email alert -
The Ambari alerts log output (/var/log/ambari-server/ambari-alerts.log) shows -
ROOT CAUSE: The issue was with the Kerberos credentials for the ambari-qa user.
RESOLUTION: The HiveServer2 Metastore alert is triggered from the node on which the service is installed. The "ambari-qa" user is used by Ambari to trigger the alert. If you are using a Kerberized cluster, make sure the "ambari-qa" user has a valid ticket in place, otherwise false alerts will be triggered. You can use the commands below to check -
$ su - ambari-qa
$ klist <-- [make sure you have a valid ticket]
$ hive --hiveconf hive.metastore.uris=thrift://<hive_metastore_host>:9083 --hiveconf hive.metastore.client.connect.retry.delay=1 --hiveconf hive.metastore.failure.retries=1 --hiveconf hive.metastore.connect.retries=1 --hiveconf hive.metastore.client.socket.timeout=14 --hiveconf hive.execution.engine=mr -e 'show databases;'
Also check - sometimes you might be hitting https://issues.apache.org/jira/browse/AMBARI-14424
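If klist shows no valid ticket for ambari-qa, one can usually be obtained from the smoke-user keytab. A minimal sketch, assuming the default HDP keytab location and a typical principal name (both are assumptions - adjust to your cluster):
$ su - ambari-qa
$ klist -kt /etc/security/keytabs/smokeuser.headless.keytab     # list the principals in the keytab (path is an assumption)
$ kinit -kt /etc/security/keytabs/smokeuser.headless.keytab ambari-qa@EXAMPLE.COM   # use the principal/realm shown by klist -kt
$ klist                                                         # confirm a valid ticket is now present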
11-18-2016
09:57 AM
6 Kudos
PROBLEM STATEMENT: I would like to change the default password for rangertagsync. According to the manual https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_command-line-installation/content/ch13s05s04.html, I should just run python updatetagadminpassword.py after changing the password in Ranger. However, I get the error shown below -
ERROR:
[root@test ranger-tagsync]# pwd
/usr/hdp/current/ranger-tagsync
[root@test ranger-tagsync]# python updatetagadminpassword.py
2016-11-18 09:30:08,311 [E] Required file not found: [/usr/hdp/2.5.0.0-1245/ranger-tagsync/conf/ranger-tagsync-site.xml]
[root@test ranger-tagsync]#
Debugging Steps:
1. From the above log it is clear that "ranger-tagsync-site.xml" is missing. But when checking with "rpm -qa | grep tagsync", I see the rpm was installed -
[root@thakur2 ranger-tagsync]# rpm -qa |grep tagsync
ranger_2_5_0_0_1245-tagsync-0.6.0.2.5.0.0-1245.el6.x86_64
2. Checked "rpm -ql" on the package to make sure "ranger-tagsync-site.xml" comes from the same package, but "ranger-tagsync-site.xml" was not listed -
[root@thakur2 ranger-tagsync]# rpm -ql ranger_2_5_0_0_1245-tagsync-0.6.0.2.5.0.0-1245.el6.x86_64 |grep xml
/usr/hdp/2.5.0.0-1245/etc/ranger/tagsync/conf.dist/log4j.xml
/usr/hdp/2.5.0.0-1245/ranger-tagsync/conf.dist/log4j.xml
/usr/hdp/2.5.0.0-1245/ranger-tagsync/templates/installprop2xml.properties
/usr/hdp/2.5.0.0-1245/ranger-tagsync/templates/ranger-tagsync-template.xml
3. Tried removing ranger-tagsync on a test cluster and re-installing it. From the Ambari operations logs I see that during installation of Ranger Tagsync the file is auto-generated and is not owned by any package. Check the output below -
RESOLUTION: Reinstalled "ranger_2_5_0_0_1245-tagsync", which resolved the issue. I was able to execute the password change script after reinstalling ranger-tagsync -
[root@test ranger-tagsync]# python updatetagadminpassword.py
2016-11-18 09:48:22,774 [I] Using Java:/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java
getting values from file : /usr/hdp/2.5.0.0-1245/ranger-tagsync/conf/ranger-tagsync-site.xml
Enter Destination NAME (Ranger/Atlas):
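A quick way to confirm the state after the reinstall, before re-running the script (a sketch; the paths follow the output above, and "is not owned by any package" from rpm -qf is expected because Ambari generates the file at install time rather than the rpm):
[root@test ranger-tagsync]# ls -l /usr/hdp/current/ranger-tagsync/conf/ranger-tagsync-site.xml   # the file should now exist
[root@test ranger-tagsync]# rpm -qf /usr/hdp/2.5.0.0-1245/ranger-tagsync/conf/ranger-tagsync-site.xml
file /usr/hdp/2.5.0.0-1245/ranger-tagsync/conf/ranger-tagsync-site.xml is not owned by any package
[root@test ranger-tagsync]# python updatetagadminpassword.py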
11-18-2016
09:57 AM
6 Kudos
PROBLEM STATEMENT: "hdfs dfsadmin -report" shows "Blocks with corrupt replicas : 20" while
"hdfs fsck /" shows "Corrupt blocks : 0". However, "Corrupted blocks" shows as 20 in Ambari, and the summary of the HDFS service shows "Block 20 corrupt / 0 missing / ..."
ERROR: Please find the screenshots below - Ambari output, fsck output, dfsadmin output.
ROOT CAUSE: Ambari labels the value "Corrupted Blocks", but what it actually displays is the count of corrupted replicas.
RESOLUTION: This is a bug - https://hortonworks.jira.com/browse/BUG-41958 - which is fixed in Ambari version 2.2.2.
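To confirm the mismatch from the command line, the two counters can be compared directly (a sketch; both are standard HDFS commands):
# hdfs dfsadmin -report | grep -i corrupt        # shows "Blocks with corrupt replicas" (some replicas bad, good copies still exist)
# hdfs fsck / | grep -i "Corrupt blocks"         # shows truly corrupt blocks, i.e. blocks with no healthy replica left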
11-18-2016
09:57 AM
6 Kudos
PROBLEM STATEMENT: After running python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py on the Ambari Metrics server,
every service comes back up except the ambari-metrics-collector.
The process is running, but two alerts are left:
Metrics Collector Process - Connection failed: [Errno 111] Connection refused to XXXXX:6188
Metrics Collector - HBase Master Process - Connection failed: [Errno 111] Connection refused to XXXXXX:61310
ERROR: 0x15741734f740003, negotiated timeout = 120000
07:57:42,262 INFO [main] ZooKeeperRegistry:107 - ClusterId read in ZooKeeper is null
07:57:42,341 WARN [main] HeapMemorySizeUtil:55 - hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
07:58:13,170 INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740001, likely server has closed socket, closing socket connection and attempting reconnect
07:58:13,170 INFO [main-SendThread(localhost:61181)] ClientCnxn:1142 - Unable to read additional data from server sessionid 0x15741734f740003, likely server has closed socket, closing socket connection and attempting reconnect
07:58:14,381 INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,382 WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
07:58:14,961 INFO [main-SendThread(localhost:61181)] ClientCnxn:1019 - Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
07:58:14,961 WARN [main-SendThread(localhost:61181)] ClientCnxn:1146 - Session 0x15741734f740003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
SYMPTOM:
Java garbage collection (GC) pauses occur frequently for the AMS HBase master process in the same time frame in which these errors are observed.
To verify this:
1. Review /var/log/ambari-metrics-collector/hbase-ams-master-<hostname>.log
2. Check whether messages like the following are printed often and XXXms is larger than a few hundred ms:
"[JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately XXXms"
3. Also review gc.log-YYYYMMDDHHmm to find out which Java memory area is causing the slowness. The default location is /var/log/ambari-metrics-collector/
RESOLUTION: Increase the AMS HBase heap sizes as follows (see the sketch after these steps):
1. Identify the current heap size by checking the Java process settings (-Xmx, -Xmn, -XX:MaxPermSize) by running:
ps auxwww | grep 'org.apache.hadoop.hbase.master.HMaster start'
2. Check the free memory on the system by running:
free -t
3. If the server has enough free memory, increase hbase_master_heapsize and/or the following, based on the GC type identified from gc.log: hbase_master_maxperm_size and/or hbase_master_xmn_size
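A minimal sketch of the verification commands, assuming the default AMS log locations mentioned above (the log file name and paths are assumptions - adjust the hostname and log names to your installation):
# grep -c "Detected pause in JVM" /var/log/ambari-metrics-collector/hbase-ams-master-`hostname -f`.log   # how often JVM pauses are being reported
# ps auxwww | grep 'org.apache.hadoop.hbase.master.HMaster start' | grep -o -- '-Xm[xn][^ ]*'            # current -Xmx / -Xmn settings of the AMS HBase master
# free -t                                                                                                # free memory available before raising the heap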
11-17-2016
02:54 PM
6 Kudos
Problem Statement: Created a user in Ranger. After some time the user is not reflected in the Ranger UI, but the user is present in the Ranger DB table x_user, and in the usersync logs we see the user being synchronized all the time. The user was an LDAP user and there was no issue with the other users. ERROR: Below is a snapshot of the issue - "testuser" is not displayed in the Ranger UI but is reflected in the Ranger DB, as shown below -
ROOT CAUSE: It appears the database entries for this particular user were corrupted. RESOLUTION: Inserted the row below into the table "x_portal_user_role", after which the issue was resolved.
INSERT INTO x_portal_user_role VALUES(NULL,'2016-09-09 00:00:00','2016-09-09 00:00:00',1,1,(SELECT id FROM x_portal_user WHERE login_id='XXXX'),'ROLE_USER',1);
### NOTE: Replace XXXX with the login_id (the username used to log in to the Ranger portal) of the user ('XXXX')
You can replace 'ROLE_USER' with 'ROLE_SYS_ADMIN' if you want the user to be an admin.
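Before inserting, it is worth confirming that the role row really is missing for the user. A sketch, assuming a MySQL-backed Ranger database and the "testuser" login from the example above; the column names are inferred from the INSERT statement and may differ between Ranger versions:
mysql> SELECT u.id, u.login_id, r.user_role
    ->   FROM x_portal_user u
    ->   LEFT JOIN x_portal_user_role r ON r.user_id = u.id
    ->  WHERE u.login_id = 'testuser';
A NULL user_role in the result indicates the missing x_portal_user_role row that the INSERT above adds.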
11-17-2016
02:39 PM
7 Kudos
Problem Statement: An HDP downgrade failed on restarting the NameNode service and was stuck on the error below - resource_management.core.exceptions.Fail: The NameNode None is not listed as Active or Standby, waiting... ERROR: ===
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 420, in <module>
NameNode().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
method(env)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 720, in restart
self.start(env, upgrade_type=upgrade_type)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
upgrade_suspended=params.upgrade_suspended, env=env)
File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
return fn(*args, **kwargs)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 185, in namenode
if is_this_namenode_active() is False:
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py", line 55, in wrapper
return function(*args, **kwargs)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 555, in is_this_namenode_active
raise Fail(format("The NameNode
{namenode_id}is not listed as Active or Standby, waiting..."))
resource_management.core.exceptions.Fail: The NameNode None is not listed as Active or Standby, waiting...
===
Root Cause: The NameNode restart was not able to populate namenode_id, which is required to detect whether the NameNode is active or standby. Resolution: 1. dfs.namenode.rpc-address.<nameservice>.<nn-id> was set to an IP address instead of a hostname, and hence namenode_id was set to None. (It would help if the script could also match on IP addresses, in addition to hostnames, when retrieving namenode_id.) Below is the relevant code -
# Values for the current Host
namenode_id = None
namenode_rpc = None
dfs_ha_namemodes_ids_list = []
other_namenode_id = None
if dfs_ha_namenode_ids:
dfs_ha_namemodes_ids_list = dfs_ha_namenode_ids.split(",")
dfs_ha_namenode_ids_array_len = len(dfs_ha_namemodes_ids_list)
if dfs_ha_namenode_ids_array_len > 1:
dfs_ha_enabled = True
if dfs_ha_enabled:
for nn_id in dfs_ha_namemodes_ids_list:
nn_host = config['configurations']['hdfs-site'][format('dfs.namenode.rpc-address.{dfs_ha_nameservices}.{nn_id}')]
if hostname in nn_host:
namenode_id = nn_id
namenode_rpc = nn_host
2. Also tried continuously shuffling the NameNode failover using "hdfs haadmin -failover", which worked around the HDFS issue and allowed the downgrade to proceed further. [You need to keep failing over the NameNodes with "hdfs haadmin -failover" from nn1 to nn2 and vice versa while the Ambari restart process is ongoing, so that the status of the NameNode can be detected. See the sketch below.]
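A sketch of the two checks/workarounds described above, assuming HA NameNode IDs nn1 and nn2 and a nameservice name as configured in hdfs-site.xml (all placeholders - substitute your own values):
$ hdfs getconf -confKey dfs.namenode.rpc-address.<nameservice>.nn1   # should return a hostname, not an IP address
$ hdfs getconf -confKey dfs.namenode.rpc-address.<nameservice>.nn2
$ hdfs haadmin -getServiceState nn1                                  # confirm one NameNode reports active and the other standby
$ hdfs haadmin -failover nn1 nn2                                     # run as the hdfs user; repeat in both directions while Ambari retries the restart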
11-17-2016
03:31 PM
Thank you
11-17-2016
12:45 AM
6 Kudos
ISSUE:
1. While performing an HDP upgrade from 2.4.2 to 2.5.0, the upgrade failed on the last step, Finalize Upgrade, saying a few hosts were not able to upgrade to the latest version.
2. Tried to revert the upgrade and proceeded with the Downgrade option.
3. During the downgrade it prompted for Atlas and Kafka to be deleted from the cluster.
4. Deleted Atlas and Kafka from the cluster.
5. Proceeding further, stopping the services failed on stopping KAFKA.
6. The downgrade screen was paused, and the Ambari UI details tab showed "Failed to start KAFKA_BROKER".
ROOT CAUSE:
There were two tasks from the downgrade stuck in the PENDING state.
RESOLUTION:
1. Check at which step the upgrade/downgrade is stuck by calling the Ambari API (see the curl sketch after these steps) -
http://<ambari_host>:8080/api/v1/clusters/<clustername>/upgrades/
2. Pick the latest "request_id" from the above output and call -
http://<ambari_host>:8080/api/v1/clusters/<clustername>/upgrades/<request_id>
In my case the request_id was 858.
3. Log in to the Ambari database and use the query below, with the "request_id", to find tasks in the host_role_command table that are not in COMPLETED status, as shown below -
ambari=> SELECT task_id, status, event, host_id, role, role_command, command_detail, custom_command_name FROM host_role_command WHERE request_id = 858 AND status != 'COMPLETED' ORDER BY task_id DESC;
8964, PENDING, 4, KAFKA_BROKER, CUSTOM_COMMAND, RESTART KAFKA/KAFKA_BROKER, RESTART
8897, PENDING, 4, KAFKA_BROKER, CUSTOM_COMMAND, STOP KAFKA/KAFKA_BROKER, STOP
4. Update the status of the above tasks to COMPLETED using the command below -
UPDATE host_role_command SET status = 'COMPLETED' WHERE request_id = 858 AND status = 'PENDING';
After this, the downgrade was able to proceed.
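A sketch of the API calls from steps 1 and 2, assuming Ambari admin credentials and the default port 8080 (replace admin:admin, the host, the cluster name, and the request id with your own):
$ curl -u admin:admin http://<ambari_host>:8080/api/v1/clusters/<clustername>/upgrades/
$ curl -u admin:admin http://<ambari_host>:8080/api/v1/clusters/<clustername>/upgrades/858
The second call returns the details of that upgrade request, which is where the stuck PENDING tasks show up before you touch the database.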
11-16-2016
11:33 AM
7 Kudos
ISSUE: The Hive View is not working. ERROR: H100 Unable to submit statement show databases like '*': org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
ROOT CAUSE: The MySQL connection limit (max_connections) was being exceeded. Check it using - mysql> SHOW VARIABLES LIKE "max_connections";
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 100 |
+-----------------+-------+
1 row in set (0.00 sec)
RESOLUTION: Increased the MySQL max_connections limit from 100 to 500 and restarted MySQL, which resolved the issue.
mysql> SET GLOBAL max_connections = 500;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW VARIABLES LIKE "max_connections";
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 500 |
+-----------------+-------+
1 row in set (0.00 sec)
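Note that SET GLOBAL does not survive a MySQL restart. To make the change permanent it can also be set in the server configuration file - a sketch, assuming /etc/my.cnf is the configuration file in use:
[mysqld]
max_connections = 500
Then restart MySQL (for example, service mysqld restart) and re-check with SHOW VARIABLES LIKE "max_connections";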
11-16-2016
11:33 AM
7 Kudos
SYMPTOM: The standby NameNode crashes due to edit log corruption, complaining that OP_CLOSE cannot be applied because the file is not under construction.
ERROR: 2016-09-30T06:23:25.126-0400 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=/appdata/148973_perfengp/TARGET/092016/tempdb.TARGET.092016.hdfs, replication=3, mtime=1475223680193, atime=1472804384143, blockSize=134217728, blocks=[blk_1243879398_198862467], permissions=gsspe:148973_psdbpe:rwxrwxr-x, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=1585682886]
java.io.IOException: File is not under construction: /appdata/148973_perfengp/TARGET/092016/tempdb.TARGET.092016.hdfs
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:436)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:679)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1022)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:741)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:536)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:595)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:762)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:746)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1438)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1504)
ROOT CAUSE: Edit log corruption can happen if an append fails with a quota violation. This is a bug -
https://issues.apache.org/jira/browse/HDFS-7587
https://hortonworks.jira.com/browse/BUG-56811
https://hortonworks.jira.com/browse/EAR-1248
RESOLUTION:
1. Stop everything.
2. Back up the "current" folder of every JournalNode in the cluster.
3. Back up the "current" folder of every NameNode in the cluster.
4. Use the oev command to convert the binary edit log file into XML (see the sketch after these steps).
5. Remove the record corresponding to the TXID mentioned in the error.
6. Use the oev command to convert the XML edit log file back into binary.
7. Restart the active NameNode.
8. I got an error saying there was a gap in the edit logs.
9. Obtain the keytab for the service principal nn/<host>@<REALM>.
10. Execute the command hadoop namenode -recover.
11. Answer "c" when the gap problem occurs.
12. Then I saw other errors similar to the one I encountered at the beginning (the "file not under construction" issue).
13. I had to run the hadoop namenode -recover command twice in order to get rid of these errors.
14. The ZooKeeper servers were already started, so I started the JournalNodes, the DataNodes, the ZKFC controllers and finally the active NameNode.
15. Some DataNodes were identified as dead. After some investigation, I found that the corresponding information in ZooKeeper was empty, so I restarted the ZooKeeper servers, and after that the active NameNode was there.
16. I started the standby NameNode, but it raised the same errors concerning the gap in the edit logs.
17. As the hdfs user, I executed on the standby NameNode the command hadoop namenode -bootstrapStandby -force.
18. The new FSImage was good and identical to the one on the active NameNode.
19. I started the standby NameNode successfully.
20. I launched the rest of the cluster.
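A minimal sketch of the edit log conversion in steps 4-6, using an example edits segment name (the directory path and the transaction-id range in the file name are assumptions - use the segment from your own JournalNode/NameNode "current" directory that contains the failing TXID):
# cd /hadoop/hdfs/journal/<nameservice>/current            # or your dfs.journalnode.edits.dir / NameNode current directory
# hdfs oev -p xml -i edits_0000000001585682880-0000000001585682890 -o edits.xml
# (edit edits.xml and delete the <RECORD> whose <TXID> matches the one reported in the error)
# hdfs oev -p binary -i edits.xml -o edits_0000000001585682880-0000000001585682890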
Also check the recovery option described in this link - Namenode-Recovery