Member since: 08-03-2017
Posts: 18
Kudos Received: 0
Solutions: 0
02-04-2022
03:26 PM
@rpathak Thank you for your response! I have tried increasing the memory per LLAP daemon up to 87 GB, but every time the containers are killed for the same reason: the physical memory limit is reached. Do you think I need to increase the memory even more?
02-04-2022
12:30 PM
Hi Everyone,

We have a situation where YARN is killing LLAP application containers and then requesting new ones to be launched. This causes a brief unavailability of the LLAP daemons, and running applications fail because of it. When we reviewed some of the container logs, we saw the following message:

2022-01-26 03:15:48,339 [Component dispatcher] ERROR instance.ComponentInstance - [COMPINSTANCE llap-0 : container_e127_1642817883045_7610_01_000002]: container_e127_1642817883045_7610_01_000002 completed. Reinsert back to pending list and requested a new container. exitStatus=-104, diagnostics=[2022-01-26 03:15:47.314]Container [pid=8434,containerID=container_e127_1642817883045_7610_01_000002] is running 665411584B beyond the 'PHYSICAL' memory limit. Current usage: 75.6 GB of 75 GB physical memory used; 77.6 GB of 157.5 GB virtual memory used. Killing container.

I don't understand where this "75.6 GB of 75 GB" limit is coming from. I have tried increasing the memory per LLAP daemon, but that doesn't help either. Parameters:
1. Memory allocated for all YARN containers on a node = 95 GB
2. LLAP memory per daemon = 75 GB
3. Memory cache per daemon = 20 GB
4. llap_daemon_overhead = 6 GB

The HiveServer2 and Hive interactive server logs don't provide much detail either. What other properties can I fine-tune to fix this? Any help is appreciated.
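For what it's worth, the 75 GB in the kill message matches the "memory per daemon" value above, which suggests the YARN container was sized at the daemon heap alone, with no room for the 20 GB off-heap cache. A rough sizing sketch follows; the property names are standard Hive/YARN keys and the suggested values are illustrative assumptions, not settings taken from this thread:

==========================================
# Required container size when the LLAP cache is off-heap (the default):
#   heap  (hive.llap.daemon.memory.per.instance.mb)  75 GB
# + cache (hive.llap.io.memory.size)                 20 GB
# + headroom (llap_daemon_overhead)                   6 GB
# = 101 GB, which exceeds both the observed 75 GB container limit
#   and the 95 GB available to YARN on the node.
#
# One combination that fits under 95 GB (illustrative values):
hive.llap.daemon.yarn.container.mb      = 96256   # 94 GB container
hive.llap.daemon.memory.per.instance.mb = 69632   # 68 GB heap
hive.llap.io.memory.size                = 20480   # 20 GB cache
# ~6 GB of the container is left as headroom for JVM/native overhead.
==========================================

In other words, raising the daemon memory alone only raises the heap; the container size has to grow by at least the same amount plus the cache, or the kill threshold stays where it is.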
Labels: Apache Hive
01-11-2022
02:58 PM
Hi Everyone,

I need to delete the HBase znode from the ZooKeeper CLI, but I get the following error when I try:

[zk: localhost:2181(CONNECTED) 1] rmr /hbase-secure
Authentication is not valid : /hbase-secure/replication
[zk: localhost:2181(CONNECTED) 2]

This is the ACL set:

[zk: localhost:2181(CONNECTED) 11] getAcl /hbase-secure
'world,'anyone : r
'sasl,'hbase : cdrwa

I have even tried to set the ACL as follows, but it didn't help either:

[zk: localhost:2181(CONNECTED) 7] setAcl /hbase-secure world:anyone:cdrwa
Authentication is not valid : /hbase-secure
[zk: localhost:2181(CONNECTED) 8] rmr /hbase-secure
Authentication is not valid : /hbase-secure/replication

The cluster is Kerberized; what am I missing here? Appreciate all the help!
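The ACL shows that only the SASL identity "hbase" has delete rights, so the CLI session has to authenticate to ZooKeeper as that identity before rmr will succeed. A hedged sketch of one way to do that (the keytab path, realm, and file locations are assumptions for illustration):

==========================================
# /tmp/hbase_jaas.conf -- client JAAS section using the hbase service keytab
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/etc/security/keytabs/hbase.headless.keytab"
  storeKey=true
  useTicketCache=false
  principal="hbase@EXAMPLE.COM";
};

# Point the ZooKeeper CLI at that JAAS file, then retry the delete:
export CLIENT_JVMFLAGS="-Djava.security.auth.login.config=/tmp/hbase_jaas.conf"
/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server localhost:2181
[zk: localhost:2181(CONNECTED) 0] rmr /hbase-secure
==========================================

Once the session is authenticated as the SASL "hbase" identity, the cdrwa grant applies and the recursive delete should go through.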
Labels: Apache HBase
10-18-2021
01:15 PM
Hi Everyone,

I wanted to get your opinion on a backup and DR situation. One of our dev platforms is still on CDH 5.4 (already past end of support, so no longer under technical support scope), and all 34 servers for this platform are going to be physically migrated from one data center to another. The servers are placed in three separate racks, so given HDFS's built-in replication and rack awareness, I believe that as long as we lose only a few servers, or even a single rack, we should not face any data loss.

Now, considering the worst-case scenario where multiple racks or all servers are damaged during migration, I need a backup and DR solution in place. One limitation: we don't have the architectural capacity to spin up a separate cluster to copy the existing HDFS data to for recovery purposes. Currently we are thinking of backing up all servers using the Avamar backup tool (all OS data and data disks). I am also going to take a NameNode metadata backup and backend database backups for the applicable services separately.

My confusion is about HDFS data recovery. Let's say that for all DataNodes we back up all DataNode directories (all hard disks), something catastrophic happens during migration, and we have to rebuild the cluster. Given that we have a backup of the NameNode metadata and all DataNode directories (the block-pool structure on each DataNode), we should be able to completely recover the HDFS data. Is this understanding correct, or am I missing some technical details here? Also, what other options can we use to plan DR around the HDFS data?
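On the NameNode metadata piece, a hedged sketch of commands that force a fresh checkpoint and pull a copy of the fsimage off the NameNode (the backup path is an assumption; run as the HDFS superuser):

==========================================
# Checkpoint so the on-disk fsimage reflects the current namespace
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave

# Download the most recent fsimage from the NameNode to a safe location
hdfs dfsadmin -fetchImage /backup/nn-metadata/
==========================================

One caveat on the understanding above: restoring raw DataNode directories only reassembles into a working filesystem if the restored NameNode metadata and the DataNode block pools come from the same point in time (matching namespace ID, block pool ID, and layout version), so the metadata and disk backups should be taken together.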
Labels: HDFS
07-28-2021
02:29 PM
Hello Everyone - We have a Hive job that performs a MERGE operation, and it encountered the error below:
org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.io.IOException: java.lang.AssertionError: Invalid decRef when refCount is 0: 0xc7917fe(0)
I have gone through the JIRA below; for context, we are on HDP 3.1.0:
https://issues.apache.org/jira/browse/HIVE-17411
Any pointers are appreciated.
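The "Invalid decRef when refCount is 0" assertion comes from the LLAP IO cache's buffer reference counting. As a hedged isolation test (my assumption, not something suggested in this thread), re-running the job with LLAP IO disabled would show whether the MERGE succeeds without the cache layer; the JDBC URL and script name below are placeholders:

==========================================
# Re-run the MERGE with the LLAP IO cache disabled for this session only
beeline -u "$HIVE_JDBC_URL" \
    --hiveconf hive.llap.io.enabled=false \
    -f merge_job.sql
==========================================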
03-23-2021
11:30 AM
@smdas Hi! So clearing out the root directory for HBase in ZooKeeper and using the backup of just the "/apps/hbase/data/data" directory is not going to bring back the required namespaces and tables. So for data recovery using HFiles, the following steps need to be done, correct?
1. Start HBase on a new data directory.
2. Create the required namespace and table structure in HBase, then copy the HFiles from the backup into the respective locations for all tables.
3. Is just copying the HFiles enough, or do we then need to run the "completebulkload" utility for all tables on the copied HFiles? (See the sketch below.)
Problem: I suspect that with this approach we would still require an offline meta repair ("hbase hbck -repair"), which is not available in the HDP version we have. Please let me know your thoughts.
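On point 3: HFiles copied straight into region directories are generally not registered with the table, so a bulk load is the usual route. A hedged sketch (the table name and staging path are placeholders; the class moved between releases, so check your version):

==========================================
# HBase 2.x (HDP 3.x); on HBase 1.x the class is
# org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles instead.
# The staging directory must contain one sub-directory per column family.
hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles \
    /staging/hfiles/mytable mytable
==========================================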
03-19-2021
01:35 PM
@smdas Appreciate your response! One last thing: do these steps look correct to you for the recovery process? (A command-level sketch of steps 1, 3, and 6 follows.)
1. Take a backup of the HBase data directory residing in HDFS: "/apps/hbase/data".
2. Stop the HBase service.
3. Connect with the ZooKeeper client and delete the HBase root znode: hbase zkcli delete /hbase-secure
4. Start the HBase service.
5. Once the service is online, stop just the HBase Masters.
6. Copy "/apps/hbase/data/data" from the backup to the current "/apps/hbase/data/data" HDFS location.
7. Start the HBase Masters.
8. Verify that all the namespaces and tables that existed earlier are present.
Thank you so much for all the help!
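A hedged command-level sketch of steps 1, 3, and 6 (the backup location is an assumption, and I use "rmr" rather than "delete" in step 3 on the assumption that /hbase-secure has child znodes):

==========================================
# Step 1: back up the HBase root directory in HDFS
hdfs dfs -cp /apps/hbase/data /backup/hbase-data-backup

# Step 3: remove the HBase znode recursively (HBase stopped)
hbase zkcli rmr /hbase-secure

# Step 6: replace the freshly initialized data directory with the backup
# (Masters stopped)
hdfs dfs -rm -r /apps/hbase/data/data
hdfs dfs -cp /backup/hbase-data-backup/data /apps/hbase/data/
==========================================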
03-19-2021
07:45 AM
@smdas Thank you for your response! It is a production cluster, and that's why we don't want to re-initialize HBase. Is there any other way to recover from this? Also, are the modified HBase client and HBase server jars available for download?
03-18-2021
10:10 AM
I have a situation where my namespace system table is not online, and because of that I'm seeing these messages in the HBase Master log:

2021-03-17 20:29:54,614 WARN [Thread-18] master.HMaster: hbase:namespace,,1575575842296.0c72d4be7e562a2ec8a86c3ec830bdc5. is NOT online; state={0c72d4be7e562a2ec8a86c3ec830bdc5 state=OPEN, ts=1616010947554, server=itk-phx-prod-compute-6.datalake.phx,16020,1615483461273}; ServerCrashProcedures=false. Master startup cannot progress, in holding-pattern until region onlined.

I came across this article for fixing the problem:
https://docs.cloudera.com/runtime/7.2.7/troubleshooting-hbase/topics/hbase_running_hbck2.html

But while following the article and running the suggested command, I run into a "Failed to specify server's Kerberos principal name" error. I need clarification on the following two points:
1. Do we need any specific format to run the hbck2 utility if the cluster is Kerberized, i.e., does the principal need to be passed as an external parameter? I even tried passing the HBase configuration with the --config option, which wasn't accepted.
2. Has anyone else faced a similar issue with the HBase system table and fixed it using a different approach?

==========================================
[root@itk-phx-prod-edge-1 ~]# kinit -kt /etc/security/keytabs/hbase.headless.keytab hbase
[root@itk-phx-prod-edge-1 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hbase@PROD.DATALAKE.PHX
Valid starting       Expires              Service principal
03/18/2021 16:45:53  03/19/2021 16:45:53  krbtgt/PROD.DATALAKE.PHX@PROD.DATALAKE.PHX
===========================================
[root@itk-phx-prod-edge-1 target]# hbase hbck -j hbase-hbck2-1.2.0-SNAPSHOT.jar -s assigns hbase:namespace 1575575842296.0c72d4be7e562a2ec8a86c3ec830bdc5
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/hbase-hbck2/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.2.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/phoenix/phoenix-5.0.0.3.1.0.0-78-server.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
16:47:07.894 [main] INFO org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient - Connect 0x560348e6 to itk-phx-prod-zk-1.datalake.phx:2181,itk-phx-prod-zk-2.datalake.phx:2181,itk-phx-prod-zk-3.datalake.phx:2181 with session timeout=90000ms, retries 6, retry interval 1000ms, keepAlive=60000ms
16:47:07.962 [ReadOnlyZKClient-itk-phx-prod-zk-1.datalake.phx:2181,itk-phx-prod-zk-2.datalake.phx:2181,itk-phx-prod-zk-3.datalake.phx:2181@0x560348e6-SendThread(itk-phx-prod-zk-2.datalake.phx:2181)] WARN org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: Zookeeper client cannot authenticate using the Client section of the supplied JAAS configuration: '/usr/hdp/current/hbase-client/conf/hbase_regionserver_jaas.conf' because of a RuntimeException: java.lang.SecurityException: java.io.IOException: /usr/hdp/current/hbase-client/conf/hbase_regionserver_jaas.conf (No such file or directory) Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
16:47:08.253 [main] INFO org.apache.hbase.HBCK2 - Skipped assigns command version check; 'skip' set
16:47:08.838 [main] INFO org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient - Close zookeeper connection 0x560348e6 to itk-phx-prod-zk-1.datalake.phx:2181,itk-phx-prod-zk-2.datalake.phx:2181,itk-phx-prod-zk-3.datalake.phx:2181
Exception in thread "main" java.io.IOException: org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: java.io.IOException: Call to itk-phx-prod-master-2.datalake.phx/192.168.15.180:16000 failed on local exception: java.io.IOException: Failed to specify server's Kerberos principal name
    at org.apache.hadoop.hbase.client.HBaseHbck.assigns(HBaseHbck.java:111)
    at org.apache.hbase.HBCK2.assigns(HBCK2.java:308)
    at org.apache.hbase.HBCK2.doCommandLine(HBCK2.java:819)
    at org.apache.hbase.HBCK2.run(HBCK2.java:777)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
    at org.apache.hbase.HBCK2.main(HBCK2.java:1067)
Caused by: org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: java.io.IOException: Call to itk-phx-prod-master-2.datalake.phx/192.168.15.180:16000 failed on local exception: java.io.IOException: Failed to specify server's Kerberos principal name
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:336)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:95)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:571)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$BlockingStub.assigns(MasterProtos.java)
    at org.apache.hadoop.hbase.client.HBaseHbck.assigns(HBaseHbck.java:106)
    ... 6 more
Caused by: java.io.IOException: Call to itk-phx-prod-master-2.datalake.phx/192.168.15.180:16000 failed on local exception: java.io.IOException: Failed to specify server's Kerberos principal name
    at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)
==========================================

I can attach the complete HBase Master log as well if that helps.
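The "Failed to specify server's Kerberos principal name" error usually means the client never loaded an hbase-site.xml defining hbase.master.kerberos.principal, so it cannot name the Master's service principal. A hedged sketch of how I would retry it (paths are assumptions; note that --config belongs to the hbase launcher script, before the hbck subcommand, and that HBCK2's assigns expects just the encoded region name):

==========================================
# Use a client JAAS file in place of the missing regionserver one
# (quiets the SASL warning; file contents as in a standard Client section)
export HBASE_OPTS="-Djava.security.auth.login.config=/tmp/hbase_client_jaas.conf"

# Let the launcher load the cluster config, which carries
# hbase.master.kerberos.principal (e.g. hbase/_HOST@PROD.DATALAKE.PHX)
hbase --config /etc/hbase/conf hbck -j hbase-hbck2-1.2.0-SNAPSHOT.jar \
    -s assigns 0c72d4be7e562a2ec8a86c3ec830bdc5
==========================================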
Labels: Apache HBase
03-18-2021
10:02 AM
@smdas Thank you for your response. One more question regarding running the HBCK2 utility in a Kerberized environment: I am getting the "Failed to specify server's Kerberos principal name" error even though I'm authenticated as the hbase principal. Could you please let me know if the principal needs to be passed as an external parameter? I even tried passing the HBase configuration with the --config option, which wasn't accepted. The kinit/klist output and the full hbck2 stack trace are identical to those in my 10:10 AM post above. Really appreciate any insight into this.