Member since: 07-17-2019
Posts: 738
Kudos Received: 433
Solutions: 111
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 4264 | 08-06-2019 07:09 PM |
|  | 4222 | 07-19-2019 01:57 PM |
|  | 6083 | 02-25-2019 04:47 PM |
05-04-2023
02:43 AM
Hi Team, I have added the HBase client jar and the phoenix-core jar (which bundles the HBase client jar within it). I have the hbase-site and core-site XMLs, and I get the HConnection established message. But then I get:

"MESSAGE":"Reading reply sessionid:0x200239e26e51663, packet:: clientPath:/hbase serverPath:/hbase finished:false header:: 16,8 replyHeader:: 16,313534022349,0 request:: '/hbase,F response:: v{'meta-region-server,'rs,'splitWAL,'backup-masters,'flush-table-proc,'master-maintenance,'online-snapshot,'switch,'master,'running,'tokenauth,'draining,'namespace,'hbaseid,'table

What could be the issue?
06-01-2017
05:00 PM
20 Kudos
I was recently involved in quite possibly the worst HBase performance debugging issue of my lifetime so far. The issue first arose with a generic problem statement: after X hours of processing, tasks accessing HBase begin to take over 10 times longer than before. Upon restarting HBase, performance returned to expected levels. There were no obvious errors in the HBase logs, HDFS logs, or the hosts' syslog. The problem manifested on a near-constant period: every X hours after a restart. It affected different types of client tasks (both readers and writers), and was not limited to a specific node or set of nodes. Strangely, despite all inspection of HBase logs and profiling information, HBase seemed to be functioning perfectly fine. Just, slower.

This led us to investigate numerous operating system configuration changes and monitoring approaches, none of which completely explained the circumstances and symptoms of the problem. After many long days of investigation and JVM-option experimentation, we stumbled onto the first answer which satisfied (or, at least, didn't invalidate) the circumstances: a known, unfixed bug in Java 7 in which JIT code compilation is disabled after the JIT's code cache executes a flush to reclaim space. https://bugs.openjdk.java.net/browse/JDK-8051955

The JIT (just-in-time) compiler runs behind the scenes in Java, compiling Java bytecode into native machine code. Code compilation is a tool designed to help long-lived Java applications run fast without negatively affecting the start-up time of short-lived applications. After methods are invoked, they are compiled from Java bytecode into machine code and cached by the JVM. Subsequent invocations of a cached method can directly invoke the machine code instead of re-interpreting the Java bytecode.

Analysis:
On a 64-bit JVM with Java 7, this cache has a default size of 50MB, which is sufficient for most applications. Methods which are not used frequently are evicted from the cache; this helps keep the JVM from quickly reaching the limit. With sufficient time, however, the cache can still become full, which triggers a temporary halt of JIT compilation and caching while the cache is flushed. In Java 7, there is an unresolved issue in which JIT compilation is not re-enabled after the code cache is flushed. While the process continues to run, no machine code will be cached, which means code keeps running as interpreted bytecode instead of as cached machine code. We were able to confirm that this is what was happening by enabling two JVM options for the HBase services in hbase-env.sh:
-XX:+PrintCompilation
-XX:+PrintSafepointStatistics
The first option prints a log message for every compilation, every method marked as "not entrant" (the method is a candidate for removal from the cache), and every method marked as "zombie" (removed from the cache). This is helpful in determining when JIT compilation is happening. The second option prints debugging information about JVM safepoints which are taken. A JVM safepoint can be thought of as a low-level "lock": the safepoint is taken to provide mutual exclusion at the JVM level. A common use for enabling this option is to analyze the frequency and time taken by garbage collection operations; for example, the concurrent-mark-and-sweep (CMS) collector takes safepoints at various points in its execution. When the code cache becomes full and a flushing event occurs, a safepoint named "HandleFullCodeCache" is taken.

The combination of these two options can show that a Java process performs JIT compilation up until the point that the "HandleFullCodeCache" safepoint is executed, with no further JIT compilation happening after that point. In our case, the point at which JIT compilation stopped was within one hour of when the tasks reportedly began to see performance issues.

We did not observe the following log message, which was meant to make this obtuse issue more obvious. We missed it because we were working remotely on a decent-sized installation, which made it infeasible to collect and analyze all of the logs:

Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.

Solution:

There are two solutions to this problem: one short-term and one long-term. The short-term solution is to increase the size of the JVM code cache from the default of 50MB on 64-bit JVMs. This can be accomplished via the -XX:ReservedCodeCacheSize JVM option. Increasing this to a larger value can prevent the code cache from ever becoming completely full.

export HBASE_SERVER_OPTS="$HBASE_SERVER_OPTS -XX:ReservedCodeCacheSize=256m"

On HDP releases <=2.6, it is necessary to set the HBASE_REGIONSERVER_OPTS variable explicitly instead.

export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:ReservedCodeCacheSize=256m"

The implication of this configuration is additional memory used by the JVM process (the code cache lives outside the Java heap), but this is typically quite minor: hundreds of MB when heaps are typically measured in GB.

The long-term solution is to upgrade to Java 8. Java 7 is long past end-of-life from Oracle, and this is a prime example of a known issue which was never patched in Java 7. It is strongly recommended that any user still on Java 7 have a plan to move to Java 8 as soon as possible. No other changes are required on Java 8, as it is not subject to this bug.
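For confirming the diagnosis on a live process, here is a minimal sketch. The first line enables the two logging options above for the RegionServers via hbase-env.sh (mirroring the export style used earlier); the remaining lines assume a JDK with jps and jstat on the PATH, with <pid> standing in for the RegionServer's process id:

export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:+PrintCompilation -XX:+PrintSafepointStatistics"

$ jps | grep HRegionServer     # find the RegionServer pid
$ jstat -compiler <pid> 5s     # sample JIT compiler counters every 5 seconds

If the "Compiled" column stops increasing while the process remains under load, the JIT has likely stopped compiling, matching the behavior described above.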
04-05-2017
04:27 PM
4 Kudos
The Phoenix Query Server is an HTTP server which expects very specific request data. Sometimes, in the process of connecting different clients, the various configuration options on both client and server can create confusion about what data is actually being sent over the wire. This confusion leads to questions like "did my configuration property take effect?" and "is my client operating as I expect?"

Linux systems often have a number of tools available for analyzing network traffic on a node. We can use one of these tools, ngrep, to analyze the traffic flowing into the Phoenix Query Server. From a host running the Phoenix Query Server, the following command will dump all traffic from any source to the Phoenix Query Server:

$ sudo ngrep -t -d any port 8765

The above command listens to all incoming network traffic on the current host and filters out any traffic which is not to port 8765 (the default port for the Phoenix Query Server). A specific network interface (e.g. eth0) can be provided instead of "any" to further filter traffic. When connecting a client to the server, you should be able to see the actual HTTP requests and responses sent between client and server.

T 2017/04/05 12:49:07.041213 127.0.0.1:60533 -> 127.0.0.1:8765 [AP]
POST / HTTP/1.1..Content-Length: 137..Content-Type: application/octet-stream..Host: localhost:8765..Connection: Keep-Alive..User-Agent: Apache-HttpClient/4.5.2 (Java/1.8.0_45)..Accept-Encoding: gzip,deflate.....?org.apache.calcite.avatica.proto.Requests$OpenConnectionRequest.F.$2ba8e796-1a29-4484-ac88-6075604152e6....password..none....user..none
##
T 2017/04/05 12:49:07.052011 127.0.0.1:8765 -> 127.0.0.1:60533 [AP]
HTTP/1.1 200 OK..Date: Wed, 05 Apr 2017 16:49:07 GMT..Content-Type: application/octet-stream;charset=utf-8..Content-Length: 91..Server: Jetty(9.2.z-SNAPSHOT).....Aorg.apache.calcite.avatica.proto.Responses$OpenConnectionResponse......hw10447.local:8765
##

The data above is ProtocolBuffers-encoded, which is not a fully human-readable format; however, "string" data is stored as-is, which makes reading it a reasonable task.
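To generate traffic worth observing, any thin (Avatica) client will do. As one sketch, assuming an HDP layout where the Phoenix thin-client script is installed at the path below, starting a sqlline-thin session will cause each statement to appear as an HTTP POST in the ngrep capture above:

$ /usr/hdp/current/phoenix-client/bin/sqlline-thin.py http://localhost:8765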
03-10-2017
11:26 PM
A common error to see in initial installations is the following from the Accumulo TabletServer logs:

Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /apps/accumulo/data/wal/myhost.mydomain.com+9997/1ff916a2-13d0-4bb7-aa38-c44b69831519 could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1649)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3198)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3122)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:843)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1552)
at org.apache.hadoop.ipc.Client.call(Client.java:1496)
at org.apache.hadoop.ipc.Client.call(Client.java:1396)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
This exception will be printed repeatedly in the TabletServer logs, as Accumulo has no option other than to retry creating its write-ahead log file. The exception is, indirectly, telling us multiple things about the current state:

- There are three DataNodes.
- None of the DataNodes were excluded, meaning all three of them should have been able to accept the write.
- None of the DataNodes successfully accepted the write.

The most common cause of this issue is that each DataNode has very little disk space available. When Accumulo creates its write-ahead log files, it sets a large HDFS block size (by default, 1GB). If a DataNode does not have enough free space to store 1GB of data, the allocation fails. When all of the DataNodes are in this situation, you will see the above error message.

The solution to this problem is to provide more storage for the DataNodes. Commonly, the root cause is that HDFS is not configured to use the correct data directories, or some hard drives were not mounted at those data directories (and thus the DataNodes are using the root volume).
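To confirm this is the cause, checking per-DataNode free space is usually enough. A minimal sketch, where the df path is a placeholder for whatever dfs.datanode.data.dir is configured to on your cluster:

$ hdfs dfsadmin -report          # look at "DFS Remaining" for each DataNode
$ df -h /path/to/dfs/data/dir    # on a DataNode, check the mount backing the data dir

If "DFS Remaining" on every DataNode is smaller than the write-ahead log block size, the allocation will fail exactly as shown above.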
12-24-2018
05:38 PM
I followed this with HDP 2.6.5, and the HBase UI became accessible at the given URL, but it has many errors and broken links inside. I posted a question on how to fix this, and the answer resolving most of these issues is here: https://community.hortonworks.com/questions/231948/how-to-fix-knox-hbase-ui.html You are welcome to test this and include these fixes in your article if you find it appropriate. Best regards
04-16-2017
01:28 AM
Use `jstack` to identify why the init process is hanging. Most likely, you do not have a correct accumulo-site.xml, or ZooKeeper or HDFS is not running.
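For example (a sketch, assuming a JDK on the host and that the init process appears in jps output):

$ jps -m | grep -i init    # find the pid of the hanging init process
$ jstack <pid>             # dump its thread stacks to see where it is blocked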
01-10-2017
07:45 PM
1 Kudo
When executing Step 3 of the Ambari installation wizard, "Confirm Hosts", Ambari will (by default) SSH to each node and start an instance of the Ambari Agent process. In some cases, it is possible that the local RPM database is corrupted, and this registration process will fail. The error message in Ambari would look something like:

INFO:root:Executing parallel bootstrap
ERROR:root:ERROR: Bootstrap of host myhost.mydomain fails because previous action finished with non-zero exit code (1)
ERROR MESSAGE: tcgetattr: Invalid argument
Connection to myhost.mydomain closed.
STDOUT: Error: database disk image is malformed
Error: database disk image is malformed
Desired version (2.5.0.0) of ambari-agent package is not available.
tcgetattr: Invalid argument
Connection to myhost.mydomain closed.

In this case, the local RPM database is malformed, and all actions to alter the installed packages on the system will fail until the database is rebuilt. This can be done with the following commands, run as root on the host reporting the error:

[root@myhost ~]# mv /var/lib/rpm/__db* /tmp
[root@myhost ~]# rpm --rebuilddb

Then, click the "Retry Failed Hosts" button in Ambari and the registration should succeed.
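Before retrying, a quick sanity check that the rebuilt database is readable can help; any simple query works, for example this sketch:

[root@myhost ~]# rpm -qa | wc -l    # should count installed packages with no "database disk image is malformed" errors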
09-24-2018
12:21 PM
How to connect to a remote EC2 HDP Phoenix DB from a local Spring Boot application?
11-10-2016
06:25 PM
Nice writeup @wsalazar. I think you can simplify your classpath setup by including only /usr/hdp/current/phoenix-client/phoenix-client.jar and the XML configuration files (core-site, hdfs-site, hbase-site). The phoenix-client.jar contains all of the classes necessary to connect to HBase using the Phoenix (thick) JDBC driver.
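For example, a launch command along those lines might look like the following sketch, where MyJdbcApp is a hypothetical application class and /etc/hbase/conf holds the XML files:

$ java -cp /usr/hdp/current/phoenix-client/phoenix-client.jar:/etc/hbase/conf:. MyJdbcApp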
08-12-2016
04:27 AM
11 Kudos
Apache ZooKeeper is a "high-performance coordination service for distributed applications." Most users do not use ZooKeeper directly; however, most users are also hard-pressed to deploy a Hadoop-based architecture that doesn't rely on ZooKeeper in some way. With its prevalence in the data center, resource management within ZooKeeper is paramount to ensure that the various applications and services relying on ZooKeeper are able to access it in a timely manner. To this end, one of ZooKeeper's protection mechanisms is known as "max client connections", or maxClientCnxns.

maxClientCnxns is a configuration property that can be added to the zoo.cfg configuration file. This property limits the number of active connections from a single host, identified by IP address, to a single ZooKeeper server. By default, this limit is 60 active connections: one host is not allowed to have more than 60 active connections open to one ZooKeeper server. Changes to this property in zoo.cfg require a restart of ZooKeeper. This is a simple way for ZooKeeper to prevent clients from performing a denial-of-service attack against it (maliciously or unwittingly), as well as to limit the amount of memory required by these client connections.
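For illustration, raising the limit is a one-line change in zoo.cfg followed by a ZooKeeper server restart; the value 120 below is an arbitrary example, not a recommendation:

# zoo.cfg: allow up to 120 active connections per client host (default is 60)
maxClientCnxns=120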
The reason this property is so important is that it can effectively deny a host inside the cluster all access to a ZooKeeper server, which can have severe performance and stability impacts. For example, if a node running an Apache HBase RegionServer hits the maxClientCnxns limit, all future requests made by that RegionServer to that ZooKeeper server will be dropped until the overall number of connections to that ZooKeeper server is reduced. Perhaps the worst part is that processes other than HBase running on the same node (e.g. YARN containers running as part of a MapReduce job) can also eat into the allowed connections from the same host.
On a positive note, it is simple to recognize when this rate limiting is happening, and also simple to determine the problematic clients on the rate-limited host. First, there is a very clear error message in the ZooKeeper server log which identifies the host being rate-limited and the current active connections limit:

"Too many connections from 10.0.0.1 - max is 60"

This error message states that a client from the host with IP address 10.0.0.1 is trying to connect to this ZooKeeper server, but the limit is 60 connections; as such, the current connection will be dropped. At this point, we know which host these connections are coming from, but we don't know which applications on that host are making them. We can use a network analysis tool such as netstat to determine the applications on the client host, in this case 10.0.0.1 (let's assume our ZooKeeper server is on 10.0.0.5):

netstat -nape | awk '{if ($5 == "10.0.0.5:2181") print $4, $9;}'
This command lists the local address and process identifier for each connection whose remote address is our ZooKeeper server on the ZooKeeper service port (2181). Similarly, we can group this data to get a count of outgoing connections to the ZooKeeper server by process identifier:

netstat -nape | awk '{if ($5 == "10.0.0.5:2181") print $9;}' | sort | uniq -c
This command reports a count of connections to the ZooKeeper server per process, which can be extremely helpful in identifying misbehaving applications. Additionally, we can use some of the ZooKeeper "four letter word" commands to get further information about the active connections to a ZooKeeper server. Using netcat, either of the following can be used:

echo "stat" | nc 10.0.0.5 2181
echo "cons" | nc 10.0.0.5 2181

Each of these commands outputs information about the active connections to the given ZooKeeper server.
To summarize: the maxClientCnxns property in zoo.cfg is used by the ZooKeeper server to limit incoming connections from a single host, 60 by default. When this limit is reached, new connections to the ZooKeeper server from the given host are immediately dropped. This rate limiting can be observed in the ZooKeeper log, and offending applications can be identified using network tools like netstat. Changes to maxClientCnxns must be accompanied by a restart of the ZooKeeper server.

See also: the ZooKeeper configuration property documentation and the ZooKeeper four letter words documentation.