Member since: 07-17-2019
Posts: 738
Kudos Received: 433
Solutions: 111
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 4264 | 08-06-2019 07:09 PM |
|  | 4222 | 07-19-2019 01:57 PM |
|  | 6083 | 02-25-2019 04:47 PM |
02-07-2017
06:41 PM
s/HBASEMANTERHOST/HBASEMASTERHOST/? Users would be filling in the explicit hostname here, correct? Not the literal string "HBASEMASTERHOST".
01-23-2017
09:02 PM
1 Kudo
Sometimes, in the face of initial configuration and setup problems, it is easier to completely re-initialize an Accumulo installation than to try to repair it. These steps differ from what might be found in the Apache ecosystem because they are specific to how Apache Ambari installs and configures Accumulo.

Warning: the following steps will irreversibly remove all Accumulo-related data. This includes table data, namespaces, table configuration, and Accumulo users. Do not perform these steps unless you are positive that you do not want to preserve any of this information.

First, Accumulo must be stopped. This can be done via Ambari and verified using tools like `ps` on the nodes.

Second, the Accumulo HDFS directory should be removed. This can be done by the HDFS superuser ("hdfs" by default) or the Accumulo user ("accumulo" by default):

# sudo -u hdfs hdfs dfs -rm -R /apps/accumulo/data

Next, Accumulo needs to be re-initialized using the command-line tools. This command must be executed from a node in your cluster where an Accumulo service is currently installed (the Accumulo Client is not sufficient):

# sudo -u accumulo ACCUMULO_CONF_DIR=/etc/accumulo/conf/server accumulo init --instance-name hdp-accumulo-instance --clear-instance-name

This command requires two pieces of information. The first is provided as an argument to the command: the Accumulo instance name. By default, Ambari uses the name "hdp-accumulo-instance"; however, users may have provided their own value. Because this name is how clients find and connect to Accumulo, it is important to use the correct one. The second piece of information is the Accumulo root user's password, for which you will be prompted after running the command. This is only relevant when Kerberos authentication is not configured; when Kerberos is enabled, this command will instead prompt you for the full Kerberos principal of the user to grant Accumulo administrative (SYSTEM) permissions to.

If this command returns successfully, you can restart Accumulo via Ambari. Visit the Accumulo Monitor page to verify that the system is online and/or use the proper Accumulo credentials to access the system via the Accumulo shell.
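After the restart, a quick smoke test is to open the Accumulo shell and list tables; a freshly re-initialized instance should contain only the Accumulo system tables (e.g. accumulo.metadata). A minimal sketch, assuming Kerberos is not enabled and using the root password chosen during initialization:

# sudo -u accumulo accumulo shell -u root

Once in the shell, running the `tables` command will list the tables visible to the root user.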
01-11-2017
04:40 PM
@Christopher Bridge There was a recent issue in https://issues.apache.org/jira/browse/PHOENIX-3126, but it was not included in HDP-2.5.0.0. Connections by PQS to HBase are always made with the PQS principal and keytab -- the end user is always "proxied" on top of the PQS credentials. If you have example code that shows something happening and can describe why you think it is wrong, I'll try to take a look. If you are a Hortonworks customer, you can/should also reach out through support channels.
01-10-2017
07:45 PM
1 Kudo
When executing Step 3 of the Ambari installation wizard, "Confirm Hosts", Ambari will (by default) SSH to each node and start an instance of the Ambari Agent process. In some cases, the local RPM database may be corrupted and this registration process will fail. The error message in Ambari would look something like:

INFO:root:Executing parallel bootstrap
ERROR:root:ERROR: Bootstrap of host myhost.mydomain fails because previous action finished with non-zero exit code (1)
ERROR MESSAGE: tcgetattr: Invalid argument
Connection to myhost.mydomain closed.
STDOUT: Error: database disk image is malformed
Error: database disk image is malformed
Desired version (2.5.0.0) of ambari-agent package is not available.

In this case, the local RPM database is malformed, and all actions to alter the installed packages on the system will fail until the database is rebuilt. This can be done with the following commands as root on the host reporting the error:

[root@myhost ~]# mv /var/lib/rpm/__db* /tmp
[root@myhost ~]# rpm --rebuilddb

Then, click the "Retry Failed Hosts" button in Ambari and the registration should succeed.
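Before retrying in Ambari, a simple query against the rebuilt database should now succeed; a minimal check, for example:

[root@myhost ~]# rpm -qa | wc -l

If this completes without "database disk image is malformed" errors, the RPM database is readable again.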
12-12-2016
02:31 AM
2 Kudos
One quirk of Apache Phoenix when compared to a traditional RDBMS is that Phoenix provides no notion of simple username/password-based authentication. This largely stems from Apache HBase, on which Phoenix is built, also not providing this form of authentication. With the introduction of the Phoenix Query Server, we have a number of new ways to interact with Phoenix, and the ability to hook together new systems to provide features, like username/password authentication, that are not traditionally supported.
There are multiple products available which can perform this kind of authentication, but we can trivially show that it works via a common HTTP load balancer, HAProxy. Let's assume that we have the Phoenix Query Server running on our local machine, listening on the standard port 8765. We can enable simple HTTP Basic authentication in front of it using HAProxy. First, we need to create the HAProxy configuration file.
global
    maxconn 256

defaults
    mode http
    option redispatch
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

userlist AvaticaUsers
    user josh insecure-password secret

frontend avatica-http-in
    bind *:9000
    default_backend avaticaservers

backend avaticaservers
    balance source
    server queryserver1 127.0.0.1:8765 check
    acl AuthOkay http_auth(AvaticaUsers)
    http-request auth if !AuthOkay
The above contents can be placed into a file and referenced when starting HAProxy (e.g. `haproxy -f my_auth.conf`). The result is HAProxy listening on port 9000 and applying HTTP Basic authentication to requests before they are dispatched to the backend PQS. This example will only accept the username/password combination of "josh" and "secret". Wiring in an external authentication system is left as an exercise for the reader.
With the changes presently staged in PHOENIX-3517, we can easily connect to PQS, via HAProxy, using our username/password and the HTTP Basic authentication method.
./sqlline-thin.py -a BASIC --auth-user=josh --auth-password=secret http://localhost:9000
Similarly, using a username or password that doesn't match the configuration would result in the client receiving an HTTP/403 error and being unable to access Phoenix.
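The same behavior can be sanity-checked without a JDBC client at all; a quick sketch using curl (the exact response body from the query server will vary, the status code is what matters):

# Without credentials, HAProxy should reject the request with an authentication error
curl -i http://localhost:9000/
# With valid credentials, the request is forwarded through to PQS
curl -i -u josh:secret http://localhost:9000/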
This example can be extrapolated to related technology like Apache Knox, which provides a fully featured authentication-gateway service, and shows how we can bring username/password authentication to Apache Phoenix in the near future.
11-10-2016
06:25 PM
Nice writeup @wsalazar. I think you can simplify your classpath setup by only including the /usr/hdp/current/phoenix-client/phoenix-client.jar and the XML configuration files (core-site, hdfs-site, hbase-site). The phoenix-client.jar will contain all of the classes necessary to connect to HBase using the Phoenix (thick) JDBC driver.
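For illustration, a minimal sketch of what that invocation could look like; the application class name is hypothetical, and the configuration directories assume a typical HDP layout:

# Classpath: the shaded Phoenix client JAR plus the directories containing
# core-site.xml, hdfs-site.xml, and hbase-site.xml
java -cp /usr/hdp/current/phoenix-client/phoenix-client.jar:/etc/hadoop/conf:/etc/hbase/conf com.myorg.MyPhoenixApp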
08-12-2016
04:27 AM
11 Kudos
Apache ZooKeeper is a "high-performance coordination service for distributed applications." Most users do not use ZooKeeper directly; however, most users are also hard-pressed to deploy a Hadoop-based architecture that doesn't rely on ZooKeeper in some way. Given its prevalence in the data center, resource management within ZooKeeper is paramount to ensure that the various applications and services relying on ZooKeeper can access it in a timely manner. To this end, one of ZooKeeper's protection mechanisms is known as "max client connections", or maxClientCnxns.
maxClientCnxns is a configuration property that can be added to the zoo.cfg configuration file. This property limits the number of active connections from a host, identified by IP address, to a single ZooKeeper server. By default, the limit is 60 active connections: one host is not allowed to have more than 60 active connections open to one ZooKeeper server. Changes to this property in zoo.cfg require a restart of ZooKeeper. This is a simple way for ZooKeeper to prevent clients from performing a denial-of-service attack against it (maliciously or unwittingly), as well as to limit the amount of memory required by these client connections.
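For example, raising the limit is a one-line change in zoo.cfg (the value here is only an illustration), followed by a ZooKeeper restart:

# zoo.cfg: allow up to 100 concurrent connections per client IP per server
maxClientCnxns=100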
The reason this property is so important is that it can effectively deny all access from a host inside a cluster to a ZooKeeper server, which can have severe performance and stability impacts on the cluster. For example, if a node running an Apache HBase RegionServer hits the maxClientCnxns limit, all future requests made by that RegionServer to that ZooKeeper server would be dropped until the overall number of connections to the ZooKeeper server is reduced. Perhaps the worst part is that processes other than HBase running on the same node (e.g. YARN containers that are part of a MapReduce job) can also eat into the allowed connections from the same host.
On a positive note, it is simple to recognize when this rate limiting is happening, and also simple to determine the problematic clients on the rate-limited host. First, there is a very clear error message in the ZooKeeper server log which identifies the host being rate-limited and the current active-connections limit:

"Too many connections from 10.0.0.1 - max is 60"
This error message states that a client from the host with IP address 10.0.0.1 is trying to connect to this ZooKeeper server, but the limit is 60 connections; as such, the current connection will be dropped. At this point, we know the host these connections are coming from, but we don't know which applications on that host are making them. We can use a network analysis tool such as `netstat` to determine the applications on the client host, in this case 10.0.0.1 (let's assume our ZooKeeper server is on 10.0.0.5):

netstat -nape | awk '{if ($5 == "10.0.0.5:2181") print $4, $9;}'

This command lists the local address and process identifier for each connection whose remote address is our ZooKeeper server on the ZooKeeper service port (2181). Similarly, we can group this data to get a count of outgoing connections to the ZooKeeper server by process identifier:

netstat -nape | awk '{if ($5 == "10.0.0.5:2181") print $9;}' | sort | uniq -c
This command reports a count of connections to the ZooKeeper server per process, which can be extremely helpful in identifying misbehaving applications. Additionally, we can use some of ZooKeeper's "four letter word" commands to get more information about the active connections to a ZooKeeper server. Using netcat, either of the following can be used:

echo "stat" | nc 10.0.0.5 2181
echo "cons" | nc 10.0.0.5 2181

Each of these commands outputs data containing information about the active connections to the given ZooKeeper server.
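As a rough sketch, the `cons` output can also be aggregated into a per-host connection count directly on the server side; the field parsing below assumes connection lines of the form `/10.0.0.1:54321[...](...)`, which can vary between ZooKeeper versions:

# Count active connections per client IP as reported by the "cons" command
echo "cons" | nc 10.0.0.5 2181 | awk -F'[/:]' 'NF > 1 {print $2}' | sort | uniq -c | sort -rn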
To summarize, the maxClientCnxns property in zoo.cfg is used by the ZooKeeper server to limit incoming connections from a single host, with a default limit of 60. When this limit is reached, new connections to the ZooKeeper server from the given host are immediately dropped. This rate limiting can be observed in the ZooKeeper log, and offending applications can be identified using network tools like netstat. Changes to maxClientCnxns must be accompanied by a restart of the ZooKeeper server.
References:
- ZooKeeper configuration property documentation
- ZooKeeper four letter words documentation
06-22-2016
01:31 PM
4 Kudos
One of the most common questions I come across when trying to help debug MapReduce jobs is: "How do I change the Log4j level for my job?" Many times, a user has a JAR with a class that implements Tool, which they invoke using the hadoop jar command. The desire is to change the log level without changing any code or global configuration files:

hadoop jar MyApplication.jar com.myorg.application.ApplicationJob <args ...>

There is a large amount of misinformation out there because how to do this changed drastically between the 0.20.x/1.x and 2.x Apache Hadoop release lines. Most posts will inform you of some solution involving environment variables or passing Java opts to the mappers and reducers. In practice, there is a very straightforward solution. To change the Mapper Log4j level, set mapreduce.map.log.level. To change the Reducer Log4j level, set mapreduce.reduce.log.level. If for some reason you need to change the Log4j level on the MapReduce ApplicationMaster (e.g. to debug InputSplit generation), set yarn.app.mapreduce.am.log.level. This is the proper way for the Apache Hadoop 2.x release line. These options do not allow configuring a Log4j level on a specific class or package; that would require custom logging setup provided by your application. It's important to remember that you can define configuration properties (which will appear in your job via the Hadoop Configuration) using the `hadoop jar` command:

hadoop jar <jarfile> <classname> [-Dkey=value ...] [arg, ...]

The `-Dkey=value` section can be used to define the Log4j configuration properties when you launch the job. For example, to set the DEBUG Log4j level on Mappers:

hadoop jar MyApplication.jar com.myorg.application.ApplicationJob -Dmapreduce.map.log.level=DEBUG <args ...>

To set the WARN Log4j level on Reducers:

hadoop jar MyApplication.jar com.myorg.application.ApplicationJob -Dmapreduce.reduce.log.level=WARN <args ...>

To set the DEBUG Log4j level on the MapReduce Application Master:

hadoop jar MyApplication.jar com.myorg.application.ApplicationJob -Dyarn.app.mapreduce.am.log.level=DEBUG <args ...>

And, of course, these options can be combined:

hadoop jar MyApplication.jar com.myorg.application.ApplicationJob -Dmapreduce.map.log.level=DEBUG -Dmapreduce.reduce.log.level=DEBUG -Dyarn.app.mapreduce.am.log.level=DEBUG <args ...>
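Once the job has run with a raised log level, the extra output appears in the task logs. One way to confirm the change took effect, assuming YARN log aggregation is enabled and substituting your real application ID for the placeholder below:

# Fetch the aggregated logs for the finished job and look for DEBUG lines
yarn logs -applicationId application_1466000000000_0001 | grep DEBUG | head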
06-20-2016
01:25 PM
8 Kudos
There are many situations in which running benchmarks for certain workloads on Apache Phoenix can provide meaningful insight into an installation. Commonly, such a benchmark is very useful for understanding the baseline characteristics of a new installation of Apache Phoenix. Alternatively, the ability to re-run the same benchmark after changing a configuration property is extremely useful for understanding the effect of that change. Many approaches exist to test systems that have a SQL interface, many of them focused on a specific type of workload. The following sections describe a few benchmarks which users can run on their own and tweak to a workload which makes sense for their cluster.

Apache JMeter Automation
Apache JMeter is a tool which was initially designed to test web applications; however, it also has the ability to execute SQL queries against a JDBC Driver. JMeter allows us to define queries to execute against a Phoenix table. An instance of JMeter can run multiple threads of queries in parallel, and multiple instances of JMeter can spread clients across many nodes. The queries can also be parameterized with pseudo-random data in order to simulate all types of queries to a table.
https://github.com/joshelser/phoenix-performance is a project (originally based on https://github.com/cartershanklin/phoenix-performance and https://github.com/ndimiduk/phoenix-performance) which bulk-ingests data into Phoenix and then reads the data back using JMeter. The data generation is done by TPC-DS and can scale from small to large to produce an appropriate amount of data for the cluster being tested. This is accomplished via a MapReduce job which creates HBase HFiles that are then bulk-imported directly into HBase; this approach is the most efficient way to ingest a large amount of data into HBase.
A number of example queries are also provided which vary in style, e.g. point queries or range-scan queries. JMeter automates the execution of the queries in parallel. The results of the queries that JMeter ran are also aggregated and analyzed together to provide an overall view of the performance. Mean and median are provided for simple insight, as well as 90th, 95th, 99th, and 99.9th percentiles to understand the execution tail.
This approach is extremely useful for executing read-heavy workloads. Indexes can be created over the original TPC-DS dataset to mimic your real datasets. The provided queries are only a starting point and can easily be expanded to any other type of query.
The provided README file gives general instructions for generating and querying the data.

Apache Phoenix Pherf
Pherf is a tool which Apache Phoenix provides out of the box to test both read and write performance. It also aims to provide some means of verifying correctness, but this feature is a bit lacking, as it is hard to test correctness in ways other than record counts.
Pherf requires two things to run a test: a schema and a scenario. A schema is a SQL file defining DDL (data definition language) for some table(s) or index(es). The scenario defines both the write and read tests to execute against the tables defined in the schema. On the write side, like the JMeter support, Pherf supports the generation of pseudo-random data to populate the tables. In addition to purely random data, Pherf also has the ability to specify data to write with given probabilities. The scenario then defines the number of records which should be inserted into the table given the rules on the data generation. On the read side, Pherf allows the definition of queries, and the expected outcomes of those queries, to be run against the tables which were just populated.
Pherf can collect metrics about the scenario being executed, but the results are not aggregated and presented for human consumption.
Like the JMeter tests, Pherf can be parallelized across many nodes in a cluster to test the system under concurrent user load. There are many other options available to Pherf. The official documentation can be found at https://phoenix.apache.org/pherf.html. Some automation software which tries to handle the installation and execution of Pherf is also available at https://github.com/joshelser/phoenix-pherf-automation.
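As a rough sketch of what an invocation looks like (flags as described in the Pherf documentation; the schema and scenario file names here are placeholders, matched as regular expressions):

# Drop existing Pherf tables, load data (-l), and run queries (-q)
# against the cluster at the given ZooKeeper quorum
pherf.py -drop all -l -q -z localhost -schemaFile .*my_schema.sql -scenarioFile .*my_scenario.xml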
The "Yahoo! Cloud Serving Benchmark"
https://github.com/brianfrankcooper/YCSB is well-known benchmarking software in the
database field. YCSB has many bindings for both SQL and NoSQL databases, commonly being used directly by Apache HBase
for performance testing. YCSB has workloads which define how data is written and read from the tables in the database. A
workload defines the number of records/operations, the ratio of reads/updates/scans/inserts, and the distribution (e.g.
Zipfian) of data to generate. YCSB doesn't provide fine-grained control over the type of data to generate via
configuration (like JMeter and Pherf do), but this can be nice to not have to configure (using the provided YCSB
workloads as "standard" workloads).
Like all of the above, YCSB can be executed on one node or run concurrently across many nodes. The result of the
benchmark are reported very similarly to what the JMeter approach does (mean, median, and percentiles), but is probably
the most detailed.
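For orientation, a run against a JDBC endpoint typically looks like the following; the Phoenix JDBC URL is an assumption for your cluster, and (as noted below) stock YCSB needs modifications before it runs cleanly against Phoenix:

# Load the workload's records through the JDBC binding, then run the workload
bin/ycsb load jdbc -P workloads/workloada -p db.driver=org.apache.phoenix.jdbc.PhoenixDriver -p db.url=jdbc:phoenix:zk-host:2181
bin/ycsb run jdbc -P workloads/workloada -p db.driver=org.apache.phoenix.jdbc.PhoenixDriver -p db.url=jdbc:phoenix:zk-host:2181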
YCSB does require some modifications to run against Apache Phoenix (as Phoenix doesn't support the traditional "INSERT" command). Long term, these modifications will likely land upstream to ease the use of YCSB against Phoenix.

Summary

In conclusion, there are a number of tools available for understanding the performance of Apache Phoenix. For any user, having a representative benchmark for your specific workloads is an extremely important tool in running a cluster. These kinds of benchmarks let you evaluate the performance of your cluster as you change application and operating-system configurations. Benchmarks do require a bit of effort to understand what the results report; the results should always be looked at critically to ensure the numbers are sensible and that you understand why the results are what they are. All users, whether new or old, should strongly consider investing time into finding the right benchmark for their Apache Phoenix application if they do not already have one.
06-09-2016
04:07 PM
1 Kudo
Nice table -- what version of HDP did you base it on?