Member since: 07-17-2019
Posts: 738
Kudos Received: 433
Solutions: 111
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3473 | 08-06-2019 07:09 PM |
| | 3667 | 07-19-2019 01:57 PM |
| | 5189 | 02-25-2019 04:47 PM |
| | 4663 | 10-11-2018 02:47 PM |
| | 1768 | 09-26-2018 02:49 PM |
06-29-2016
09:24 PM
1 Kudo
The efficiency of FECHAOPRCNF or CODNRBEENF as the leading column in the rowkey might depend on the cardinality of distinct values. If you have many distinct CODNRBEENF values, you can efficiently prune a large portion of your table. Conversely, if you have very few records in your date range in FECHAOPRCNF, it may make sense to leave that as the leading column in the rowkey. Either way, you can also use SALT_BUCKETS to make sure you get a good degree of parallelism.
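If it helps, here is a minimal sketch of declaring SALT_BUCKETS in the table DDL through the Phoenix JDBC driver. The column names come from your question, but the table name, the extra value column, the bucket count, and the connection URL are just illustrative assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateSaltedTable {
  public static void main(String[] args) throws Exception {
    // Hypothetical ZooKeeper quorum and znode; adjust to your cluster.
    String url = "jdbc:phoenix:localhost:2181:/hbase-unsecure";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement()) {
      // The leading PRIMARY KEY column (CODNRBEENF here) is what scans can prune on;
      // SALT_BUCKETS spreads the rows across regions for a good degree of parallelism.
      stmt.execute("CREATE TABLE IF NOT EXISTS OPERATIONS ("
          + " CODNRBEENF VARCHAR NOT NULL,"
          + " FECHAOPRCNF DATE NOT NULL,"
          + " VAL DECIMAL,"  // hypothetical value column
          + " CONSTRAINT pk PRIMARY KEY (CODNRBEENF, FECHAOPRCNF)"
          + ") SALT_BUCKETS = 8");
    }
  }
}
```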
06-23-2016
08:32 PM
1 Kudo
The zkUrl you provided the first time should be correct: "zk1-titanu:2181/hbase-unsecure" By default, HBase on HDP will use /hbase-unsecure. If you enable Kerberos authentication, it will use /hbase-secure instead.
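If you are connecting through the plain Phoenix JDBC driver, here is a minimal sketch for reference (in this URL form the ZooKeeper quorum, client port, and znode parent are separated by colons):

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class PhoenixConnectionCheck {
  public static void main(String[] args) throws Exception {
    // Thick-driver URL form: jdbc:phoenix:<zookeeper quorum>:<client port>:<znode parent>
    // /hbase-unsecure is the HDP default; a Kerberized cluster uses /hbase-secure instead.
    String url = "jdbc:phoenix:zk1-titanu:2181:/hbase-unsecure";
    try (Connection conn = DriverManager.getConnection(url)) {
      System.out.println("Connected to Phoenix: " + !conn.isClosed());
    }
  }
}
```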
06-22-2016
01:31 PM
4 Kudos
One of the most common questions I come across when trying to help debug MapReduce jobs is: "How do I change the Log4j level for my job?" Many times, a user has a JAR with a class that implements Tool which they invoke using the hadoop jar command. The desire is to change the log level without changing any code or global configuration files:

hadoop jar MyApplication.jar com.myorg.application.ApplicationJob <args ...>

There is a great deal of misinformation out there because the way to do this changed drastically from the 0.20.x and 1.x Apache Hadoop days. Most posts will point you to some solution involving environment variables or passing Java opts to the mappers/reducers. In practice, there is a very straightforward solution. To change the Mapper Log4j level, set mapreduce.map.log.level. To change the Reducer Log4j level, set mapreduce.reduce.log.level. If for some reason you need to change the Log4j level on the MapReduce ApplicationMaster (e.g. to debug InputSplit generation), set yarn.app.mapreduce.am.log.level. This is the proper way for the Apache Hadoop 2.x release line. Note that these options do not allow configuration of a Log4j level for a specific class or package -- that would require custom logging setup to be provided by your application.

It's important to remember that you are able to define configuration properties (which will appear in your job via the Hadoop Configuration) using the `hadoop jar` command:

hadoop jar <jarfile> <classname> [-Dkey=value ...] [arg, ...]

The `-Dkey=value` section can be used to define the Log4j configuration properties when you launch the job. For example, to set the DEBUG Log4j level on Mappers:

hadoop jar MyApplication.jar com.myorg.application.ApplicationJob -Dmapreduce.map.log.level=DEBUG <args ...>

To set the WARN Log4j level on Reducers:

hadoop jar MyApplication.jar com.myorg.application.ApplicationJob -Dmapreduce.reduce.log.level=WARN <args ...>

To set the DEBUG Log4j level on the MapReduce ApplicationMaster:

hadoop jar MyApplication.jar com.myorg.application.ApplicationJob -Dyarn.app.mapreduce.am.log.level=DEBUG <args ...>

And, of course, these options can be combined:

hadoop jar MyApplication.jar com.myorg.application.ApplicationJob -Dmapreduce.map.log.level=DEBUG -Dmapreduce.reduce.log.level=DEBUG -Dyarn.app.mapreduce.am.log.level=DEBUG <args ...>
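As a point of reference, the reason the -D properties show up in the job is that ToolRunner's GenericOptionsParser copies them into the Hadoop Configuration before your Tool runs. A minimal sketch (the class and package names are just the hypothetical ones from the example above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ApplicationJob extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // ToolRunner's GenericOptionsParser has already copied any -Dkey=value
    // arguments into this Configuration, so the log-level properties are visible here
    // and are passed along to the submitted job.
    Configuration conf = getConf();
    System.out.println("mapper log level: " + conf.get("mapreduce.map.log.level", "INFO"));
    // ... set up and submit the MapReduce job using `conf` ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new ApplicationJob(), args));
  }
}
```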
06-21-2016
03:24 PM
It looks like you somehow upgraded (some?) HDFS jars and broke the Hadoop classpath, so it could not load what it expected from the classpath. Additionally, the SecondaryNameNode is reporting a newer filesystem layout on disk (which implies a newer version of HDFS was running at some point) than the version it expects (which implies the SecondaryNameNode itself is running an older version of HDFS). Make sure you have consistent versions of HDFS installed.
06-20-2016
05:57 PM
Yes, that looks like exactly what happened. 4G is a good heap size to start.
06-20-2016
03:31 PM
2 Kudos
"When I try to run the command it gets the below status and when I
type jps command I can't see the HRegionServer anymore and I have to
restart HBase" Have you looked at the RegionServer log to determine why it is no longer running? It sounds like something is causing your RegionServer to fail (perhaps, out of memory?) and then HBase cannot proceed because it requires at least one RegionServer. Investigate the end of the RegionServer log to determine the failure.
06-20-2016
01:25 PM
8 Kudos
There are many situations in which running benchmarks for certain workloads on Apache Phoenix can provide meaningful insight into an installation. Commonly, such a benchmark is useful for understanding the baseline characteristics of a new Apache Phoenix installation. It is also extremely useful to be able to re-run the same benchmark after changing a configuration property in order to understand the effect of that change. Many approaches exist to test systems that have a SQL interface, many of them focused on a specific type of workload. The following sections describe a few benchmarks which users can run on their own and tweak to a workload which makes sense for their cluster.

Apache JMeter Automation
Apache JMeter is a tool which was initially designed to test web applications; however, it can also execute SQL queries through a JDBC Driver, which lets us define queries to execute against a Phoenix table. An instance of JMeter can run multiple threads of queries in parallel, and multiple instances of JMeter can spread clients across many nodes. The queries can also be parameterized with pseudo-random data in order to simulate all types of queries to a table.

https://github.com/joshelser/phoenix-performance is a project (originally based on https://github.com/cartershanklin/phoenix-performance and https://github.com/ndimiduk/phoenix-performance) which bulk ingests data into Phoenix and then reads the data back using JMeter. Data generation is based on TPC-DS and can be scaled up or down to produce an appropriate volume of data for the cluster being tested. Ingest is accomplished via a MapReduce job which creates HBase HFiles that are then bulk-loaded directly into HBase; this is the most efficient way to ingest a large amount of data into HBase.

A number of example queries are also provided which vary in style, e.g. point queries or range-scan queries. JMeter automates the execution of the queries in parallel. The results of the queries that JMeter ran are aggregated and analyzed together to provide an overall view of performance: mean and median are provided for simple insight, as well as the 90th, 95th, 99th, and 99.9th percentiles to understand the execution tail.

This approach is extremely useful for executing read-heavy workloads. Indexes can be created over the original TPC-DS dataset to mimic your real datasets, and the provided queries are only a starting point which can easily be expanded to any other type of query. The provided README file gives general instructions for generating and querying the data.
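To make the idea concrete, here is a minimal sketch (not part of the project above; the JDBC URL, table, and column names are placeholders) of what the JMeter JDBC samplers are effectively doing: several threads running a parameterized query against Phoenix and reporting simple latency percentiles.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class SimplePhoenixQueryBench {
  // Placeholders: adjust the quorum, table, and column to your cluster and schema.
  private static final String URL = "jdbc:phoenix:localhost:2181:/hbase-unsecure";
  private static final String QUERY = "SELECT * FROM STORE_SALES WHERE SS_ITEM_SK = ?";

  public static void main(String[] args) throws Exception {
    final int threads = 4;
    final int queriesPerThread = 100;
    final List<Long> latenciesNs = Collections.synchronizedList(new ArrayList<Long>());

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.submit(() -> {
        // Each thread gets its own connection, like one member of a JMeter thread group.
        try (Connection conn = DriverManager.getConnection(URL);
             PreparedStatement ps = conn.prepareStatement(QUERY)) {
          for (int i = 0; i < queriesPerThread; i++) {
            // Pseudo-random parameter to simulate many distinct point queries.
            ps.setLong(1, ThreadLocalRandom.current().nextLong(1, 100_000));
            long start = System.nanoTime();
            try (ResultSet rs = ps.executeQuery()) {
              while (rs.next()) {
                // Drain the results; we only care about the elapsed time.
              }
            }
            latenciesNs.add(System.nanoTime() - start);
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);

    List<Long> sorted = new ArrayList<>(latenciesNs);
    Collections.sort(sorted);
    System.out.printf("queries=%d median=%dus p99=%dus%n",
        sorted.size(),
        sorted.get(sorted.size() / 2) / 1000,
        sorted.get((int) (sorted.size() * 0.99)) / 1000);
  }
}
```

JMeter adds the thread management, pseudo-random parameterization, and reporting for you; the project above wires all of that up against TPC-DS data.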
Apache Phoenix Pherf

Pherf is a tool which Apache Phoenix provides out of the box to test both read and write performance. It also aims to provide some means of verifying correctness, but this feature is somewhat limited: it is hard to verify correctness in ways other than record counts.

Pherf requires two things to run a test: a schema and a scenario. A schema is a SQL file defining the DDL (data definition language) for some table(s) or index(es). The scenario defines both the write and read tests to execute against the tables defined in the schema. On the write side, like the JMeter approach, Pherf supports generating pseudo-random data to populate the tables; in addition to purely random data, it can also generate specific values with given probabilities. The scenario then defines the number of records which should be inserted into the table given those data-generation rules. On the read side, Pherf allows the definition of queries to run against the tables which were just populated, along with the expected outcome of those queries.

Pherf can collect metrics about the scenario being executed, but the results are not aggregated and presented for human consumption. Like the JMeter tests, Pherf can be parallelized across many nodes in a cluster to test the system under concurrent user load. There are many other options available to Pherf; the official documentation can be found at https://phoenix.apache.org/pherf.html. Some automation software which tries to handle the installation and execution of Pherf is also available at https://github.com/joshelser/phoenix-pherf-automation.
The "Yahoo! Cloud Serving Benchmark"
https://github.com/brianfrankcooper/YCSB is well-known benchmarking software in the
database field. YCSB has many bindings for both SQL and NoSQL databases, commonly being used directly by Apache HBase
for performance testing. YCSB has workloads which define how data is written and read from the tables in the database. A
workload defines the number of records/operations, the ratio of reads/updates/scans/inserts, and the distribution (e.g.
Zipfian) of data to generate. YCSB doesn't provide fine-grained control over the type of data to generate via
configuration (like JMeter and Pherf do), but this can be nice to not have to configure (using the provided YCSB
workloads as "standard" workloads).
Like all of the above, YCSB can be executed on one node or run concurrently across many nodes. The result of the
benchmark are reported very similarly to what the JMeter approach does (mean, median, and percentiles), but is probably
the most detailed.
YCSB does require some modifications to run against Apache Phoenix (as Phoenix doesn't support the traditional "INSERT"
command). Long term, this modifications will likely land upstream to ease use of YCSB against Phoenix. Summary In conclusion, there are a number of tools available to use to understand the performance of Apache Phoenix. For any user, having a representative benchmark for your specific workloads is an extremely important tool in running a cluster. These kinds of benchmarks let you evaluate the performance of your cluster as your change application and operating-system configurations. Benchmarks do require a bit of effort to understand what the results report. The results should always be looked at critically to ensure the numbers are sensible and that you understand why the results are what they are. All users, whether new or old, should strongly consider investing time into finding the right benchmark for their Apache Phoenix application if they do not already have one.
06-18-2016
11:09 PM
You should be able to use the ExecuteSQL processor to extract data from Phoenix now. As Ted points out, the write-side is still a TBD.
06-17-2016
02:39 PM
3 Kudos
Actually, it looks like we might bundle a script you can call for this:

/usr/hdp/current/hbase-client/bin/hbase-jruby /usr/hdp/current/hbase-client/bin/get-active-master.rb

I'm not sure whether this script is available across older versions of HDP.
06-17-2016
02:14 PM
1 Kudo
Hi Gerd, you found the right ZNode: it is used to ensure there is one and only one active HBase Master at a time. The reason it looks "screwed up" is that you're actually seeing a serialized data structure. The contents of that znode are a serialized instance of the org.apache.hadoop.hbase.protobuf.generated.ZooKeeperProtos.Master protocol buffer class. I'm not seeing anything in the HBase shell yet which would be helpful here.
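If you want the active master programmatically rather than decoding the znode by hand, the Java client can report it. A minimal sketch, assuming the HBase 1.x client API and an hbase-site.xml on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class PrintActiveMaster {
  public static void main(String[] args) throws Exception {
    // Picks up hbase-site.xml from the classpath (e.g. /etc/hbase/conf).
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      ClusterStatus status = admin.getClusterStatus();
      ServerName master = status.getMaster();
      System.out.println("Active master: " + master.getHostname() + ":" + master.getPort());
    }
  }
}
```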