Member since: 09-29-2015
Posts: 94
Kudos Received: 117
Solutions: 35
12-14-2015
06:55 PM
1 Kudo
One of the benefits of using a distribution like HDP is that you never have to deal with version mismatches across components like Hive and HBase. Look into using the HDP artifacts rather than trying to assemble the components together yourself.
12-09-2015
08:07 PM
1 Kudo
In HDP deployments, ZooKeeper is always started as a separate service (rather than HBase managing ZooKeeper). You can see whether ZooKeeper is running from Ambari or by manually looking at the running processes. You can also use the zkCli command to connect to the running ZooKeeper and inspect its state. However, as Josh already pointed out, for a Java application talking to HBase, the correct way to configure the client is to add HBase's configuration directory (/etc/hbase/conf) to your classpath. See https://community.hortonworks.com/articles/4091/hbase-client-application-best-practices.html for details.
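A minimal sketch of that setup (the table name "mytable" is just an illustration): assuming /etc/hbase/conf is on the classpath when the JVM is launched (for example, java -cp /etc/hbase/conf:myapp.jar ...), HBaseConfiguration.create() picks up the ZooKeeper quorum from hbase-site.xml without any conf.set() calls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class HBaseClasspathCheck {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the directories on the classpath; no manual conf.set() needed.
    Configuration conf = HBaseConfiguration.create();
    System.out.println("hbase.zookeeper.quorum = " + conf.get("hbase.zookeeper.quorum"));

    // Opening a connection verifies that the client can actually reach ZooKeeper and HBase.
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("mytable"))) {
      System.out.println("Connected; obtained a handle to table " + table.getName());
    }
  }
}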
12-02-2015
07:37 PM
1 Kudo
HBase master reads the list of files of the regions of tables in a few cases:

(1) CatalogJanitor. This runs every hbase.catalogjanitor.interval (5 minutes by default) and garbage collects regions that have been split or merged. The catalog janitor checks whether the daughter regions (after a split) still hold references to the parent region. Once the references are compacted away, the parent can be deleted. Notice that this process should only access recently split or merged regions.

(2) HFile/WAL cleaner. This runs every hbase.master.cleaner.interval (1 minute by default) and garbage collects data files (hfiles) and WAL files. Data files in HBase can be referenced by more than one region or table and shared across snapshots and live tables, and there is also a minimum time (TTL) that an hfile/WAL will be kept around. That is why the master is responsible for reference counting and garbage collecting the data files. This is possibly the most expensive NameNode operation of the ones in this list.

(3) Region balancer. The balancer takes locality into account for balancing decisions, which is why it will do file listings to find the locality of the blocks of the files in the regions. The locality of files is kept in a local cache for 240 minutes (unfortunately hard-coded).
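For reference, both intervals can be overridden in hbase-site.xml; the values below are simply the defaults mentioned above, expressed in milliseconds.

<!-- Illustrative only: the default intervals, in milliseconds -->
<property>
  <name>hbase.catalogjanitor.interval</name>
  <value>300000</value> <!-- 5 minutes -->
</property>
<property>
  <name>hbase.master.cleaner.interval</name>
  <value>60000</value> <!-- 1 minute -->
</property>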
11-23-2015
07:18 PM
Sure, please follow up on the internal ticket.
11-20-2015
07:13 PM
Thanks, Artem, for bringing this to attention. Yes, that configuration is deprecated and is simply being ignored in HDP-2.3. I'll raise an internal ticket for a documentation update.
11-18-2015
11:59 PM
You should ALWAYS depend on the HDP versions of the artifacts, coming from the HDP Maven repositories, instead of the Apache versions. The dependencies resolved from those Maven artifacts will then be correctly resolved to the exact same versions of the artifacts. Remember that the base versions of the HDP client and server jars are only indicative of which commits are in those binaries. However, the HDP versions of the client jars may contain other fixes that are not available in that particular Apache base version. See item (6) in http://community.hortonworks.com/articles/4091/hbase-client-application-best-practices.html
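For illustration, a minimal pom.xml sketch; the repository URL and the HDP build suffix in the version are placeholders, so check the HDP documentation for the exact values matching your cluster.

<!-- Hypothetical example: pull the HBase client jar from the HDP repository
     instead of Maven Central. URL and version string are placeholders. -->
<repositories>
  <repository>
    <id>hortonworks</id>
    <url>http://repo.hortonworks.com/content/repositories/releases/</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <!-- HDP-style version: Apache base version plus the HDP build suffix -->
    <version>1.1.2.2.3.4.0-3485</version>
  </dependency>
</dependencies>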
11-17-2015
01:11 AM
6 Kudos
Here are the best practices for writing an HBase client application for HDP.

1. Use the new HBase 1.0 APIs instead of the old interfaces: instead of HTable use Table, instead of HConnection use Connection, and so on. Connection management has also changed, so that connection lifecycle management is now best performed by the client application. Check out these slide decks for more examples: https://www.dropbox.com/s/v1x3djtlp1qg204/HBase%201.0%20API%20Changes%20-%20Meetup%2010_15_2014.pdf?dl=0 and http://www.slideshare.net/enissoz/meet-hbase-10, as well as further examples here: https://github.com/hortonworks/hbase-release/tree/HDP-2.5.3.0-tag/hbase-examples

2. Always close the Connection, Table, Scanner and Admin interfaces when you are done. These interfaces implement Closeable and hold resources on the client side or on the server side (Scanner), so closing them properly is very important for performance. Java's try-with-resources syntax is an easy way to auto-close these resources in your code:

try (Connection connection = ConnectionFactory.createConnection(conf);
     Admin admin = connection.getAdmin();
     Table table = connection.getTable(tableName)) {
  table.get(new Get(...));
}

3. Make sure to understand the creation cost, lifecycle and thread-safety of Connection, Table and similar interfaces. In short, Connection is thread-safe and very heavy-weight (it owns the underlying zookeeper connection, socket connections, etc.), so it should be created once per application and shared across threads. Table, Admin, etc. on the other hand are lightweight and NOT thread-safe. Check the links above for more documentation on these interfaces. Typically, you would open a Connection and only close it when the application shuts down; Table and Admin objects can be created and closed per request.

4. Use BufferedMutator for streaming / batch Puts. BufferedMutator replaces HTable.setAutoFlush(false) and is the supported high-performance streaming-writes API (a short sketch follows at the end of this post).

5. Make sure that hbase-site.xml is sourced instead of manually calling conf.set() for hbase.zookeeper.quorum, etc. When Configuration conf = HBaseConfiguration.create(); is called, HBase looks for a file named "hbase-site.xml" in all of the DIRECTORIES in the classpath. Thus, if the application adds /etc/hbase/conf (the default location for HDP) to its classpath at startup, there is no need to manually call conf.set() for client settings. Applications should do this in particular because other client-level configuration settings coming from the Ambari deployment then get picked up automatically by the client application, without code changes.

6. Make sure to depend on the correct version of the client jars. Applications should always depend on the HDP versions of the artifacts, coming from the HDP repo, instead of the Apache versions. HDP versions of components are usually binary and wire compatible with the base Apache versions. However, there may be fixes in the client jars that the application would otherwise not see, resulting in hard-to-debug issues. See http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.8.0/bk_user-guide/content/user-guide-setup-maven-repo.html for an example.

Enis
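A minimal sketch for item 4 above; the table name "t1" and column family "f" are hypothetical. BufferedMutator buffers Puts on the client and flushes them to the server in batches, replacing the old HTable.setAutoFlush(false) pattern.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         BufferedMutator mutator = connection.getBufferedMutator(TableName.valueOf("t1"))) {
      for (int i = 0; i < 10000; i++) {
        Put put = new Put(Bytes.toBytes("row-" + i));
        put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
        mutator.mutate(put); // buffered on the client, flushed in batches
      }
      mutator.flush(); // flush anything still buffered (close() also flushes)
    }
  }
}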
10-26-2015
06:47 PM
Using WASB as a filesystem backend for HBase has been tested and certified jointly by Hortonworks and Microsoft. However, you should not use it with a plain HDP install on WASB, since some extra configuration for the WAL directories is needed (storing WALs on page blob storage versus storing data files on block blob storage). I would recommend using HDInsight (HDI) instead.
10-21-2015
08:43 PM
There are two aspects to the question.

The first is whether replication can be controlled within a geo-region so that a user's data lives only inside that region. This is possible, in theory, in a couple of different ways. If we can partition users by geo-region into different tables and set up replication across all datacenters within the geo-region, then we have achieved the boundary requirement: some tables are replicated only to datacenters within the geo-region, while other tables are replicated across geo-regions. HBase's replication model is pretty flexible in the sense that we can do cyclic replication, etc. (please read https://hbase.apache.org/book.html#_cluster_replication). If we cannot partition by table, we can still use the same table but partition by column family (as noted above). Otherwise, we can still respect boundaries using a recent feature called WALEntryFilters. The basic idea would be to implement a custom WALEntryFilter which either (a) understands the data and selects which edits (mutations) to send to the receiving side (another geo-region), or (b) tags every edit with the intended geo-regions it should reach and has the WALEntryFilter respect those tags on the mutations (see the rough sketch below).

The second aspect is whether you can query the whole data set from any geo-region. Of course, if some data never leaves its particular geo-region, you cannot have all the data aggregated in a single DC. So the only way to access the data as a whole would be to dynamically send the query to all affected geo-regions and merge the results back.
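A rough sketch of option (a), assuming the HBase 1.x replication API (the exact accessor names differ slightly across versions, and the "regional_" table-name prefix is purely hypothetical): a custom WALEntryFilter attached to the cross-region peer that drops edits belonging to tables that must stay within their geo-region.

import org.apache.hadoop.hbase.replication.WALEntryFilter;
import org.apache.hadoop.hbase.wal.WAL;

public class GeoBoundaryWALEntryFilter implements WALEntryFilter {
  @Override
  public WAL.Entry filter(WAL.Entry entry) {
    // Hypothetical convention: tables named "regional_*" must not leave their geo-region.
    // Returning null drops the entry from replication to this peer;
    // returning the entry ships it unchanged.
    String table = entry.getKey().getTablename().getNameAsString();
    return table.startsWith("regional_") ? null : entry;
  }
}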