Why Share a Hive Metastore?

 

Many organizations are moving workloads to the cloud to take advantage of the flexibility it offers.

 

A key enabler of this flexibility is a shared, persistent repository of metadata and services: ephemeral compute clusters scale up and down and connect to this shared service layer, while all clusters and metadata services share a unified storage layer.

 

These capabilities are at the core of Cloudera’s next-generation product, the Cloudera Data Platform.

 

Can we deploy this architecture today, with Hortonworks Data Platform 3? 

 

A key piece of this architecture is sharing a single Hive Metastore between all clusters. Hive is HDP's SQL engine, and the Hive Metastore contains the metadata that lets services on each cluster know where and how Hive tables are stored, and how to access them.

 

Let’s look at our options.

 

Standalone Hive Metastore Service

 

A standalone Hive Metastore Service (HMS) can be installed on a node outside the HDP cluster, but this configuration is not supported by Cloudera. To be supported, the HMS must be installed on an Ambari-managed node within the HDP cluster.

 

Shared Hive Metastore Service

 

In this configuration, a single cluster runs the Hive Metastore Service, and any additional clusters are configured to use the first cluster's HMS rather than running their own.
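
As a rough sketch of the relevant hive-site.xml entry, each additional cluster would point at the first cluster's HMS; the hostname below is a placeholder, and 9083 is the default HMS port:

<property>
  <name>hive.metastore.uris</name>
  <!-- hypothetical host running the first cluster's Hive Metastore Service -->
  <value>thrift://metastore-host.example.com:9083</value>
</property>

With hive.metastore.uris set to a remote URI, Hive clients on those clusters talk to the shared HMS over Thrift instead of starting a metastore of their own.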

 

There are performance trade-offs: the combined load from every cluster on the single shared HMS can degrade performance, and because the HMS is not local to each cluster, metastore calls incur additional network latency.

 

No Hive Metastore Service, Shared RDBMS

 

Recall that the Hive Metastore Service sits on top of an RDBMS which contains the actual metadata.
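
To make this concrete, here is a hedged example against the stock metastore schema (table and column names as shipped with Hive; the query assumes a MySQL-backed metastore) showing that databases, tables, and storage locations are simply rows in the RDBMS:

-- List every Hive database, table, and its storage location
SELECT d.NAME AS db_name,
       t.TBL_NAME,
       s.LOCATION
FROM   DBS d
JOIN   TBLS t ON t.DB_ID = d.DB_ID
JOIN   SDS  s ON s.SD_ID = t.SD_ID;

Whichever option below is chosen, it is ultimately rows like these that every cluster needs to read and write consistently.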

 

It is possible to configure every cluster to use its own embedded, in-process metastore (configure Hive with hive.metastore.uris=<blank>) while pointing them all at a common RDBMS. Bypassing the Hive Metastore Service in this way can give significant performance gains, with some trade-offs.
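
A minimal sketch of the hive-site.xml settings involved, assuming a shared MySQL instance (the JDBC URL and credentials are placeholders):

<property>
  <name>hive.metastore.uris</name>
  <!-- left blank: use the embedded, in-process metastore -->
  <value></value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- hypothetical shared RDBMS holding the metastore schema -->
  <value>jdbc:mysql://shared-db.example.com:3306/hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>********</value>
</property>

Every Hive client then opens JDBC connections directly to the shared database, which is where both the performance gain and the trade-offs below come from.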

 

First, all clusters connecting to the RDBMS must be fully trusted, as they will have unrestricted access to the metastore DB.

 

Second, all clusters must be on exactly the same version at all times. Many HDP versions (and even patches) change the metastore DB schema, so a cluster that connects with a different version can make schema changes that impact every other cluster. This check is normally enforced by the Hive Metastore Service, which is not present in this configuration.
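
One practical safeguard, assuming a MySQL-backed metastore: before attaching a cluster, compare its expected schema version against the version recorded in the metastore's VERSION table, and keep hive.metastore.schema.verification=true on every cluster so a mismatched client fails fast instead of silently altering the schema:

-- Run against the shared metastore database
SELECT SCHEMA_VERSION, VERSION_COMMENT FROM VERSION;

Every cluster you attach should expect exactly the schema version this query returns.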

 

For further detail, please see this presentation by Yahoo! Japan in which they discuss the performance gains they saw using this architecture. 

 

Summary

 

You can share a Hive Metastore between multiple clusters with HDP 3! There are management, performance, and security trade-offs that must be considered.

 

Please contact Cloudera if you are interested in this type of deployment and would like to discuss further!
