Many organizations are moving workloads to the cloud to take advantage of the flexibility it offers.
A key to this flexibility is a shared, persistent repository of metadata and services. Ephemeral compute clusters scale up and down and connect to this shared service layer. All clusters and metadata services share a unified storage layer.
Can we deploy this architecture today, with Hortonworks Data Platform 3?
A key piece to this architecture is sharing a single Hive Metastore between all clusters. Hive is HDP’s SQL engine. The Hive metastore contains the metadata which allows services on each cluster to know where and how Hive tables are stored, and access those tables.
Let’s look at our options.
Standalone Hive Metastore Service
A standalone Hive Metastore Service could be installed on a node outside of the HDP cluster. This configuration is not supported by Cloudera. To be supported, HMS must be installed on an Ambari-managed node within the HDP cluster.
Shared Hive Metastore Service
In this configuration, a single cluster is configured with a Hive Metastore Service. Any additional clusters are configured to use the HMS of the first cluster, rather than their own HMS.
There are performance trade-offs. The load on the shared HMS in this configuration can reduce performance. Additionally, the fact that the HMS is not local to each cluster can lead to network and other latency.
No Hive Metastore Service, Shared RDBMS
Recall that the Hive Metastore Service sits on top of an RDBMS which contains the actual metadata.
It is possible to configure all clusters to use their local metastore (configure Hive with hive.metastore.uris=<blank>) and share a common RDBMS. Bypassing the Hive Metastore service in this way gives significant performance gains, with some tradeoffs.
First, all clusters connecting to the RDBMS must be fully trusted, as they will have un-restricted access to the metastore DB.
Second, all clusters must be on the exact same version at all times. Many versions (and even patches) of HDP make changes to the metastore DB schema. Any cluster which connects with a different version can cause significant changes which will impact all other clusters. Ensuring this does not happen is usually done by the Hive Metastore Service, which is not present in this configuration.