Created on 11-09-202303:00 PM - edited 11-09-202303:06 PM
Overview
Purpose of this article is to explain a few ways to access metadata of data objects in Cloudera's Hive & Impala data warehouses.
Before we get into the specific details, let's understand a typical method to access metadata and data objects in Cloudera Data Platform.
Hive Metastore (HMS) database holds themetadataof all Hive & Impala data objects. When user/application is outside the Cloudera cluster, thrift protocol is used to access it. Hive & Impala data warehouses hold all thedata objects. When user/application is outside the Cloudera cluster, Cloudera's JDBC/ODBCdriversare used to access it.
As you know Cloudera excels at Hybrid architecture and there is a consistent method of accessing metadata and data across different deployment models. Let's now understand how it works.
A single HMS database is available per environment (environment refers to CDP's On-Premise or Public Cloud installation), to hold information about all the Hive/Impala objects. In other words, regardless of where Hive/Impala objects are created (ex: CDW Data Service), their metadata is going to be available in the single HMS instance that's associated with the environment.
To access HMS database outside the cluster in both CDP On-Prem and Public Cloud, thrift protocol is used along with the Kerberos authentication mechanism. We must configure cross-realm Kerberos trust between the user/application and Cloudera environment, so they're able to communicate with each other.
Ways to access metadata
Below are a few ways to access metadata of data objects in Cloudera's Hive & Impala data warehouses --
Hive Metastore (HMS)
HMS database holds themetadataof all Hive & Impala data objects.
If you're accessing it outside the Cloudera cluster, use thrift protocol and ensure cross-realm Kerberos trust is configured between the client (user/application) and Cloudera cluster. This is typically done by adding krb5.conf to the client.
If you're accessing it within the Cloudera cluster, use the HMS instance that you setup during CDP On-Premise installation or if you're in Public Cloud then use the instructionshere.
sys database
sysdatabase mirrors HMS database and resides in Hive data warehouse. Hive data warehouse is accessed using Cloudera's JDBC/ODBC drivers. Note that even thoughsysdatabase is only available through Hive data warehouse, it contains the metadata of both Hive & Impala data objects. For reference, see an example query below to retrieve the metadata.
-- See a few details about a subset of tables
select b.name, a.tbl_name, a.owner, a.tbl_type, c.location, a.view_original_text, a.view_expanded_text
from sys.tbls a
join sys.dbs b on b.db_id = a.db_id
join sys.sds c on c.sd_id = a.sd_id
where tbl_name like '%tmp_%';
Tip: keepthis HMS schemahandy when querying the sys database.
Apache Atlas API
Apache Atlasis a governance tool that lets you see the metadata of all data assets, lineage of data and much more.REST APIsare exposed to retrieve the necessary information remotely. There is alsoPython library for Apache Atlasavailable on pypi. Below is a common curl example of accesssing Atlas in Cloudera using REST API:
Note that full Atlas API URL is available in the Endpoints section of the datalake. See below for reference.
Summary
Glad you made it this far. Even with the complexity that comes with the Hybrid Architecture, Cloudera has made it simple to access the metadata of data objects in Cloudera's data warehouses. There are multiple options available as discussed above, that customers & partners can choose from based on their requirements.