How the Hive Metastore works

I have some Hive tables with data, and now I want to query that data with Spark SQL.

But I don't understand the role of one Hive component in this process: the Hive Metastore.

The Hive Metastore stores all the information about the tables, and we can run Spark SQL queries because Spark can interact with the Hive Metastore.

But how does that work? Is it automatic? I already have the Hive tables with data, so to run Spark SQL queries do I need to create the Hive Metastore myself, or does it already exist? Do we need to do anything?

I'm relatively new to Hive and I don't fully understand this concept in this scenario.

Thanks!!

1 ACCEPTED SOLUTION

@John Cod The Hive Metastore (sometimes loosely called HCatalog, which is strictly the table-management layer built on top of it) is a repository, backed by a relational database, that holds metadata about the objects you create in Hive. When you create a Hive table, the table definition (column names, data types, comments, etc.) is stored in the Hive Metastore. This is automatic and simply part of the Hive architecture. The Hive Metastore is critical because it acts as a central schema repository that other access tools like Spark and Pig can use. Additionally, through HiveServer2 you can reach the schemas in the Hive Metastore over ODBC and JDBC connections, which opens them to visualization tools like Power BI or Tableau.

The only configuration you need to be concerned about is at initial install, when you decide which relational database to use. The default is PostgreSQL, but for production we recommend Oracle or another system that is already being backed up and secured. Hope this helps.
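
If it helps to picture it, here is a toy sketch of the idea in Python, using an in-memory SQLite database as a stand-in for the metastore's relational database. The table names (`tbls`, `cols`) and the helpers `create_table` and `describe_table` are made up for illustration; this is not Hive's real schema or API. It only shows that a table definition is written once at create time and can then be read back by any client.

```python
import sqlite3

# Toy stand-in for the metastore's backing relational database.
# Table and column names are hypothetical, not Hive's real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbls (tbl_id INTEGER PRIMARY KEY, db_name TEXT,
                       tbl_name TEXT, location TEXT);
    CREATE TABLE cols (tbl_id INTEGER, position INTEGER, col_name TEXT,
                       col_type TEXT, comment TEXT);
""")


def create_table(db_name, tbl_name, location, columns):
    """Record a table definition, as happens automatically on CREATE TABLE."""
    cur = conn.execute(
        "INSERT INTO tbls (db_name, tbl_name, location) VALUES (?, ?, ?)",
        (db_name, tbl_name, location),
    )
    tbl_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO cols VALUES (?, ?, ?, ?, ?)",
        [(tbl_id, pos, name, ctype, comment)
         for pos, (name, ctype, comment) in enumerate(columns)],
    )
    conn.commit()


def describe_table(db_name, tbl_name):
    """What any client asks for: the schema plus where the data lives."""
    row = conn.execute(
        "SELECT tbl_id, location FROM tbls WHERE db_name = ? AND tbl_name = ?",
        (db_name, tbl_name),
    ).fetchone()
    if row is None:
        raise KeyError(f"{db_name}.{tbl_name} not found")
    tbl_id, location = row
    columns = conn.execute(
        "SELECT col_name, col_type FROM cols WHERE tbl_id = ? ORDER BY position",
        (tbl_id,),
    ).fetchall()
    return {"location": location, "columns": columns}


# Example table; the name and path are invented for this sketch.
create_table(
    "default", "sales", "hdfs:///warehouse/sales",
    [("id", "int", "order id"), ("amount", "double", "order total")],
)
print(describe_table("default", "sales"))
```

On the Spark side, the usual way to point a session at an existing metastore is to build it with `SparkSession.builder.enableHiveSupport()` and make the cluster's `hive-site.xml` visible to Spark. After that, a query such as `spark.sql("SELECT * FROM sales")` resolves the table through the metastore in much the same way `describe_table` does here.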

10 REPLIES

@John Cod

As mentioned above, the Hive Metastore holds the table metadata: columns, data types, compression, input and output formats, and much more, including the HDFS location of each table and database. With that information, any tool or service that connects to Hive can then call the NameNode to get the file-level metadata (files, directories, the corresponding blocks, and so on) that the jobs launched by Hive need.
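
A rough way to see those two steps side by side is the Python sketch below. Everything in it is a stand-in I made up for illustration: a dict plays the metastore, a local temp directory plays HDFS, and `os.scandir` plays the NameNode's file listing. The helper `plan_scan` is hypothetical, not a Hive or Spark call.

```python
import os
import tempfile

# Hypothetical stand-ins: a local temp directory plays the HDFS warehouse.
warehouse = tempfile.mkdtemp()
table_dir = os.path.join(warehouse, "sales")
os.makedirs(table_dir)
for name in ("part-00000", "part-00001"):
    with open(os.path.join(table_dir, name), "w") as f:
        f.write("1,9.99\n")

# A dict plays the metastore: table definition plus the data location.
metastore = {
    ("default", "sales"): {
        "columns": [("id", "int"), ("amount", "double")],
        "input_format": "text",
        "location": table_dir,
    }
}


def plan_scan(db_name, tbl_name):
    """Step 1: ask the metastore for the table definition.
    Step 2: ask the file system which files sit under its location."""
    table = metastore[(db_name, tbl_name)]
    files = sorted(
        (entry.name, entry.stat().st_size)
        for entry in os.scandir(table["location"])
        if entry.is_file()
    )
    return {"columns": table["columns"], "files": files}


plan = plan_scan("default", "sales")
print(plan["columns"])
print(plan["files"])
```

The point of the split is that the metastore only knows what the table looks like and where it lives; which files and blocks actually exist under that location is answered separately by the file system.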