
Fundamental question on Hive metastore

New Contributor

Hi,

I am a bit confused about the Hive metastore, especially after reading the document on high availability for the metastore (http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_hadoop-ha/content/ha-hive-use-and-failover.html).

Are the Hive metastore and HCatalog one and the same? If not, when should each be used?

The document talks about setting up the metastore service for HA using the hive.metastore.uris parameter and also configuring the underlying DB for HA. Can someone enlighten me on how things work behind the scenes? What really happens when a comma-separated list of URIs is provided in the hive.metastore.uris parameter?

Why does the underlying DB need to be configured for high availability when the metastore service is already configured for it?

TIA

5 REPLIES

Re: Fundamental question on Hive metastore

Super Guru

@bigdata.neophyte

The Hive metastore is a service that stores metadata about Hive/HCatalog tables. It runs in its own JVM process. The metastore gives clients (Beeline, Pig, etc.) access to the metastore API and persists the metadata to a database. If you enable HA for the metastore, it makes sense to have the DB support HA as well.

All the metadata for Hive tables and partitions is accessed through the Hive metastore. Metadata is persisted using the JPOX ORM solution (DataNucleus), so any database that it supports can be used by Hive. Most commercial relational databases and many open source databases are supported. The Hive metastore administration documentation lists the supported databases.

Re: Fundamental question on Hive metastore

New Contributor

Thanks @Sunile Manjee for your response. Apologies for the delayed response.

I am still confused about this topic. Let me take an example. When I create a Hive table using the CREATE TABLE command, the table's metadata (the table name, column names, their data types, etc.) needs to be stored / persisted somewhere so that Hive can parse the underlying HDFS data using that metadata.

Am I correct in saying the above statement?

If so, is the persistent storage for the metadata the relational database you were referring to above? If so, what is the role of HCatalog? Does it simply provide a mechanism for client applications to "read" the metadata already created in the underlying database? What is the role of the metastore service in all this?

Apologies if I misunderstood something very basic here. TIA

Re: Fundamental question on Hive metastore

Super Guru

@bigdata.neophyte When you issue a CREATE TABLE in Hive, it persists the metadata to the Hive metastore. HCatalog is built on top of the metastore and provides read and write access to tools such as Hive and Pig. Without HCatalog, Hive and Pig each had to maintain their own metadata repository, with no common read and write access to the tables maintained by the other. Now, with HCatalog built on top of the metastore, Hive and Pig access the repository through HCatalog, so a table created via Hive is exposed to Pig through HCatalog. Does that make sense?
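
To make the shared-catalog idea concrete, here is a toy Python model. It is not Hive or HCatalog code: `TableMeta`, `ToyMetastore`, `hive_like_describe`, and `other_engine_schema` are names made up for illustration, and the real metastore persists to a relational database rather than a dictionary. It only shows the shape of the idea: the table definition is recorded once, and different engines look it up instead of keeping their own copies.

```python
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    """Hypothetical, simplified stand-in for what a metastore records per table."""
    name: str
    columns: dict            # column name -> type name
    location: str            # where the data files live
    file_format: str = "TEXTFILE"


@dataclass
class ToyMetastore:
    """In-memory stand-in; a real metastore persists this to a relational DB."""
    tables: dict = field(default_factory=dict)

    def create_table(self, meta: TableMeta) -> None:
        self.tables[meta.name] = meta

    def get_table(self, name: str) -> TableMeta:
        return self.tables[name]


def hive_like_describe(ms: ToyMetastore, table: str) -> list:
    """What a Hive-like client might do: list the columns of a table."""
    meta = ms.get_table(table)
    return [f"{col} {typ}" for col, typ in meta.columns.items()]


def other_engine_schema(ms: ToyMetastore, table: str) -> str:
    """What another engine might do: build its own schema string from the same entry."""
    meta = ms.get_table(table)
    fields = ", ".join(f"{col}:{typ}" for col, typ in meta.columns.items())
    return f"{meta.location} ({fields})"


# The table is defined once...
ms = ToyMetastore()
ms.create_table(TableMeta("visits", {"user_id": "int", "url": "string"}, "/data/visits"))

# ...and both "engines" read the same definition.
print(hive_like_describe(ms, "visits"))
print(other_engine_schema(ms, "visits"))
```

Neither function stores its own copy of the schema; both depend on the single entry created by `create_table`, which is the role the metastore plays for Hive and, through HCatalog, for other tools.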

Re: Fundamental question on Hive metastore

Super Guru

@bigdata.neophyte Hope it is much clearer now. Are you good?

Re: Fundamental question on Hive metastore

Super Guru

Hi @bigdata.neophyte

I think Sunile has explained this well enough. In case you are still confused, I'll try to rephrase it.

First, let's talk about the Hive metastore, which I believe you already understand, judging from your comment on Sunile's answer. When you create tables in Hive, you have to record somewhere the location of the data files, the file format, the table name, the columns, and so on. That place is the Hive metastore. It is backed by a database, usually MySQL (or Postgres or Oracle). Now, why do you need HA for this metastore DB? For the same reason you need HA for anything else: if the MySQL instance holding the Hive metastore goes down, you want to be able to fail over to your standby so your users are not impacted. You also need HA for the metastore service because, even if the DB is working, the metastore service itself can fail, and again you want to fail over to a standby without impacting your users.
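
On the question of what a comma-separated hive.metastore.uris does: as I understand it, the client splits the list and tries the URIs until one accepts the connection, moving on to the next when one fails. The sketch below simulates that in plain Python. It is not the real Hive client logic: `pick_metastore`, the `connect` callback, and the host names are hypothetical, and details such as retry counts, delays, and whether the list is tried in order or shuffled depend on the Hive version and configuration.

```python
def pick_metastore(uris_property, connect, retries=2):
    """Return the first URI that `connect` accepts, scanning the list up to `retries` times.

    uris_property: comma-separated string, in the style of hive.metastore.uris
    connect: hypothetical callable that returns True if the URI is reachable
    """
    uris = [u.strip() for u in uris_property.split(",") if u.strip()]
    for _attempt in range(retries):
        for uri in uris:
            if connect(uri):
                return uri
    raise ConnectionError(f"no metastore reachable among {uris}")


# Simulated cluster state: the first metastore is down, the second is up.
up = {"thrift://ms2.example.com:9083"}
uris = "thrift://ms1.example.com:9083, thrift://ms2.example.com:9083"

chosen = pick_metastore(uris, connect=lambda uri: uri in up)
print(chosen)
```

Every metastore instance in the list still talks to the same backing database. Failing over between metastore services does not help if the database behind all of them is down, which is why the DB needs its own HA.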

Now let's talk about HCatalog. When Hive was created, you could run HiveQL, which is pretty much SQL, on top of your tabular data in Hadoop. This is great, but it is not where all the power of Hadoop lies. One of the most significant differences between Hadoop and traditional platforms is its ability to run different engines to process your data. So for your tabular/structured data in Hadoop, you can not only create Hive tables and run SQL queries, but also read the same data in your MapReduce jobs or Pig scripts. But how would you do that without HCatalog? You would have to write custom MapReduce jobs and custom Pig scripts to read the table structure from the Hive metastore. That is what most people did before HCatalog. With HCatalog, those engines have access to the same information that is in the Hive metastore, so they can quickly and easily read Hive tables without custom jobs. Check slide 4 of the following link and see how Hive can go directly to the Hive metastore while other services need some way to talk to it. That way is HCatalog.

http://www.slideshare.net/Hadoop_Summit/future-of-hcatalog
