How the Hive metastore works
- Labels: Apache Hadoop, Apache Hive
Created ‎03-20-2016 04:19 PM
I have some Hive tables with data, and now I want to use Spark SQL to query that data.
But I don't understand the role of one Hive component in this process: the Hive Metastore.
The Hive Metastore stores all the information about the tables, and we can execute Spark SQL queries because Spark can interact with the Hive Metastore.
But how does that work? Is it automatic? I already have the Hive tables with data; to execute Spark SQL queries, do I need to create the Hive Metastore, or is it automatic? Do we need to do something?
I'm relatively new to Hive and I don't understand this concept well in this scenario.
Thanks!!
Created ‎03-20-2016 04:39 PM
Hi @John Cod,
a good starting point for diving into Hive is: https://cwiki.apache.org/confluence/display/Hive/Home.
If you install your cluster including Hive, the Hive Metastore will be installed as well... it is more or less "the brain" of Hive.
HTH, Gerd
Created ‎03-20-2016 05:39 PM
Thanks for your help. I already read that documentation, but I still have a doubt, because that link talks more about metastore configuration and my doubt is conceptual. Maybe I didn't explain my doubt well: we need the Hive Metastore to run queries with Spark, because Spark will use that metastore to execute the queries. But do we need to configure that metastore? I already have the tables, and the data in those tables, in Hive, and I did nothing about the metastore. Do I need to? My doubt is really this: I don't understand how this communication between Spark and Hive through the metastore works, and what we need to do that is not automatic...
Created ‎03-20-2016 05:46 PM
Here's a document that explains the integration: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_spark-guide/content/ch_accessing-hive-tab...
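In case that link moves again, the rough shape of the integration (a sketch, not the official doc) is that Spark picks up the metastore connection details from a hive-site.xml on its classpath (on HDP it is typically placed in Spark's conf directory) and then exposes the Hive catalog through a HiveContext:

```python
# Rough sketch (Spark 1.x, as shipped with HDP 2.4): assumes hive-site.xml,
# which points at the Hive Metastore, is present in Spark's conf directory.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-metastore-check")
sqlContext = HiveContext(sc)  # talks to the metastore described in hive-site.xml

# If the integration is wired up, the existing Hive tables are already visible:
sqlContext.sql("SHOW TABLES").show()
```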
Created ‎10-24-2018 10:24 AM
Link is down
Created ‎03-20-2016 06:12 PM
@John Cod The Hive Metastore, also referred to as HCatalog, is a relational database repository containing metadata about the objects you create in Hive. When you create a Hive table, the table definition (column names, data types, comments, etc.) is stored in the Hive Metastore. This is automatic and simply part of the Hive architecture. The Hive Metastore is critical because it acts as a central schema repository that can be used by other access tools like Spark and Pig. Additionally, through HiveServer2 you can access the Hive Metastore over ODBC and JDBC connections, which opens the schema to visualization tools like Power BI or Tableau.
The only configuration you have to be concerned with is at the initial install, when you decide which relational database to use. The default is PostgreSQL, but for production we recommend Oracle or a system that is already being backed up and secured. Hope this helps.
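To make the "central schema repository" point concrete, here is a rough pyspark sketch of what this looks like from the Spark side (the table name default.my_table is just a placeholder, not something from this thread):

```python
# Sketch only: assumes the cluster's hive-site.xml is on Spark's classpath and
# that a Hive table (the hypothetical default.my_table) already exists.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="query-existing-hive-table")
sqlContext = HiveContext(sc)

# Spark asks the Hive Metastore for the table definition (columns, types,
# storage format, HDFS location) and then reads the data files itself:
df = sqlContext.sql("SELECT * FROM default.my_table LIMIT 10")
df.printSchema()  # the schema comes from the metastore, not from the data files
df.show()
```

(In Spark 2.x and later, the equivalent entry point is a SparkSession built with enableHiveSupport().)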
Created ‎03-20-2016 10:52 PM
Thank you, now it's much clearer!
Created ‎03-28-2016 12:15 AM
I like to think of the metastore as the definition of the structure you want to impose on the unstructured data that lives on HDFS: not only column names and types, but also where rows and columns start and end in the format being queried.
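As a rough illustration of that (the table name, columns, and path below are made up), a CREATE TABLE statement like this is exactly the kind of structure that ends up in the metastore; the files on HDFS stay plain text, and the metastore remembers how to slice them into rows and columns:

```python
# Sketch: issuing Hive DDL through Spark. Everything here except the raw files
# themselves (columns, delimiter, storage format, location) is recorded in the
# Hive Metastore, not in the data.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-ddl-example")
sqlContext = HiveContext(sc)

sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        ts      STRING,
        user_id STRING,
        url     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/raw/web_logs'
""")
```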
Created ‎07-20-2017 04:15 AM
Thanks for the information. As you said, the Hive Metastore is used by other tools like Spark and Pig; does that mean that without Hive we can't use Spark and Pig to access the data?
Created ‎10-24-2018 10:26 AM
Is this further documented somewhere?
