Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Apache Hive - What kind of activities are normal in the Hive

avatar
New Member

Hi, I've been working with Hadoop and testing a lot of the components of its ecosystem. Now I'm doing a small project that consists of two phases: a) data cleansing and b) KPI definition. I already do step a) in Apache Pig, then load the data into Apache Hive. As in all the other projects I work on, I only see Apache Hive as a data repository: I just use Hive to load the data after the cleansing step and then use it as a regular data source, nothing more.

Since I'm very new to the Big Data/Hadoop world, I would like to know what kinds of jobs/activities are normal to do using Apache Hive. Sorry for the ignorance :) Thanks!

1 ACCEPTED SOLUTION

avatar
Super Guru

@Johnny Fugers

In many scenarios, Hive is used much like an RDBMS, but with better scalability and flexibility. Hive scales to petabytes of data, which is difficult for a typical RDBMS.

One of the big benefits Hive provides is a low barrier to entry for end users: they can use standard SQL to interact with the data. One of the most common use cases is to off-load many of the data-processing tasks done in a typical RDBMS and run them in Hive instead. This frees up resources on those systems for more time-sensitive tasks.
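As a concrete sketch of that off-loading pattern, a heavy nightly aggregation that might otherwise run on the source RDBMS can be expressed in plain HiveQL. The table and column names below are hypothetical, for illustration only:

```sql
-- Hypothetical: assumes a 'web_logs' table already loaded into Hive
-- (e.g. after a Pig data-cleansing step).
CREATE TABLE IF NOT EXISTS daily_page_views (
  view_date  DATE,
  page       STRING,
  views      BIGINT
)
STORED AS ORC;

-- Run the expensive aggregation in Hive instead of the source RDBMS.
INSERT OVERWRITE TABLE daily_page_views
SELECT to_date(event_time) AS view_date,
       page,
       COUNT(*)            AS views
FROM web_logs
GROUP BY to_date(event_time), page;
```

The source system then only serves the small summary table, while the full-table scan and grouping happen on the Hadoop cluster.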

View solution in original post

2 REPLIES 2


avatar
Master Guru

@Johnny Fugers

Hive is great for typical BI queries, and it scales very well. When you get into the area of updates, I would rather do those in Phoenix and serve the end results back to Hive for BI queries. Hive ACID is coming soon; until it is available, I would use the Phoenix -> Hive route. Use Pig for ETL.

Where it gets interesting is using an MPP database on Hadoop; that is where HAWQ comes in. It is a good low-latency database engine that gives you some of the benefits of both Hive and Phoenix. It does not cover all of Hive's and Phoenix's capabilities, but I would say it is a good happy medium. I hope that helps.

As you go further in your journey you will start to ask questions about security and governance. For security you will start with Ranger and Knox, and for governance with Falcon/Atlas/Ranger.
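To make the "BI queries over data loaded from Pig" workflow concrete, here is a hypothetical sketch in HiveQL (the table, columns, and HDFS path are made up for illustration):

```sql
-- Hypothetical: expose cleansed data written by Pig as an external table.
-- Partitioning by date keeps BI queries scanning only the range they need.
CREATE EXTERNAL TABLE IF NOT EXISTS sales (
  order_id   BIGINT,
  region     STRING,
  amount     DOUBLE
)
PARTITIONED BY (order_date DATE)
STORED AS ORC
LOCATION '/data/cleansed/sales';

-- Example KPI query: revenue per region for one quarter.
SELECT region,
       SUM(amount) AS revenue
FROM sales
WHERE order_date BETWEEN '2016-01-01' AND '2016-03-31'
GROUP BY region
ORDER BY revenue DESC;
```

This is the read-heavy, scan-and-aggregate style of workload Hive is good at; frequent row-level updates would be the part served better by Phoenix in this setup.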