Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Apache Hive - What kind of activities are normal in the Hive

avatar
New Member

Hi, I've been working with Hadoop and testing a lot of the components of its ecosystem. Now I'm doing a small project that consists of two phases: a) data cleansing and b) KPI definition. I already do step a) in Apache Pig, then load the data into Apache Hive. As in all the other projects I work on, I only see Apache Hive as a data repository: I just use Hive to load the data after the cleansing step and then use it as a regular data source, nothing more.

Since I'm very new to the Big Data/Hadoop world, I would like to know what kinds of jobs/activities are normal to do using Apache Hive. Sorry for the ignorance :) Thanks!

1 ACCEPTED SOLUTION

avatar
Super Guru

@Johnny Fugers

In many scenarios, Hive is used much like an RDBMS, but with better scalability and flexibility. Hive scales to petabytes of data, which is difficult for a typical RDBMS.

One of the big benefits Hive provides is a low barrier to entry for end users: they can use standard SQL to interact with the data. One of the most common use cases is to off-load many of the data-processing tasks done in a typical RDBMS and run them in Hive instead. This frees up resources on those systems for more time-sensitive tasks.
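As a concrete sketch of that off-loading pattern, a heavy nightly aggregation that might otherwise run on the source RDBMS can be expressed in plain HiveQL. The table and column names below are hypothetical, for illustration only:

```sql
-- Hypothetical: assumes a 'web_logs' table already loaded into Hive
-- (e.g. after a Pig data-cleansing step).
CREATE TABLE IF NOT EXISTS daily_page_views (
  view_date  DATE,
  page       STRING,
  views      BIGINT
)
STORED AS ORC;

-- Run the expensive aggregation in Hive instead of the source RDBMS.
INSERT OVERWRITE TABLE daily_page_views
SELECT to_date(event_time) AS view_date,
       page,
       COUNT(*)            AS views
FROM web_logs
GROUP BY to_date(event_time), page;
```

The source system then only serves the small summary table, while the full-table scan and grouping happen on the Hadoop cluster.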

View solution in original post

2 REPLIES 2


avatar
Master Guru

@Johnny Fugers

Hive is great for typical BI queries, and it scales very well. When you get into the area of updates, I would rather do those in Phoenix and serve the end results back to Hive for BI queries. Hive ACID is coming soon; until it is available, I would use the Phoenix -> Hive route. Use Pig for ETL.

Where it gets interesting is using an MPP database on Hadoop; that is where HAWQ comes in. It is a good low-latency database engine that gives you some of the benefits of both Hive and Phoenix. It does not cover all of Hive's and Phoenix's capabilities, but I would say it is a good happy medium. I hope that helps.

As you go further in your journey you will start to ask questions about security and governance. For security you will start with Ranger and Knox, and for governance with Falcon/Atlas/Ranger.
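To make the "BI queries over data loaded from Pig" workflow concrete, here is a hypothetical sketch in HiveQL (the table, columns, and HDFS path are made up for illustration):

```sql
-- Hypothetical: expose cleansed data written by Pig as an external table.
-- Partitioning by date keeps BI queries scanning only the range they need.
CREATE EXTERNAL TABLE IF NOT EXISTS sales (
  order_id   BIGINT,
  region     STRING,
  amount     DOUBLE
)
PARTITIONED BY (order_date DATE)
STORED AS ORC
LOCATION '/data/cleansed/sales';

-- Example KPI query: revenue per region for one quarter.
SELECT region,
       SUM(amount) AS revenue
FROM sales
WHERE order_date BETWEEN '2016-01-01' AND '2016-03-31'
GROUP BY region
ORDER BY revenue DESC;
```

This is the read-heavy, scan-and-aggregate style of workload Hive is good at; frequent row-level updates would be the part served better by Phoenix in this setup.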