
When to use Hive and HBase?

Contributor

When should you use Hive or HBase?

With this sample dataset: sample.txt

Header Value: cluster_num, test_num, part_id, param_id, site, result, standard dev, anomaly

6 REPLIES

Re: When to use Hive and HBase?

@mel mendoza

Hive should be used for analytical querying of data collected over a period of time - for instance, to compute trends or summarize website logs. Hive should not be used for real-time querying, since it can take a while before any results are returned.
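
For instance, a trend query over your sample data might look like the following. This is a minimal sketch, assuming the file has been loaded into a hypothetical Hive table called test_results with columns named after the sample.txt header (with anomaly as a numeric 0/1 flag); the JDBC URL and user are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveTrendQuery {
        public static void main(String[] args) throws Exception {
            // Standard HiveServer2 JDBC driver; URL and user are placeholders.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver-host:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                // Batch-style aggregation per cluster -- the kind of query
                // Hive is good at, and exactly the kind that is too slow
                // to serve interactively.
                ResultSet rs = stmt.executeQuery(
                    "SELECT cluster_num, AVG(result) AS avg_result, "
                    + "SUM(anomaly) AS anomalies "
                    + "FROM test_results GROUP BY cluster_num");
                while (rs.next()) {
                    System.out.printf("cluster=%d avg=%.3f anomalies=%d%n",
                        rs.getInt(1), rs.getDouble(2), rs.getLong(3));
                }
            }
        }
    }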

HBase is perfect for real-time querying of Big Data (see the point-read sketch after this list). HBase should be used when:

  • There is a large amount of data.
  • ACID properties are not mandatory but just required.
  • Data model schema is sparse.
  • Your applications need to scale gracefully.
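
By contrast, a real-time HBase read is a point lookup by row key. A minimal sketch, assuming a hypothetical table test_results with a single column family d and a row key of the form cluster_num|test_num|part_id (the key design is discussed further down the thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePointRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("test_results"))) {
                // Row key format is an assumption: cluster_num|test_num|part_id.
                Get get = new Get(Bytes.toBytes("42|7|P100"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("d"),
                                               Bytes.toBytes("result"));
                System.out.println("result = " + Bytes.toString(value));
            }
        }
    }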

Re: When to use Hive and HBase?

New Contributor

@mel mendoza I do not understand your statement about ACID properties. Are they required or not in the use case you're writing about? Thanks in advance for clarifying.

Re: When to use Hive and HBase?

Contributor

@Sridhar Reddy Since my requirement is real-time querying, I should use HBase. Thanks!

Re: When to use Hive and HBase?

Contributor

@Frank Welsch - that file was aggregated by Spark, and based on our requirements, real-time querying is needed. All the values inside the file are required.


Re: When to use Hive and HBase?

Contributor

The next question is how to model that kind of data in HBase.
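
One possible model, offered as a sketch rather than a definitive design: since real-time lookups are the requirement, put the fields you look rows up by into a composite row key (here cluster_num|test_num|part_id, an assumption about the access pattern) and store the remaining header fields as columns in a single family:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LoadSampleRow {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("test_results"))) {
                byte[] family = Bytes.toBytes("d");
                // One line of sample.txt becomes one row; values are examples.
                Put put = new Put(Bytes.toBytes("42|7|P100"));
                put.addColumn(family, Bytes.toBytes("param_id"), Bytes.toBytes("temp"));
                put.addColumn(family, Bytes.toBytes("site"), Bytes.toBytes("3"));
                put.addColumn(family, Bytes.toBytes("result"), Bytes.toBytes("0.97"));
                put.addColumn(family, Bytes.toBytes("standard_dev"), Bytes.toBytes("0.02"));
                put.addColumn(family, Bytes.toBytes("anomaly"), Bytes.toBytes("0"));
                table.put(put);
            }
        }
    }

Note that the leftmost key component determines HBase's physical sort order, so rows for one cluster are contiguous (good for scans), but heavy sequential writes to one cluster can hotspot a single region; and if param_id is part of how you look rows up, it belongs in the key as well.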

Re: When to use Hive and HBase?

New Contributor

Hi,

MapReduce is a computing framework; HBase has nothing to do with it. That said, you can efficiently fetch data from HBase by writing MapReduce jobs. Alternatively, you can write sequential programs that use the HBase client APIs, such as the Java API, to put or fetch data. But we use Hadoop, HBase, etc. to deal with gigantic amounts of data, so plain sequential programs don't make much sense there: they would be highly inefficient when your data is that huge.
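
As a sketch of the "MapReduce over HBase" point, here is a hedged example that scans the hypothetical test_results table from earlier in the thread and counts rows per cluster; TableMapReduceUtil wires the HBase table in as the job's input. Table name and row-key format are assumptions.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

    public class CountRowsPerCluster {

        // Mapper reads HBase rows; row key assumed to be cluster|test|part.
        static class ClusterMapper extends TableMapper<Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            @Override
            protected void map(ImmutableBytesWritable rowKey, Result row,
                               Context ctx) throws IOException, InterruptedException {
                String cluster = Bytes.toString(rowKey.get()).split("\\|")[0];
                ctx.write(new Text(cluster), ONE);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "rows-per-cluster");
            job.setJarByClass(CountRowsPerCluster.class);

            Scan scan = new Scan();
            scan.setCaching(500);        // more rows per RPC for scan throughput
            scan.setCacheBlocks(false);  // keep MR scans out of the block cache

            TableMapReduceUtil.initTableMapperJob(
                "test_results", scan, ClusterMapper.class,
                Text.class, LongWritable.class, job);
            job.setReducerClass(LongSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }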

Coming back to the first part of your question, Hadoop is basically two things: a distributed file system (HDFS) plus a computation or processing framework (MapReduce). Like any other FS, HDFS provides storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). But, being a FS, HDFS lacks random read and write access. This is where HBase comes into the picture. It is a distributed, scalable big data store, modelled after Google's BigTable, and it stores data as key/value pairs.

Coming to Hive: it provides data warehousing facilities on top of an existing Hadoop cluster. Along with that, it provides an SQL-like interface, which makes your work easier in case you are coming from an SQL background. You can create tables in Hive and store data there. You can even map your existing HBase tables to Hive and operate on them.
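
That HBase-to-Hive mapping is done with Hive's HBaseStorageHandler. A sketch, reusing the hypothetical test_results table and d column family from earlier in the thread (all names are assumptions):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MapHBaseTableIntoHive {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver-host:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                // External table: Hive reads the existing HBase rows in place.
                stmt.execute(
                    "CREATE EXTERNAL TABLE hbase_test_results ("
                    + "  rowkey STRING, result DOUBLE, anomaly INT) "
                    + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
                    + "WITH SERDEPROPERTIES ("
                    + "  'hbase.columns.mapping' = ':key,d:result,d:anomaly') "
                    + "TBLPROPERTIES ('hbase.table.name' = 'test_results')");
            }
        }
    }

After this, the HBase rows can be queried with ordinary HiveQL (SELECT ... FROM hbase_test_results), though such queries still run as batch jobs, not real-time lookups.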

Pig is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig has two parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them using the Pig interpreter. Pig makes our lives a lot easier, since writing MapReduce jobs by hand is not always easy; in some cases it can really become a pain.
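
For a feel of the difference, here is a small sketch of that flow driven from Java through the embedded PigServer API; the input path, schema, and output path are assumptions based on the sample.txt header, and the same Pig Latin could be run directly with the pig command.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigAnomalyReport {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            // Pig Latin: load the sample, keep anomalous rows, count per cluster.
            pig.registerQuery("raw = LOAD '/data/sample.txt' USING PigStorage(',') "
                + "AS (cluster_num:int, test_num:int, part_id:chararray, "
                + "param_id:chararray, site:int, result:double, "
                + "standard_dev:double, anomaly:int);");
            pig.registerQuery("bad = FILTER raw BY anomaly == 1;");
            pig.registerQuery("per_cluster = GROUP bad BY cluster_num;");
            pig.registerQuery("counts = FOREACH per_cluster GENERATE group, COUNT(bad);");
            pig.store("counts", "/data/anomalies_per_cluster");
        }
    }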

Thanks

Hari