Purpose for hive streaming

With dynamic partitions and Hive ACID now GA (HDP 2.5), what is the use case or purpose of Hive streaming?

I see the following in the Apache Hive documentation:

Traditionally adding new data into Hive requires gathering a large amount of data onto HDFS and then periodically adding a new partition. This is essentially a “batch insertion”. Insertion of new data into an existing partition is not permitted.  Hive Streaming API allows data to be pumped continuously into Hive.

I am not sure I follow that. Technically, when data is generated by an application, issuing an INSERT statement against Hive is already "pumping" data into a Hive table. I must be missing something.

Re: Purpose for hive streaming

One use case for Hive streaming is to provide a storage mechanism for streaming applications, e.g. Storm, when you are parsing tweets or real-time click data. You could do this in two ways. One is to collect the data in a staging area, dump it to HDFS, and then run Hive/Pig batch operations on it. But in some cases such an architecture is not fast enough.

For example, say you have a system that counts tweets by NFL team. A good design could be to use Storm to consume the tweets and, in a bolt down the line, parse the team name, increment a counter in a NoSQL DB, and persist the original tweet for batch jobs. The advantage is that you have a real-time system that can give you updated tweet counts: there is no query to process or job to run, just a lookup.

For the persistence layer, you could use a DB and then run nightly jobs that move the data to Hive/HDFS. But if the end destination is HDFS, why keep it in a DB? Wouldn't it be nice to write it to HDFS directly as the data streams in? That is where I think the best use case for Hive streaming is. I am not sure how well it performs, as I have seen people warning about that.
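
To make the design above a bit more concrete, here is a rough Python sketch of what such a bolt might do. It is not real Storm or Hive code: `TweetBolt`, `BatchingSink`, and the `TEAMS` list are hypothetical names invented for illustration, the in-memory `Counter` stands in for the NoSQL store, and the sink only imitates the idea of committing rows in small batches as they arrive instead of waiting for a nightly load.

```python
import re
from collections import Counter

# Hypothetical team list; a real system would load this from configuration.
TEAMS = ("Packers", "Patriots", "Seahawks")
TEAM_PATTERN = re.compile("|".join(TEAMS), re.IGNORECASE)


class BatchingSink:
    """In-memory stand-in for a streaming writer (assumption, not a Hive API).

    Rows are buffered and committed in small batches as they arrive.
    """

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []    # rows written but not yet committed
        self.committed = []  # committed batches, standing in for the table

    def write(self, row):
        self.pending.append(row)
        if len(self.pending) >= self.batch_size:
            self.commit()

    def commit(self):
        # Only record a batch if there is something to commit.
        if self.pending:
            self.committed.append(list(self.pending))
            self.pending.clear()


class TweetBolt:
    """Hypothetical bolt: updates per-team counters and persists the raw tweet."""

    def __init__(self, sink):
        self.counts = Counter()  # stand-in for the NoSQL counter store
        self.sink = sink

    def process(self, tweet):
        # Count each team at most once per tweet, ignoring case.
        mentioned = {m.group(0).lower() for m in TEAM_PATTERN.finditer(tweet)}
        for team in mentioned:
            self.counts[team] += 1
        # Persist the original tweet for later batch jobs.
        self.sink.write(tweet)


def run_demo():
    sink = BatchingSink(batch_size=2)
    bolt = TweetBolt(sink)
    tweets = ["Go Packers!", "Patriots win again", "packers vs Seahawks tonight"]
    for tweet in tweets:
        bolt.process(tweet)
    sink.commit()  # flush whatever is left in the buffer
    return bolt.counts, sink.committed
```

Reading a count is then just a lookup such as `counts["packers"]`, while the raw tweets land in the sink in small committed batches. In a real deployment the sink would be replaced by whatever streaming writer you actually use, and its batching and performance characteristics would need to be measured rather than assumed.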
